Search | arXiv e-print repository

Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

Authors: Chao Wen, Jacqueline Staub, Adish Singla

Abstract: Large language and multimodal models have shown remarkable successes on various benchmarks focused on specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment. The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment, each requiring a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution. We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models, and provide an in-depth analysis of the models' expertise across different skill dimensions. We will publicly release the benchmark for future research on program synthesis in visual programming.

Submitted 17 June, 2024; originally announced June 2024.
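
Note: as a quick sanity check on the percentages quoted in the abstract above, the following sketch converts solved-task counts into success rates for an 85-task benchmark. The counts used here (17 and 2) are inferred from the reported percentages and are an assumption; the listing itself does not state them.

```python
TOTAL_TASKS = 85  # benchmark size stated in the abstract


def success_rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Return the success rate as a percentage, rounded to two decimals."""
    return round(100.0 * solved / total, 2)


# Hypothetical solved-task counts, inferred from the reported percentages.
assumed_solved = {"GPT-4V": 17, "Llama3-70B": 2}

rates = {model: success_rate(n) for model, n in assumed_solved.items()}
print(rates)
```

Under these assumed counts the computed rates match the 20% and 2.35% figures quoted in the abstract.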

arXiv:2405.18291 [pdf, other]

FedSAC: Dynamic Submodel Allocation for Collaborative Fairness in Federated Learning

Authors: Zihui Wang, Zheng Wang, Lingjuan Lyu, Zhaopeng Peng, Zhicheng Yang, Chenglu Wen, Rongshan Yu, Cheng Wang, Xiaoliang Fan

Abstract: Collaborative fairness stands as an essential element in federated learning to encourage client participation by equitably distributing rewards based on individual contributions. Existing methods primarily focus on adjusting gradient allocations among clients to achieve collaborative fairness. However, they frequently overlook crucial factors such as maintaining consistency across local models and catering to the diverse requirements of high-contributing clients. This oversight inevitably decreases both fairness and model accuracy in practice. To address these issues, we propose FedSAC, a novel Federated learning framework with dynamic Submodel Allocation for Collaborative fairness, backed by a theoretical convergence guarantee. First, we present the concept of "bounded collaborative fairness (BCF)", which ensures fairness by tailoring rewards to individual clients based on their contributions. Second, to implement the BCF, we design a submodel allocation module with a theoretical guarantee of fairness. This module incentivizes high-contributing clients with high-performance submodels containing a diverse range of crucial neurons, thereby preserving consistency across local models. Third, we further develop a dynamic aggregation module to adaptively aggregate submodels, ensuring the equitable treatment of low-frequency neurons and consequently enhancing overall model accuracy. Extensive experiments conducted on three public benchmarks demonstrate that FedSAC outperforms all baseline methods in both fairness and model accuracy. We see this work as a significant step towards incentivizing broader client participation in federated learning. The source code is available at https://github.com/wangzihuixmu/FedSAC.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted by KDD'24

arXiv:2405.13403 [pdf, other]

Adaptive Wireless Image Semantic Transmission and Over-The-Air Testing

Authors: Jiarun Ding, Peiwen Jiang, Chao-Kai Wen, Shi Jin

Abstract: Semantic communication has undergone considerable evolution due to the recent rapid development of artificial intelligence (AI), significantly enhancing both communication robustness and efficiency. Despite these advancements, most current semantic communication methods for image transmission pay little attention to the differing importance of objects and backgrounds in images. To address this issue, we propose a novel scheme named ASCViT-JSCC, which utilizes vision transformers (ViTs) integrated with an orthogonal frequency division multiplexing (OFDM) system. This scheme adaptively allocates bandwidth for objects and backgrounds in images according to the importance order of different parts, determined by You Only Look Once version 5 (YOLOv5) object detection and scale-invariant feature transform (SIFT) feature-point detection. Furthermore, the proposed scheme adheres to digital modulation standards by incorporating quantization modules. We validate this approach through an over-the-air (OTA) testbed named intelligent communication prototype validation platform (ICP) based on a software-defined radio (SDR) and NVIDIA embedded kits. Our findings from both simulations and practical measurements show that ASCViT-JSCC significantly preserves objects in images and enhances reconstruction quality compared to existing methods.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.02173 [pdf, other]

Task Synthesis for Elementary Visual Programming in XLogoOnline Environment

Authors: Chao Wen, Ahana Ghosh, Jacqueline Staub, Adish Singla

Abstract: In recent years, the XLogoOnline programming platform has gained popularity among novice learners. It integrates the Logo programming language with visual programming, providing a visual interface for learning computing concepts. However, XLogoOnline offers only a limited set of tasks, which are inadequate for learners to master the computing concepts that require sufficient practice. To address this, we introduce XLogoSyn, a novel technique for synthesizing high-quality tasks for varying difficulty levels. Given a reference task, XLogoSyn can generate practice tasks at varying difficulty levels that cater to the varied needs and abilities of different learners. XLogoSyn achieves this by combining symbolic execution and constraint satisfaction techniques. Our expert study demonstrates the effectiveness of XLogoSyn. We have also deployed synthesized practice tasks into XLogoOnline, highlighting the educational benefits of these synthesized practice tasks.

Submitted 3 May, 2024; originally announced May 2024.

Comments: Accepted as a paper at the AIED'24 conference in the late-breaking results track

arXiv:2404.19134 [pdf, other]

Evaluating Deep Clustering Algorithms on Non-Categorical 3D CAD Models

Authors: Siyuan Xiang, Chin Tseng, Congcong Wen, Deshana Desai, Yifeng Kou, Binil Starly, Daniele Panozzo, Chen Feng

Abstract: We introduce the first work on benchmarking and evaluating deep clustering algorithms on large-scale non-categorical 3D CAD models. We first propose a workflow to allow expert mechanical engineers to efficiently annotate 252,648 carefully sampled pairwise CAD model similarities, from a subset of the ABC dataset with 22,968 shapes. Using seven baseline deep clustering methods, we then investigate the fundamental challenges of evaluating clustering methods for non-categorical data. Based on these challenges, we propose a novel and viable ensemble-based clustering comparison approach. This work is the first to directly target the underexplored area of deep clustering algorithms for 3D shapes, and we believe it will be an important building block to analyze and utilize the massive 3D shape collections that are starting to appear in deep geometric computing.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.16493 [pdf, other]

Commonsense Prototype for Outdoor Unsupervised 3D Object Detection

Authors: Hai Wu, Shijia Zhao, Xun Huang, Chenglu Wen, Xin Li, Cheng Wang

Abstract: The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs Commonsense Prototype (CProto) characterized by high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at https://github.com/hailanyi/CPD.

Submitted 26 June, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.15131 [pdf, other]

Optimizing Multi-Touch Textile and Tactile Skin Sensing Through Circuit Parameter Estimation

Authors: Bo Ying Su, Yuchen Wu, Chengtao Wen, Changliu Liu

Abstract: Tactile and textile skin technologies have become increasingly important for enhancing human-robot interaction and allowing robots to adapt to different environments. Despite notable advancements, there are ongoing challenges in skin signal processing, particularly in achieving both accuracy and speed in dynamic touch sensing. This paper introduces a new framework that poses the touch sensing problem as an estimation problem of resistive sensory arrays. Utilizing a Regularized Least Squares objective function that estimates the resistance distribution of the skin, we enhance the touch sensing accuracy and mitigate the ghosting effects, where false or misleading touches may be registered. Furthermore, our study presents a streamlined skin design that simplifies manufacturing processes without sacrificing performance. Experimental outcomes substantiate the effectiveness of our method, showing a 26.9% improvement in multi-touch force-sensing accuracy for the tactile skin.

Submitted 23 April, 2024; originally announced April 2024.
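
Note: the abstract above names a Regularized Least Squares objective but gives no formula. As a rough illustration only, the sketch below solves a generic Tikhonov (ridge) regularized least-squares problem, min_x ||A x - b||^2 + lam * ||x||^2. The matrix `A`, the vector `b`, the weight `lam`, and the function name are all hypothetical stand-ins; this is not the paper's circuit model or its actual objective.

```python
import numpy as np


def regularized_least_squares(A, b, lam):
    """Solve min_x ||A x - b||^2 + lam * ||x||^2 via the normal equations."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[1]
    # (A^T A + lam * I) x = A^T b; the matrix is positive definite for lam > 0.
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)


# Hypothetical data: 20 noisy measurements of 5 unknown parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
b = A @ x_true + 0.01 * rng.normal(size=20)

x_hat = regularized_least_squares(A, b, lam=0.1)
print(x_hat)
```

The regularization term keeps the linear system well conditioned when measurements are few or noisy, at the cost of biasing the estimate toward zero as `lam` grows.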

arXiv:2404.11536 [pdf, other]

FedPFT: Federated Proxy Fine-Tuning of Foundation Models

Authors: Zhaopeng Peng, Xiaoliang Fan, Yufan Chen, Zheng Wang, Shirui Pan, Chenglu Wen, Ruisheng Zhang, Cheng Wang

Abstract: Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune FMs by allocating sub-FMs to clients in FL; however, this leads to suboptimal performance due to insufficient tuning and inevitable error accumulation of gradients. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method enhancing FM adaptation in downstream tasks through FL by two key modules. First, the sub-FM construction module employs a layer-wise compression approach, facilitating comprehensive FM fine-tuning across all layers by emphasizing those crucial neurons. Second, the sub-FM alignment module conducts a two-step distillation (layer-level and neuron-level) before and during FL fine-tuning, respectively, to reduce gradient error by accurately aligning the sub-FM with the FM under theoretical guarantees. Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT.

Submitted 28 April, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by IJCAI'24

arXiv:2404.04783 [pdf, other]

Fourier Transform-based Wavenumber Domain 3D Imaging in RIS-aided Communication Systems

Authors: Yixuan Huang, Jie Yang, Wankai Tang, Chao-Kai Wen, Shi Jin

Abstract: Radio imaging is rapidly gaining prominence in the design of future communication systems, with the potential to utilize reconfigurable intelligent surfaces (RISs) as imaging apertures. Although the sparsity of targets in three-dimensional (3D) space has led most research to adopt compressed sensing (CS)-based imaging algorithms, these often impose substantial computational and memory burdens. Drawing inspiration from conventional Fourier transform (FT)-based imaging methods, our research seeks to accelerate radio imaging in RIS-aided communication systems. To begin, we introduce a two-stage wavenumber domain 3D imaging technique: first, we modify RIS phase shifts to recover the equivalent channel response from the user equipment to the RIS array, subsequently employing traditional FT-based wavenumber domain methods to produce target images. We also determine the diffraction resolution limits of the system through k-space analysis, taking into account factors including system bandwidth, transmission direction, operating frequency, and the angle subtended by the RIS. Addressing the challenge of limited pilots in communication systems, we unveil an innovative algorithm that merges the strengths of both FT- and CS-based techniques by substituting the expansive sensing matrix with FT-based operators. Our simulation outcomes confirm that our proposed FT-based methods achieve high-quality images while demanding little time, memory, and communication resources.

Submitted 6 April, 2024; originally announced April 2024.

Comments: 16 pages, 11 figures, submitted to IEEE for possible publication

arXiv:2404.00795 [pdf, other]

Towards Practical Requirement Analysis and Verification: A Case Study on Software IP Components in Aerospace Embedded Systems

Authors: Zhi Ma, Cheng Wen, Jie Su, Ming Zhao, Bin Yu, Xu Lu, Cong Tian

Abstract: IP-based software design is a crucial research field that aims to improve efficiency and reliability by reusing complex software components known as intellectual property (IP) components. To ensure the reusability of these components, particularly in security-sensitive software systems, it is necessary to analyze the requirements and perform formal verification for each IP component. However, converting the requirements of IP components from natural language descriptions to temporal logic and subsequently conducting formal verification demands domain expertise and non-trivial manpower. This paper presents a case study on software IP components derived from aerospace embedded systems, with the objective of automating the requirement analysis and verification process. The study begins by employing Large Language Models to convert unstructured natural language into formal specifications. Subsequently, three distinct verification techniques are employed to ascertain whether the source code meets the extracted temporal logic properties. By doing so, five real-world IP components from the China Academy of Space Technology (CAST) have been successfully verified.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2404.00762 [pdf, other]

Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification

Authors: Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, Cong Tian

Abstract: Formal verification provides a rigorous and systematic approach to ensure the correctness and reliability of software systems. Yet, constructing specifications for the full proof relies on domain expertise and non-trivial manpower. In view of such needs, an automated approach for specification synthesis is desired. However, existing automated approaches are limited in their versatility: they either focus only on synthesizing loop invariants for numerical programs, or are tailored for specific types of programs or invariants. Programs involving multiple complicated data types (e.g., arrays, pointers) and code structures (e.g., nested loops, function calls) are often beyond their capabilities. To help bridge this gap, we present AutoSpec, an automated approach to synthesize specifications for automated program verification. It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof. It is driven by static analysis and program verification, and is empowered by large language models (LLMs). AutoSpec addresses the practical challenges in three ways: (1) with AutoSpec driven by static analysis and program verification, LLMs serve as generators of candidate specifications, (2) programs are decomposed to direct the attention of LLMs, and (3) candidate specifications are validated in each round to avoid error accumulation during the interaction with LLMs. In this way, AutoSpec can incrementally and iteratively generate satisfiable and adequate specifications. The evaluation shows its effectiveness and usefulness, as it outperforms existing works by successfully verifying 79% of programs through automatic specification synthesis, a significant improvement of 1.592x. It can also be successfully applied to verify the programs in a real-world X509-parser project.

Submitted 2 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.19501 [pdf, other]

RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

Authors: Ming Yan, Yan Zhang, Shuqiang Cai, Shuqi Fan, Xincheng Lin, Yudi Dai, Siqi Shen, Chenglu Wen, Lan Xu, Yuexin Ma, Cheng Wang

Abstract: Comprehensive capturing of human motions requires both accurate captures of complex poses and precise localization of the human within scenes. Most human pose estimation (HPE) datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, solely using these modalities or a combination of them may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset involving LiDAR, an IMU system, an RGB camera, and an Event camera. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurement data, RGB videos, and Event streams. Through extensive experiments, we demonstrate that RELI11D presents considerable challenges and opportunities as it contains many rapid and complex motions that require precise localization. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes LiDAR Point Cloud, Event stream, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for rapid motions and daily motions and that utilizing the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field.

Submitted 28 March, 2024; originally announced March 2024.

Comments: CVPR2024, Project website: http://www.lidarhumanmotion.net/reli11d/

arXiv:2403.11764 [pdf, other]

RIS-aided Single-frequency 3D Imaging by Exploiting Multi-view Image Correlations

Authors: Yixuan Huang, Jie Yang, Chao-Kai Wen, Shi Jin

Abstract: Retrieving range information in three-dimensional (3D) radio imaging is particularly challenging due to the limited communication bandwidth and pilot resources. To address this issue, we consider a reconfigurable intelligent surface (RIS)-aided uplink communication scenario, generating multiple measurements through RIS phase adjustment. This study successfully realizes 3D single-frequency imaging by exploiting the near-field multi-view image correlations deduced from user mobility. We first highlight the significance of considering anisotropy in multi-view image formation by investigating radar cross-section properties and diffraction resolution limits. We then propose a novel model for joint multi-view 3D imaging that incorporates occlusion effects and anisotropic scattering. These factors lead to slow image support variation and smooth coefficient evolution, which are mathematically modeled as Markov processes. Based on this model, we employ the Expectation Maximization-Turbo-Generalized Approximate Message Passing algorithm for joint multi-view single-frequency 3D imaging with limited measurements. Simulation results reveal the superiority of joint multi-view imaging in terms of enhanced imaging ranges, accuracies, and anisotropy characterization compared to single-view imaging. Combining adjacent observations for joint multi-view imaging enables a reduction in the measurement overhead by 80%.

Submitted 18 March, 2024; originally announced March 2024.

Comments: 16 pages, 12 figures, accepted by IEEE Transactions on Communications

arXiv:2403.00729 [pdf, other]

Can Transformers Capture Spatial Relations between Objects?

Authors: Chuan Wen, Dinesh Jayaraman, Yang Gao

Abstract: Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available at https://sites.google.com/view/spatial-relation.

Submitted 1 March, 2024; originally announced March 2024.

Comments: 21 pages, 8 figures, ICLR 2024

arXiv:2402.18969 [pdf, other]

OHTA: One-shot Hand Avatar via Data-driven Implicit Priors

Authors: Xiaozheng Zheng, Chao Wen, Zhuo Su, Zeran Xu, Zhaohu Li, Yang Zhao, Zhou Xue

Abstract: In this paper, we delve into the creation of one-shot hand avatars, attaining high-fidelity and drivable hand representations swiftly from a single image. With the burgeoning domains of the digital human, the need for quick and personalized hand avatar creation has become increasingly critical. Existing techniques typically require extensive input data and may prove cumbersome or even impractical in certain scenarios. To enhance accessibility, we present a novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed hand avatars from merely one image. OHTA tackles the inherent difficulties of this data-limited problem by learning and utilizing data-driven hand priors. Specifically, we design a hand prior model initially employed for 1) learning various hand priors with available data and subsequently for 2) the inversion and fitting of the target identity with prior knowledge. OHTA demonstrates the capability to create high-fidelity hand avatars with consistent animatable quality, solely relying on a single image. Furthermore, we illustrate the versatility of OHTA through diverse applications, encompassing text-to-avatar conversion, hand editing, and identity latent space manipulation.

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024. Project page: https://zxz267.github.io/OHTA

arXiv:2402.18493 [pdf, other]

Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection

Authors: Xun Huang, Hai Wu, Xin Li, Xiaoliang Fan, Chenglu Wen, Cheng Wang

Abstract: LiDAR-based 3D object detection models have traditionally struggled under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain simulation method, termed DRET, that unifies Dynamics and Rainy Environment Theory to provide a cost-effective means of expanding the available realistic rain data for 3D detection training. Furthermore, we present a Sunny-to-Rainy Knowledge Distillation (SRKD) approach to enhance 3D detection under rainy conditions. Extensive experiments on the large-scale Waymo Open Dataset show that, when combined with the state-of-the-art DSVT model and other classical 3D detectors, our proposed framework demonstrates significant detection accuracy improvements without losing efficiency. Remarkably, our framework also improves detection capabilities under sunny conditions, therefore offering a robust solution for 3D detection regardless of whether the weather is rainy or sunny.

Submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI2024

arXiv:2402.09546 [pdf, other]

How Secure Are Large Language Models (LLMs) for Navigation in Urban Environments?

Authors: Congcong Wen, Jiazhao Liang, Shuaihang Yuan, Hao Huang, Yi Fang

Abstract: In the field of robotics and automation, navigation systems based on Large Language Models (LLMs) have recently shown impressive performance. However, the security aspects of these systems have received relatively less attention. This paper pioneers the exploration of vulnerabilities in LLM-based navigation models in urban outdoor environments, a critical area given the technology's widespread application in autonomous driving, logistics, and emergency services. Specifically, we introduce a novel Navigational Prompt Suffix (NPS) Attack that manipulates LLM-based navigation models by appending gradient-derived suffixes to the original navigational prompt, leading to incorrect actions. We conducted comprehensive experiments on an LLMs-based navigation model that employs various LLMs for reasoning. Our results, derived from the Touchdown and Map2Seq street-view datasets under both few-shot learning and fine-tuning configurations, demonstrate notable performance declines across three metrics in the face of both white-box and black-box attacks. These results highlight the generalizability and transferability of the NPS Attack, emphasizing the need for enhanced security in LLM-based navigation systems. As an initial countermeasure, we propose the Navigational Prompt Engineering (NPE) Defense strategy, concentrating on navigation-relevant keywords to reduce the impact of adversarial suffixes. While initial findings indicate that this strategy enhances navigational safety, there remains a critical need for the wider research community to develop stronger defense methods to effectively tackle the real-world challenges faced by these systems.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2401.11445 [pdf, other]

Towards Non-Robocentric Dynamic Landing of Quadrotor UAVs

Authors: Li-Yu Lo, Boyang Li, Chih-Yung Wen, Ching-Wei Chang

Abstract: In this work, we propose a dynamic landing solution without the need for onboard exteroceptive sensors and an expensive computation unit, where all localization and control modules are carried out on the ground in a non-inertial frame. Our system starts with a relative state estimator of the aerial robot from the perspective of the landing platform, where the state tracking of the UAV is done through a set of onboard LED markers and an on-ground camera; the state is expressed geometrically on manifold, and is returned by Iterated Extended Kalman filter (IEKF) algorithm. Subsequently, a motion planning module is developed to guide the landing process, formulating it as a minimum jerk trajectory by applying the differential flatness property. Considering visibility and dynamic constraints, the problem is solved using quadratic programming, and the final motion primitive is expressed through piecewise polynomials. Through a series of experiments, the applicability of this approach is validated by successfully landing 18 cm x 18 cm quadrotor on a 43 cm x 43 cm platform, exhibiting performance comparable to conventional methods. Finally, we provide comprehensive hardware and software details to the research community for future reference.

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.11439 [pdf, other]

General Flow as Foundation Affordance for Scalable Robot Learning

Authors: Chengbo Yuan, Chuan Wen, Tong Zhang, Yang Gao

Abstract: We address the challenge of acquiring real-world manipulation skills with a scalable framework. Inspired by the success of large-scale auto-regressive prediction in Large Language Models (LLMs), we hold the belief that identifying an appropriate prediction target capable of leveraging large-scale datasets is crucial for achieving efficient and universal learning. Therefore, we propose to utilize flow, which represents the future trajectories of 3D points on objects of interest, as an ideal prediction target in robot learning. To exploit scalable data resources, we turn our attention to cross-embodiment datasets. We develop, for the first time, a language-conditioned prediction model directly from large-scale RGBD human video datasets. Our predicted flow offers actionable geometric and physics guidance, thus facilitating stable zero-shot skill transfer in real-world scenarios. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any additional training, our method achieves an impressive 81% success rate in human-to-robot skill transfer, covering 18 tasks in 6 scenes. Our framework features the following benefits: (1) scalability: leveraging cross-embodiment data resources; (2) universality: multiple object categories, including rigid, articulated, and soft bodies; (3) stable skill transfer: providing actionable guidance with a small inference domain-gap. These lead to a new pathway towards scalable general robot learning. Data, code, and model weights will be made publicly available.

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.00025 [pdf, other]

Any-point Trajectory Modeling for Policy Learning

Authors: Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, Pieter Abbeel

Abstract: Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: https://xingyu-lin.github.io/atm.

Submitted 16 February, 2024; v1 submitted 28 December, 2023; originally announced January 2024.

Comments: 16 pages, 13 figures

arXiv:2312.14495 [pdf, other]

Beam Foreseeing in Millimeter-Wave Systems with Situational Awareness: Fundamental Limits via Cramér-Rao Lower Bound

Authors: Wan-Ting Shih, Chao-Kai Wen, Shang-Ho Tsai, Shi Jin, Chau Yuen

Abstract: Millimeter-wave (mmWave) networks offer the potential for high-speed data transfer and precise localization, leveraging large antenna arrays and extensive bandwidths. However, these networks are challenged by significant path loss and susceptibility to blockages. In this study, we delve into the use of situational awareness for beam prediction within the 5G NR beam management framework. We introduce an analytical framework based on the Cramér-Rao Lower Bound, enabling the quantification of 6D position-related information of geometric reflectors. This includes both 3D locations and 3D orientation biases, facilitating accurate determinations of the beamforming gain achievable by each reflector or candidate beam. This framework empowers us to predict beam alignment performance at any given location in the environment, ensuring uninterrupted wireless access. Our analysis offers critical insights for choosing the most effective beam and antenna module strategies, particularly in scenarios where communication stability is threatened by blockages. Simulation results show that our approach closely approximates the performance of an ideal, Oracle-based solution within the existing 5G NR beam management system.

Submitted 22 December, 2023; originally announced December 2023.

Comments: 16 pages, 10 figures; IEEE Transactions on Wireless Communications

arXiv:2312.14453 [pdf, other]

Hybrid Aerodynamics-Based Model Predictive Control for a Tail-Sitter UAV

Authors: Bailun Jiang, Boyang Li, Ching-Wei Chang, Chih-Yung Wen

Abstract: It is challenging to model and control a tail-sitter unmanned aerial vehicle (UAV) because its blended wing body generates complicated nonlinear aerodynamic effects, such as wing lift, fuselage drag, and propeller-wing interactions. We therefore devised a hybrid aerodynamic modeling method and model predictive control (MPC) design for a quadrotor tail-sitter UAV. The hybrid model consists of the Newton-Euler equation, which describes quadrotor dynamics, and a feedforward neural network, which learns residual aerodynamic effects. This hybrid model exhibits high predictive accuracy at a low computational cost and was used to implement hybrid MPC, which optimizes the throttle, pitch angle, and roll angle for position tracking. The controller performance was validated in real-world experiments, which obtained a 57% tracking error reduction compared with conventional nonlinear MPC. External wind disturbance was also introduced and the experimental results confirmed the robustness of the controller to these conditions.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2312.08664 [pdf, other]

SPEAL: Skeletal Prior Embedded Attention Learning for Cross-Source Point Cloud Registration

Authors: Kezheng Xiong, Maoji Zheng, Qingshan Xu, Chenglu Wen, Siqi Shen, Cheng Wang

Abstract: Point cloud registration, a fundamental task in 3D computer vision, has remained largely unexplored in cross-source point clouds and unstructured scenes. The primary challenges arise from noise, outliers, and variations in scale and density. However, neglected geometric natures of point clouds restricts the performance of current methods. In this paper, we propose a novel method termed SPEAL to leverage skeletal representations for effective learning of intrinsic topologies of point clouds, facilitating robust capture of geometric intricacy. Specifically, we design the Skeleton Extraction Module to extract skeleton points and skeletal features in an unsupervised manner, which is inherently robust to noise and density variances. Then, we propose the Skeleton-Aware GeoTransformer to encode high-level skeleton-aware features. It explicitly captures the topological natures and inter-point-cloud skeletal correlations with the noise-robust and density-invariant skeletal representations. Next, we introduce the Correspondence Dual-Sampler to facilitate correspondences by augmenting the correspondence set with skeletal correspondences. Furthermore, we construct a challenging novel large-scale cross-source point cloud dataset named KITTI CrossSource for benchmarking cross-source point cloud registration methods. Extensive quantitative and qualitative experiments are conducted to demonstrate our approach's superiority and robustness on both cross-source and same-source datasets. To the best of our knowledge, our approach is the first to facilitate point cloud registration with skeletal geometric priors.

Submitted 3 March, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI2024

arXiv:2312.08591 [pdf, other]

Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints

Authors: Muxin Zhang, Qiao Feng, Zhuo Su, Chao Wen, Zhou Xue, Kun Li

Abstract: 3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in losing local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details. To achieve this, we employ the Fourier occupancy field (FOF) representation, enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy, our method can seamlessly integrate the details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry, we introduce a compact spherical embedding of 3D joints. This allows for an effective guidance of pose during the generation process. Additionally, our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate the capability of our method to ensure global structure, local details, high resolution, and low computational cost simultaneously. More results and the code can be found on our project page at http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.

Submitted 6 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2311.15950 [pdf, other]

Auto-CsiNet: Scenario-customized Automatic Neural Network Architecture Generation for Massive MIMO CSI Feedback

Authors: Xiangyi Li, Jiajia Guo, Chao-Kai Wen, Shi Jin

Abstract: Deep learning has revolutionized the design of the channel state information (CSI) feedback module in wireless communications. However, designing the optimal neural network (NN) architecture for CSI feedback can be a laborious and time-consuming process. Manual design can be prohibitively expensive for customizing NNs to different scenarios. This paper proposes using neural architecture search (NAS) to automate the generation of scenario-customized CSI feedback NN architectures, thereby maximizing the potential of deep learning in exclusive environments. By employing automated machine learning and gradient-descent-based NAS, an efficient and cost-effective architecture design process is achieved. The proposed approach leverages implicit scene knowledge, integrating it into the scenario customization process in a data-driven manner, and fully exploits the potential of deep learning for each specific scenario. To address the issue of excessive search, early stopping and elastic selection mechanisms are employed, enhancing the efficiency of the proposed scheme. The experimental results demonstrate that the automatically generated architecture, known as Auto-CsiNet, outperforms manually-designed models in both reconstruction performance (achieving approximately a 14% improvement) and complexity (reducing it by approximately 50%). Furthermore, the paper analyzes the impact of the scenario on the NN architecture and its capacity.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 16 pages, 10 figures, 6 tables

arXiv:2311.15313 [pdf, ps, other]

Low-Complexity Joint Beamforming for RIS-Assisted MU-MISO Systems Based on Model-Driven Deep Learning

Authors: Weijie Jin, Jing Zhang, Chao-Kai Wen, Shi Jin, Xiao Li, Shuangfeng Han

Abstract: Reconfigurable intelligent surfaces (RIS) can improve signal propagation environments by adjusting the phase of the incident signal. However, optimizing the phase shifts jointly with the beamforming vector at the access point is challenging due to the non-convex objective function and constraints. In this study, we propose an algorithm based on weighted minimum mean square error optimization and power iteration to maximize the weighted sum rate (WSR) of a RIS-assisted downlink multi-user multiple-input single-output system. To further improve performance, a model-driven deep learning (DL) approach is designed, where trainable variables and graph neural networks are introduced to accelerate the convergence of the proposed algorithm. We also extend the proposed method to include beamforming with imperfect channel state information and derive a two-timescale stochastic optimization algorithm. Simulation results show that the proposed algorithm outperforms state-of-the-art algorithms in terms of complexity and WSR. Specifically, the model-driven DL approach has a runtime that is approximately 3% of the state-of-the-art algorithm to achieve the same performance. Additionally, the proposed algorithm with 2-bit phase shifters outperforms the compared algorithm with continuous phase shift.

Submitted 26 November, 2023; originally announced November 2023.

Comments: 14 pages, 9 figures, 2 tables. This paper has been accepted for publication by the IEEE Transactions on Wireless Communications. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2311.06916 [pdf]

TSViT: A Time Series Vision Transformer for Fault Diagnosis

Authors: Shouhua Zhang, Jiehan Zhou, Xue Ma, Chenglin Wen, Susanna Pirttikangas, Chen Yu, Weishan Zhang, Chunsheng Yang

Abstract: Traditional fault diagnosis methods using Convolutional Neural Networks (CNNs) face limitations in capturing temporal features (i.e., the variation of vibration signals over time). To address this issue, this paper introduces a novel model, the Time Series Vision Transformer (TSViT), specifically designed for fault diagnosis. On one hand, TSViT model integrates a convolutional layer to segment vibration signals and capture local features. On the other hand, it employs a transformer encoder to learn long-term temporal information. The experimental results with other methods on two distinct datasets validate the effectiveness and generalizability of TSViT with a comparative analysis of its hyperparameters' impact on model performance, computational complexity, and overall parameter quantity. TSViT reaches average accuracies of 100% and 99.99% on two test sets, correspondingly.

Submitted 12 November, 2023; originally announced November 2023.

arXiv:2311.00964 [pdf, other]

doi 10.1145/3637528.3671521

On Finding Bi-objective Pareto-optimal Fraud Prevention Rule Sets for Fintech Applications

Authors: Chengyao Wen, Yin Lou

Abstract: Rules are widely used in Fintech institutions to make fraud prevention decisions, since rules are highly interpretable thanks to their intuitive if-then structure. In practice, a two-stage framework of fraud prevention decision rule set mining is usually employed in large Fintech institutions; Stage 1 generates a potentially large pool of rules and Stage 2 aims to produce a refined rule subset according to some criteria (typically based on precision and recall). This paper focuses on improving the flexibility and efficacy of this two-stage framework, and is concerned with finding high-quality rule subsets in a bi-objective space (such as precision and recall). To this end, we first introduce a novel algorithm called SpectralRules that directly generates a compact pool of rules in Stage 1 with high diversity. We empirically find such diversity improves the quality of the final rule subset. In addition, we introduce an intermediate stage between Stage 1 and 2 that adopts the concept of Pareto optimality and aims to find a set of non-dominated rule subsets, which constitutes a Pareto front. This intermediate stage greatly simplifies the selection criteria and increases the flexibility of Stage 2. For this intermediate stage, we propose a heuristic-based framework called PORS and we identify that the core of PORS is the problem of solution selection on the front (SSF). We provide a systematic categorization of the SSF problem and a thorough empirical evaluation of various SSF methods on both public and proprietary datasets. On two real application scenarios within Alipay, we demonstrate the advantages of our proposed methodology over existing work.

Submitted 27 June, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

arXiv:2311.00390 [pdf, other]

A Modular Pneumatic Soft Gripper Design for Aerial Grasping and Landing

Authors: Hiu Ching Cheung, Ching-Wei Chang, Bailun Jiang, Chih-Yung Wen, Henry K. Chu

Abstract: Aerial robots have garnered significant attention due to their potential applications in various industries, such as inspection, search and rescue, and drone delivery. Successful missions often depend on the ability of these robots to grasp and land effectively. This paper presents a novel modular soft gripper design tailored explicitly for aerial grasping and landing operations. The proposed modular pneumatic soft gripper incorporates a feed-forward proportional controller to regulate pressure, enabling compliant gripping capabilities. The modular connectors of the soft fingers offer two configurations for the 4-tip soft gripper, H-base (cylindrical) and X-base (spherical), allowing adaptability to different target objects. Additionally, the gripper can serve as a soft landing gear when deflated, eliminating the need for an extra landing gear. This design reduces weight, simplifies aerial manipulation control, and enhances flight efficiency. We demonstrate the efficacy of indoor aerial grasping and achieve a maximum payload of 217 g using the proposed soft aerial vehicle and its H-base pneumatic soft gripper (808 g).

Submitted 25 March, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: 7 pages, 13 figures, accepted by IEEE RoboSoft 2024

arXiv:2310.07433 [pdf, other]

Imitation Learning from Observation with Automatic Discount Scheduling

Authors: Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao

Abstract: Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.

Submitted 7 February, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted by ICLR 2024

arXiv:2309.15941 [pdf, other]

AutoEncoding Tree for City Generation and Applications

Authors: Wenyu Han, Congcong Wen, Lazarus Chok, Yan Liang Tan, Sheung Lung Chan, Hang Zhao, Chen Feng

Abstract: City modeling and generation have attracted an increased interest in various applications, including gaming, urban planning, and autonomous driving. Unlike previous works focused on the generation of single objects or indoor scenes, the huge volumes of spatial data in cities pose a challenge to the generative models. Furthermore, few publicly available 3D real-world city datasets also hinder the development of methods for city generation. In this paper, we first collect over 3,000,000 geo-referenced objects for the city of New York, Zurich, Tokyo, Berlin, Boston and several other large cities. Based on this dataset, we propose AETree, a tree-structured auto-encoder neural network, for city generation. Specifically, we first propose a novel Spatial-Geometric Distance (SGD) metric to measure the similarity between building layouts and then construct a binary tree over the raw geometric data of building based on the SGD metric. Next, we present a tree-structured network whose encoder learns to extract and merge spatial information from bottom-up iteratively. The resulting global representation is reversely decoded for reconstruction or generation. To address the issue of long-dependency as the level of the tree increases, a Long Short-Term Memory (LSTM) Cell is employed as a basic network element of the proposed AETree. Moreover, we introduce a novel metric, Overlapping Area Ratio (OAR), to quantitatively evaluate the generation results. Experiments on the collected dataset demonstrate the effectiveness of the proposed model on 2D and 3D city generation. Furthermore, the latent features learned by AETree can serve downstream urban planning applications.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.04590 [pdf, other]

Robotic Defect Inspection with Visual and Tactile Perception for Large-scale Components

Authors: Arpit Agarwal, Abhiroop Ajith, Chengtao Wen, Veniamin Stryzheus, Brian Miller, Matthew Chen, Micah K. Johnson, Jose Luis Susa Rincon, Justinian Rosca, Wenzhen Yuan

Abstract: In manufacturing processes, surface inspection is a key requirement for quality assessment and damage localization. Due to this, automated surface anomaly detection has become a promising area of research in various industrial inspection systems. A particular challenge in industries with large-scale components, like aircraft and heavy machinery, is inspecting large parts with very small defect dimensions. Moreover, these parts can be of curved shapes. To address this challenge, we present a 2-stage multi-modal inspection pipeline with visual and tactile sensing. Our approach combines the best of both visual and tactile sensing by identifying and localizing defects using a global view (vision) and using the localized area for tactile scanning for identifying remaining defects. To benchmark our approach, we propose a novel real-world dataset with multiple metallic defect types per image, collected in the production environments on real aerospace manufacturing parts, as well as online robot experiments in two environments. Our approach is able to identify 85% defects using Stage I and identify 100% defects after Stage II. The dataset is publicly available at https://zenodo.org/record/8327713

Submitted 8 September, 2023; originally announced September 2023.

Comments: This is a pre-print for International Conference on Intelligent Robots and Systems 2023 publication

arXiv:2308.11335 [pdf, other]

Graph Neural Network-Enhanced Expectation Propagation Algorithm for MIMO Turbo Receivers

Authors: Xingyu Zhou, Jing Zhang, Chao-Kai Wen, Shi Jin, Shuangfeng Han

Abstract: Deep neural networks (NNs) are considered a powerful tool for balancing the performance and complexity of multiple-input multiple-output (MIMO) receivers due to their accurate feature extraction, high parallelism, and excellent inference ability. Graph NNs (GNNs) have recently demonstrated outstanding capability in learning enhanced message passing rules and have shown success in overcoming the drawback of inaccurate Gaussian approximation of expectation propagation (EP)-based MIMO detectors. However, the application of the GNN-enhanced EP detector to MIMO turbo receivers is underexplored and non-trivial due to the requirement of extrinsic information for iterative processing. This paper proposes a GNN-enhanced EP algorithm for MIMO turbo receivers, which realizes the turbo principle of generating extrinsic information from the MIMO detector through a specially designed training procedure. Additionally, an edge pruning strategy is designed to eliminate redundant connections in the original fully connected model of the GNN utilizing the correlation information inherently from the EP algorithm. Edge pruning reduces the computational cost dramatically and enables the network to focus more attention on the weights that are vital for performance. Simulation results and complexity analysis indicate that the proposed MIMO turbo receiver outperforms the EP turbo approaches by over 1 dB at the bit error rate of $10^{-5}$, exhibits performance equivalent to state-of-the-art receivers with 2.5 times shorter running time, and adapts to various scenarios.

Submitted 22 August, 2023; originally announced August 2023.

Comments: 15 pages, 12 figures, 2 tables. This paper has been accepted for publication by the IEEE Transactions on Signal Processing. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2308.08855 [pdf, other]

Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling

Authors: Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, Xiaojie Jin

Abstract: To bridge the physical and virtual worlds for rapidly developed VR/AR applications, the ability to realistically drive 3D full-body avatars is of great significance. Although real-time body tracking with only the head-mounted displays (HMDs) and hand controllers is heavily under-constrained, a carefully designed end-to-end neural network is of great potential to solve the problem by learning from large-scale motion data. To this end, we propose a two-stage framework that can obtain accurate and smooth full-body motions with the three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them as spatiotemporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. Furthermore, we design a set of loss terms to constrain the task of a high degree of freedom, such that we can exploit the potential of our joint-level modeling. With extensive experiments on the AMASS motion dataset and real-captured data, we validate the effectiveness of our designs and show our proposed method can achieve more accurate and smooth motion compared to existing approaches.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023. Project page: https://zxz267.github.io/AvatarJLM

arXiv:2308.06562 [pdf, other]

Gradient-Based Markov Chain Monte Carlo for MIMO Detection

Authors: Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, Shi Jin

Abstract: Accurately detecting symbols transmitted over multiple-input multiple-output (MIMO) wireless channels is crucial in realizing the benefits of MIMO techniques. However, optimal MIMO detection is associated with a complexity that grows exponentially with the MIMO dimensions and quickly becomes impractical. Recently, stochastic sampling-based Bayesian inference techniques, such as Markov chain Monte Carlo (MCMC), have been combined with the gradient descent (GD) method to provide a promising framework for MIMO detection. In this work, we propose to efficiently approach optimal detection by exploring the discrete search space via MCMC random walk accelerated by Nesterov's gradient method. Nesterov's GD guides MCMC to make efficient searches without the computationally expensive matrix inversion and line search. Our proposed method operates using multiple GDs per random walk, achieving sufficient descent towards important regions of the search space before adding random perturbations, guaranteeing high sampling efficiency. To provide augmented exploration, extra samples are derived through the trajectory of Nesterov's GD by simple operations, effectively supplementing the sample list for statistical inference and boosting the overall MIMO detection performance. Furthermore, we design an early stopping tactic to terminate unnecessary further searches, remarkably reducing the complexity. Simulation results and complexity analysis reveal that the proposed method achieves exceptional performance in both uncoded and coded MIMO systems, adapts to realistic channel models, and scales well to large MIMO dimensions.

Submitted 5 December, 2023; v1 submitted 12 August, 2023; originally announced August 2023.

Comments: 16 pages, 12 figures, 2 tables. This paper has been accepted for publication by the IEEE Transactions on Wireless Communications. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2308.03016 [pdf, other]

Shaping a Smarter Electromagnetic Landscape: IAB, NCR, and RIS in 5G Standard and Future 6G

Authors: Chao-Kai Wen, Lung-Sheng Tsai, Arman Shojaeifard, Pei-Kai Liao, Kai-Kit Wong, Chan-Byoung Chae

Abstract: The main objective of 5G and beyond networks is to provide an optimal user experience in terms of throughput and reliability, irrespective of location and time. To achieve this, traditional fixed macro base station deployments are being replaced by more innovative and flexible solutions, such as wireless backhaul and relays. This article focuses on the evolution and standardization of these advancements, which are shaping the electromagnetic landscape. Specifically, we explore Integrated Access and Backhaul (IAB) nodes, which offer a cost-efficient and agile alternative to fiber backhaul. We also discuss Network-Controlled Repeaters (NCRs) and the emergence of Reconfigurable Intelligent Surfaces (RIS) actively adapting the wireless environment. The article provides an overview of the 5G features and ongoing developments in 3GPP Release 18 related to these intelligent EM entities, highlighting the expected evolution of future wireless networks in terms of architecture, operations, and control signals.

Submitted 18 January, 2024; v1 submitted 6 August, 2023; originally announced August 2023.

Comments: 8 pages, 5 figures, 1 table. This work has been accepted to publish in IEEE Communications Standards Magazine

arXiv:2307.16173 [pdf]

doi 10.1109/TIE.2023.3265027

Data-Driven Modeling with Experimental Augmentation for the Modulation Strategy of the Dual-Active-Bridge Converter

Authors: Xinze Li, Josep Pou, Jiaxin Dong, Fanfan Lin, Changyun Wen, Suvajit Mukherjee, Xin Zhang

Abstract: For the performance modeling of power converters, the mainstream approaches are essentially knowledge-based, suffering from heavy manpower burden and low modeling accuracy. Recent emerging data-driven techniques greatly relieve human reliance by automatic modeling from simulation data. However, model discrepancy may occur due to unmodeled parasitics, deficient thermal and magnetic models, unpredictable ambient conditions, etc. These inaccurate data-driven models based on pure simulation cannot represent the practical performance in physical world, hindering their applications in power converter modeling. To alleviate model discrepancy and improve accuracy in practice, this paper proposes a novel data-driven modeling with experimental augmentation (D2EA), leveraging both simulation data and experimental data. In D2EA, simulation data aims to establish basic functional landscape, and experimental data focuses on matching actual performance in real world. The D2EA approach is instantiated for the efficiency optimization of a hybrid modulation for neutral-point-clamped dual-active-bridge (NPC-DAB) converter. The proposed D2EA approach realizes 99.92% efficiency modeling accuracy, and its feasibility is comprehensively validated in 2-kW hardware experiments, where the peak efficiency of 98.45% is attained. Overall, D2EA is data-light and can achieve highly accurate and highly practical data-driven models in one shot, and it is scalable to other applications, effortlessly.

Submitted 2 August, 2023; v1 submitted 30 July, 2023; originally announced July 2023.

Comments: 11 pages

Journal ref: IEEE.Trans.Ind.Electron. Early Access (2023) 1-11

arXiv:2307.15290 [pdf, other]

ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation

Authors: Cheng Wen, Xianghui Sun, Shuaijiang Zhao, Xiaoquan Fang, Liangyu Chen, Wei Zou

Abstract: This paper presents the development and evaluation of ChatHome, a domain-specific language model (DSLM) designed for the intricate field of home renovation. Considering the proven competencies of large language models (LLMs) like GPT-4 and the escalating fascination with home renovation, this study endeavors to reconcile these aspects by generating a dedicated model that can yield high-fidelity, precise outputs relevant to the home renovation arena. ChatHome's novelty rests on its methodology, fusing domain-adaptive pretraining and instruction-tuning over an extensive dataset. This dataset includes professional articles, standard documents, and web content pertinent to home renovation. This dual-pronged strategy is designed to ensure that our model can assimilate comprehensive domain knowledge and effectively address user inquiries. Via thorough experimentation on diverse datasets, both universal and domain-specific, including the freshly introduced "EvalHome" domain dataset, we substantiate that ChatHome not only amplifies domain-specific functionalities but also preserves its versatility.

Submitted 28 July, 2023; originally announced July 2023.

Comments: ChatHome,DSLM for home renovation

arXiv:2307.15280 [pdf, other]

Active RIS-Assisted MIMO-OFDM System: Analyses and Prototype Measurements

Authors: De-Ming Chian, Feng-Ji Chen, Yu-Chen Chang, Chao-Kai Wen, Chi-Hung Wu, Fu-Kang Wang, Kai-Kit Wong, Chan-Byoung Chae

Abstract: In this study, we develop an active reconfigurable intelligent surface (RIS)-assisted multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) prototype compliant with the 5G New Radio standard at 3.5 GHz. The experimental results clearly indicate that active RIS plays a vital role in enhancing MIMO performance, surpassing passive RIS. Furthermore, when considering factors such as complexity, energy consumption, and performance, the comparative evaluation between passive RIS and active RIS reinforces the critical role of active RIS in MIMO systems. These findings underscore the practical significance of active RIS in improving MIMO gain in 5G scenarios.

Submitted 14 November, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: 5 pages, 5 figures, 1 table, accepted by IEEE Communications Letters, for demo video see: https://www.youtube.com/watch?v=3R6eZXizwns

arXiv:2307.15266 [pdf, other]

RSGPT: A Remote Sensing Vision Language Model and Benchmark

Authors: Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li

Abstract: The emergence of large-scale large language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, lacking comprehensive, large-scale image-text datasets that are aligned and suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. These results are comparable to state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this captivating idea, in this work, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that either employ model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.12049 [pdf, other]

Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Authors: Cheng Wen, Baosheng Yu, Rao Fu, Dacheng Tao

Abstract: A generative model for high-fidelity point clouds is of great importance in synthesizing 3d environments for applications such as autonomous driving and robotics. Despite the recent success of deep generative models for 2d images, it is non-trivial to generate 3d point clouds without a comprehensive understanding of both local and global geometric structures. In this paper, we devise a new 3d point cloud generation framework using a divide-and-conquer approach, where the whole generation process can be divided into a set of patch-wise generation tasks. Specifically, all patch generators are based on learnable priors, which aim to capture the information of geometry primitives. We introduce point- and patch-wise transformers to enable the interactions between points and patches. Therefore, the proposed divide-and-conquer approach contributes to a new understanding of point cloud generation from the geometry constitution of 3d shapes. Experimental results on a variety of object categories from the most popular point cloud dataset, ShapeNet, show the effectiveness of the proposed patch-wise point cloud generation, where it clearly outperforms recent state-of-the-art methods for high-fidelity point cloud generation.

Submitted 22 July, 2023; originally announced July 2023.

arXiv:2307.07936 [pdf, other]

doi 10.1109/TCOMM.2023.3294954

Joint Beam Management and SLAM for mmWave Communication Systems

Authors: Hang Que, Jie Yang, Chao-Kai Wen, Shuqiang Xia, Xiao Li, Shi Jin

Abstract: The millimeter-wave (mmWave) communication technology, which employs large-scale antenna arrays, enables inherent sensing capabilities. Simultaneous localization and mapping (SLAM) can utilize channel multipath angle estimates to realize integrated sensing and communication design in 6G communication systems. However, existing works have ignored the significant overhead required by the mmWave beam management when implementing SLAM with angle estimates. This study proposes a joint beam management and SLAM design that utilizes the strong coupling between the radio map and channel multipath for simultaneous beam management, localization, and mapping. In this approach, we first propose a hierarchical sweeping and sensing service design. The path angles are estimated in the hierarchical sweeping, enabling angle-based SLAM with the aid of an inertial measurement unit (IMU) to realize sensing service. Then, feature-aided tracking is proposed that utilizes prior angle information generated from the radio map and IMU. Finally, a switching module is introduced to enable flexible switching between hierarchical sweeping and feature-aided tracking. Simulations show that the proposed joint design can achieve sub-meter level localization and mapping accuracy (with an error < 0.5 m). Moreover, the beam management overhead can be reduced by approximately 40% in different wireless environments.

Submitted 15 July, 2023; originally announced July 2023.

Journal ref: IEEE Transactions on Communications, early access, July 2023

arXiv:2307.04013 [pdf, other]

doi 10.24963/ijcai.2023/84

BPNet: Bézier Primitive Segmentation on 3D Point Clouds

Authors: Rao Fu, Cheng Wen, Qian Li, Xiao Xiao, Pierre Alliez

Abstract: This paper proposes BPNet, a novel end-to-end deep learning framework to learn Bézier primitive segmentation on 3D point clouds. The existing works treat different primitive types separately, thus limiting them to finite shape categories. To address this issue, we seek a generalized primitive segmentation on point clouds. Taking inspiration from Bézier decomposition on NURBS models, we transfer it to guide point cloud segmentation casting off primitive types. A joint optimization framework is proposed to learn Bézier primitive segmentation and geometric fitting simultaneously on a cascaded architecture. Specifically, we introduce a soft voting regularizer to improve primitive segmentation and propose an auto-weight embedding module to cluster point features, making the network more robust and generic. We also introduce a reconstruction module where we successfully process multiple CAD models with different primitives simultaneously. We conducted extensive experiments on the synthetic ABC dataset and real-scan datasets to validate and compare our approach with different baseline methods. Experiments show superior performance over previous work in terms of segmentation, with a substantially faster inference speed.

Submitted 15 October, 2023; v1 submitted 8 July, 2023; originally announced July 2023.

arXiv:2305.12669 [pdf, other]

Angle-based SLAM on 5G mmWave Systems: Design, Implementation, and Measurement

Authors: Jie Yang, Chao-Kai Wen, Jing Xu, Hang Que, Haikun Wei, Shi Jin

Abstract: Simultaneous localization and mapping (SLAM) is a key technology that provides user equipment (UE) tracking and environment mapping services, enabling the deep integration of sensing and communication. The millimeter-wave (mmWave) communication, with its larger bandwidths and antenna arrays, inherently facilitates more accurate delay and angle measurements than sub-6 GHz communication, thereby providing opportunities for SLAM. However, none of the existing works have realized the SLAM function under the 5G New Radio (NR) standard due to specification and hardware constraints. In this study, we investigate how 5G mmWave communication systems can achieve situational awareness without changing the transceiver architecture and 5G NR standard. We implement 28 GHz mmWave transceivers that deploy OFDM-based 5G NR waveform with 160 MHz channel bandwidth, and we realize beam management following the 5G NR. Furthermore, we develop an efficient successive cancellation-based angle extraction approach to obtain angles of arrival and departure from the reference signal received power measurements. On the basis of angle measurements, we propose an angle-only SLAM algorithm to track UE and map features in the radio environment. Thorough experiments and ray tracing-based computer simulations verify that the proposed angle-based SLAM can achieve sub-meter level localization and mapping accuracy with a single base station and without the requirement of strict time synchronization. Our experiments also reveal many propagation properties critical to the success of SLAM in 5G mmWave communication systems.

Submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted by the IEEE Internet of Things Journal

arXiv:2305.12332 [pdf, other]

doi 10.1109/TWC.2023.3266343

Joint Localization and Environment Sensing by Harnessing NLOS Components in RIS-aided mmWave Communication Systems

Authors: Yixuan Huang, Jie Yang, Wankai Tang, Chao-Kai Wen, Shuqiang Xia, Shi Jin

Abstract: This study explores the use of non-line-of-sight (NLOS) components in millimeter-wave (mmWave) communication systems for joint localization and environment sensing. The radar cross section (RCS) of a reconfigurable intelligent surface (RIS) is calculated to develop a general path gain model for RISs and traditional scatterers. The results show that RISs have a greater potential to assist in localization due to their ability to maintain high RCSs and create strong NLOS links. A one-stage linear weighted least squares estimator is proposed to simultaneously determine user equipment (UE) locations, velocities, and scatterer (or RIS) locations using line-of-sight (LOS) and NLOS paths. The estimator supports environment sensing and UE localization even using only NLOS paths. A second-stage estimator is also introduced to improve environment sensing accuracy by considering the nonlinear relationship between UE and scatterer locations. Simulation results demonstrate the effectiveness of the proposed estimators in rich scattering environments and the benefits of using NLOS paths for improving UE location accuracy and assisting in environment sensing. The effects of RIS number, size, and deployment on localization performance are also analyzed.

Submitted 20 May, 2023; originally announced May 2023.

Comments: 32 pages, 12 figures, accepted by IEEE Transactions on Wireless Communications

Journal ref: IEEE Transactions on Wireless Communications, early access, April 2023

arXiv:2305.12308 [pdf, other]

MIMO Evolution toward 6G: End-User-Centric Collaborative MIMO

Authors: Lung-Sheng Tsai, Shang-Ling Shih, Pei-Kai Liao, Chao-Kai Wen

Abstract: In 6G, the trend of transitioning from massive antenna elements to even more massive ones is continued. However, installing additional antennas in the limited space of user equipment (UE) is challenging, resulting in limited capacity scaling gain for end users, despite network side support for increasing numbers of antennas. To address this issue, we propose an end-user-centric collaborative MIMO (UE-CoMIMO) framework that groups several fixed or portable devices to provide a virtual abundance of antennas. This article outlines how advanced L1 relays and conventional relays enable device collaboration to offer diversity, rank, and localization enhancements. We demonstrate through system-level simulations how the UE-CoMIMO approaches lead to significant performance gains. Lastly, we discuss necessary research efforts to make UE-CoMIMO available for 6G and future research directions.

Submitted 14 November, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

Comments: 7 pages, 5 figures, 1 table. This work has been accepted in IEEE Communications Magazine

arXiv:2305.05726 [pdf, other]

Vision-Language Models in Remote Sensing: Current Progress and Future Trends

Authors: Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, Xiao Xiang Zhu

Abstract: The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide intelligent solutions close to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in remote sensing primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of the image. This makes them better suited for tasks requiring visual and textual understanding, such as image captioning, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting challenges, and identifying potential research opportunities.

Submitted 2 April, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted by IEEE Geoscience and Remote Sensing Magazine

arXiv:2305.02464 [pdf, ps, other]

Multi-timescale Channel Customization for Transmission Design in RIS-assisted MIMO Systems

Authors: Weicong Chen, Chao-Kai Wen, Xiao Li, Shi Jin

Abstract: The performance of transmission schemes is heavily influenced by the wireless channel, which is typically considered an uncontrollable factor. However, the introduction of reconfigurable intelligent surfaces (RISs) to wireless communications enables the customization of a preferred channel for adopted transmissions by reshaping electromagnetic waves. In this study, we propose multi-timescale channel customization for RIS-assisted multiple-input multiple-output systems to facilitate transmission design. Specifically, we customize a high-rank channel for spatial multiplexing (SM) transmission and a highly correlated rank-1 channel for beamforming (BF) transmission by designing the phase shifters of the RIS with statistical channel state information in the angle-coherent time to improve spectral efficiency (SE). We derive closed-form expressions for the approximation and upper bound of the ergodic SE and compare them to investigate the relative SE performance of SM and BF transmissions. In terms of reliability enhancement, we customize a fast-changing channel in the symbol timescale to achieve more diversity gain for SM and BF transmissions. Extensive numerical results demonstrate that flexible customization of channel characteristics for a specific transmission scheme can achieve a tradeoff between SE and bit error ratio performance.

Submitted 3 May, 2023; originally announced May 2023.

Comments: Accepted by IEEE JSAC special issue on Beyond Shannon Communications: A Paradigm Shift to Catalyze 6G

arXiv:2305.02120 [pdf, ps, other]

doi 10.1109/TWC.2022.3226442

Channel Customization for Limited Feedback in RIS-assisted FDD Systems

Authors: Weicong Chen, Chao-Kai Wen, Xiao Li, Michail Matthaiou, Shi Jin

Abstract: Reconfigurable intelligent surfaces (RISs) represent a pioneering technology to realize smart electromagnetic environments by reshaping the wireless channel. Jointly designing the transceiver and RIS relies on the channel state information (CSI), whose feedback has not been investigated in multi-RIS-assisted frequency division duplexing systems. In this study, the limited feedback of the RIS-assisted wireless channel is examined by capitalizing on the ability of the RIS in channel customization. By configuring the phase shifters of the surfaces using statistical CSI, we customize a sparse channel in rich-scattering environments, which significantly reduces the feedback overhead in designing the transceiver and RISs. Since the channel is customized in terms of singular value decomposition (SVD) with full-rank, the optimal SVD transceiver can be approached without a matrix decomposition and feeding back the complete channel parameters. The theoretical spectral efficiency (SE) loss of the proposed transceiver and RIS design is derived by considering the limited CSI quantization. To minimize the SE loss, a bit partitioning algorithm that splits the limited number of bits to quantize the CSI is developed. Extensive numerical results show that the channel customization-based transceiver with reduced CSI can achieve satisfactory performance compared with the optimal transceiver with full CSI. Given the limited number of feedback bits, the bit partitioning algorithm can minimize the SE loss by adaptively allocating bits to quantize the channel parameters.

Submitted 3 May, 2023; originally announced May 2023.

Comments: Accepted by IEEE Transactions on Wireless Communications (https://ieeexplore.ieee.org/document/9976945)

arXiv:2304.03713 [pdf, other]

A Novel Channel Model for Reconfigurable Intelligent Surfaces with Consideration of Polarization and Switch Impairments

Authors: De-Ming Chian, Chao-Kai Wen, Chi-Hung Wu, Fu-Kang Wang, Kai-Kit Wong

Abstract: Future wireless networks require the ability to actively adjust the wireless environment to meet strict performance indicators. Reconfigurable Intelligent Surface (RIS) technology is gaining attention for its advantages of low power consumption, cost-effectiveness, and ease of deployment. However, existing channel models for RIS often ignore important properties, such as the impairment in the RIS's switch component and the polarization efficiency among antennas, limiting their practical use. In this paper, we propose a new channel model for RIS that considers these ignored properties, including the reflected field, scattered field, and antenna resonant mode. We verify the proposed model through the practical implementation of a 4 x 4 RIS array with patch antennas in the 3.5 GHz band, using a phase shifter as the switch component of a RIS element. The equivalent model of the phase shifter is also formulated and incorporated into the channel model. We propose a blind controlling algorithm to discuss the properties of our channel model and emphasize the importance of considering polarization and tracking mechanisms for the controlling algorithm. Our channel model is an improvement over existing models and can be used in the practical design of RIS technology. The proposed algorithm provides a practical approach to controlling the wireless environment, suitable for various wireless applications.

Submitted 7 April, 2023; originally announced April 2023.

Comments: 14 pages, 12 figures, 1 table. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Showing 1–50 of 211 results for author: Wen, C