
KAGNNs: Kolmogorov-Arnold Networks meet Graph Learning

Authors: Roman Bresson, Giannis Nikolentzos, George Panagopoulos, Michail Chatzianastasis, Jun Pang, Michalis Vazirgiannis

Abstract: In recent years, Graph Neural Networks (GNNs) have become the de facto tool for learning node and graph representations. Most GNNs typically consist of a sequence of neighborhood aggregation (a.k.a., message passing) layers. Within each of these layers, the representation of each node is updated from an aggregation and transformation of its neighbours' representations at the previous layer. The upper bound for the expressive power of message passing GNNs was reached through the use of MLPs as a transformation, due to their universal approximation capabilities. However, MLPs suffer from well-known limitations, which recently motivated the introduction of Kolmogorov-Arnold Networks (KANs). KANs rely on the Kolmogorov-Arnold representation theorem, rendering them a promising alternative to MLPs. In this work, we compare the performance of KANs against that of MLPs in graph learning tasks. We perform extensive experiments on node classification, graph classification and graph regression datasets. Our preliminary results indicate that while KANs are on-par with MLPs in classification tasks, they seem to have a clear advantage in the graph regression tasks. Code is available at https://github.com/RomanBresson/KAGNN.

Submitted 1 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.14841 [pdf, other]

TabularMark: Watermarking Tabular Datasets for Machine Learning

Authors: Yihao Zheng, Haocheng Xia, Junyuan Pang, Jinfei Liu, Kui Ren, Lingyang Chu, Yang Cao, Li Xiong

Abstract: Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14558 [pdf, other]

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

Authors: Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang

Abstract: Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character object transportation through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates.
Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13243 [pdf, ps, other]

Abelian Group Codes for Classical and Classical-Quantum Channels: One-shot and Asymptotic Rate Bounds

Authors: James Chin-Jen Pang, Sandeep Pradhan, Hessam Mahdavifar

Abstract: We study the problem of transmission of information over classical and classical-quantum channels in the one-shot regime where the underlying codes are constrained to be group codes. In the achievability part, we introduce a new input probability distribution that incorporates the encoding homomorphism and the underlying channel law. Using a random coding argument, we characterize the performance of group codes in terms of hypothesis testing relative-entropic quantities. In the converse part, we establish bounds by leveraging a hypothesis testing-based approach. Furthermore, we apply the one-shot result to the asymptotic stationary memoryless setting, and establish a single-letter lower bound on group capacities for both classes of channels. Moreover, we derive a matching upper bound on the asymptotic group capacity.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 41 pages

arXiv:2406.09401 [pdf, other]

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Authors: Ruiyuan Lyu, Tai Wang, Jingli Lin, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang

Abstract: With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation.
Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Follow-up of EmbodiedScan. A multi-modal 3D dataset with the most-ever comprehensive language annotations for 3D-LLMs. Project page: https://tai-wang.github.io/mmscan/

arXiv:2406.08001 [pdf, other]

Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

Authors: Jiaxin Deng, Junbiao Pang, Baochang Zhang

Abstract: Sharpness-Aware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to the base optimizer (e.g., SGD). We propose Asymptotic Unbiased Sampling with respect to iterations to accelerate SAM (AUSAM), which maintains the model's generalization capacity while significantly enhancing computational efficiency. Concretely, we probabilistically sample a subset of data points beneficial for SAM optimization based on a theoretically guaranteed criterion, i.e., the Gradient Norm of each Sample (GNS). We further approximate the GNS by the difference in loss values before and after perturbation in SAM. As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks, i.e., classification, human pose estimation and network quantization. On CIFAR10/100 and Tiny-ImageNet, AUSAM achieves results comparable to SAM while providing a speedup of over 70%. Compared to recent dynamic data pruning methods, AUSAM is better suited for SAM and excels in maintaining performance. Additionally, AUSAM accelerates optimization in human pose estimation and model quantization without sacrificing performance, demonstrating its broad practicality.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.21070 [pdf, other]

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

Abstract: Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.

Submitted 14 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.11809 [pdf, other]

Distill-then-prune: An Efficient Compression Framework for Real-time Stereo Matching Network on Edge Devices

Authors: Baiyu Pan, Jichao Jiao, Jianxing Pang, Jun Cheng

Abstract: In recent years, numerous real-time stereo matching methods have been introduced, but they often lack accuracy. These methods attempt to improve accuracy by introducing new modules or integrating traditional methods. However, the improvements are only modest. In this paper, we propose a novel strategy by incorporating knowledge distillation and model pruning to overcome the inherent trade-off between speed and accuracy. As a result, we obtained a model that maintains real-time performance while delivering high accuracy on edge devices. Our proposed method involves three key steps. Firstly, we review state-of-the-art methods and design our lightweight model by removing redundant modules from those efficient models through a comparison of their contributions. Next, we leverage the efficient model as the teacher to distill knowledge into the lightweight model. Finally, we systematically prune the lightweight model to obtain the final model. Through extensive experiments conducted on two widely-used benchmarks, Sceneflow and KITTI, we perform ablation studies to analyze the effectiveness of each module and present our state-of-the-art results.

Submitted 20 May, 2024; originally announced May 2024.

Comments: International Conference on Robotics and Automation (ICRA) 2024

arXiv:2405.10625 [pdf, other]

Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Authors: Jiayun Pang, Ivan Vulić

Abstract: Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive while only marginal gains can be achieved from more sophisticated decoding algorithms.
In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.

Submitted 17 May, 2024; originally announced May 2024.

Comments: Preprint

arXiv:2405.10370 [pdf, other]

Grounded 3D-LLM with Referent Tokens

Authors: Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang

Abstract: Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io.

Submitted 16 May, 2024; originally announced May 2024.

Comments: Preprint

arXiv:2405.08458 [pdf, other]

Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Authors: Jin Wang, Bingfeng Zhang, Jian Pang, Honglong Chen, Weifeng Liu

Abstract: Few-shot segmentation remains challenging due to the limitations of its labeling information for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute the pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes since these high-level feature maps have obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance the model generalization. Specifically, we design two kinds of training-free prior information generation strategies that attempt to utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-5$^i$ and COCO-20$^i$ datasets show that our method obtains a substantial improvement and reaches the new state-of-the-art performance.

Submitted 14 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR 2024; The camera-ready version

arXiv:2405.06250 [pdf]

Robust field-free switching using large unconventional spin-orbit torque in an all-van der Waals heterostructure

Authors: Yiyang Zhang, Xiaolin Ren, Ruizi Liu, Zehan Chen, Xuezhao Wu, Jie Pang, Wei Wang, Guibin Lan, Kenji Watanabe, Takashi Taniguchi, Youguo Shi, Guoqiang Yu, Qiming Shao

Abstract: The emerging all-van der Waals (vdW) magnetic heterostructure provides a new platform to control the magnetization by the electric field beyond the traditional spintronics devices. One promising strategy is using unconventional spin-orbit torque (SOT) exerted by the out-of-plane polarized spin current to enable deterministic magnetization switching and enhance the switching efficiency. However, in all-vdW heterostructures, large unconventional SOT remains elusive and the robustness of the field-free switching against external magnetic field has not been examined, which hinders further applications. Here we demonstrate the field-free switching in an all-vdW heterostructure combining a type-II Weyl semimetal TaIrTe4 and above-room-temperature ferromagnet Fe3GaTe2. The fully field-free switching can be achieved at 2.56 x 10^10 A/m^2 at 300 K and a large SOT efficiency of the out-of-plane polarized spin current generated by TaIrTe4 is determined to be 0.37. Moreover, we find that the switching polarity cannot be changed until the external in-plane magnetic field reaches 252 mT, indicating a robust switching against the magnetic field. The numerical simulation suggests the large unconventional SOT reduces the switching current density and enhances the robustness of the switching. Our work shows that all-vdW heterostructures are promising candidates for future highly efficient and stable SOT-based devices.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.03466 [pdf]

A family of air-stable chalcogenide solid electrolytes in Li$_2$BMQ$_4$ (B = Ca, Sr and Ba; M = Si, Ge and Sn; Q = O, S and Se) systems

Authors: Huican Mao, Xiang Zhu, Guangmao Li, Jie Pang, Junfeng Hao, Liqi Wang, Hailong Yu, Youguo Shi, Fan Wu, Shilie Pan, Ruijuan Xiao, Hong Li, Liquan Chen

Abstract: Combining high-throughput first-principles calculations and experimental measurements, we have identified a novel family of fast lithium-ion chalcogenide conductors in Li$_2$BMQ$_4$ (2114, B = Ca, Sr and Ba; M = Si, Ge and Sn; Q = O, S and Se) systems. Our calculations demonstrate that most of the thermodynamically and kinetically stable sulfides and selenides in this new system exhibit ultralow Li$^+$ ion migration activation energy (0.16 eV ~ 0.56 eV) and considerable bandgaps varying between ~ 2 eV and 3.5 eV. We have successfully synthesized Li$_2$BaSnS$_4$ and Li$_2$SrSiS$_4$, and they exhibit excellent moisture stability through H$_2$S gas measurements. Electrochemical impedance measurements indicate 2114 systems show the typical features of solid ionic conductors, with a room-temperature Li$^+$ conductivity close to 5$\times$10$^{-4}$ mS/cm aligning with our molecular dynamics simulations. Furthermore, we have theoretically investigated the substitution of Cl$^-$ at S$^{2-}$ site. The doped compounds display significantly higher conductivity, with an increase of about three orders of magnitude (up to a maximum of 0.72 mS/cm) compared to the undoped compounds. These findings offer valuable insights for the further exploration of potential chalcogenide solid electrolyte materials with robust air stability and enhanced ionic conductivity for practical applications in lithium-ion batteries.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.00378 [pdf, other]

Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation

Authors: Hanyang Chi, Jian Pang, Bingfeng Zhang, Weifeng Liu

Abstract: Consistency learning is a central strategy to tackle unlabeled data in semi-supervised medical image segmentation (SSMIS), which enforces the model to produce consistent predictions under the perturbation. However, most current approaches solely focus on utilizing a specific single perturbation, which can only cope with limited cases, while employing multiple perturbations simultaneously makes it hard to guarantee the quality of consistency learning. In this paper, we propose an Adaptive Bidirectional Displacement (ABD) approach to solve the above challenge. Specifically, we first design a bidirectional patch displacement based on reliable prediction confidence for unlabeled data to generate new samples, which can effectively suppress uncontrollable regions and still retain the influence of input perturbations. Meanwhile, to enforce the model to learn the potentially uncontrollable content, a bidirectional displacement operation with inverse confidence is proposed for the labeled images, which generates samples with more unreliable information to facilitate model learning. Extensive experiments show that ABD achieves new state-of-the-art performances for SSMIS, significantly improving different baselines. Source code is available at https://github.com/chy-upc/ABD.

Submitted 1 May, 2024; originally announced May 2024.

Comments: Accepted to CVPR 2024

arXiv:2404.14405 [pdf, other]

Learning H-Infinity Locomotion Control

Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang

Abstract: Stable locomotion in precipitous environments is an essential task for quadruped robots, requiring the ability to resist various external disturbances. Recent neural policies enhance robustness against disturbances by learning to resist external forces sampled from a fixed distribution in the simulated environment. However, the force generation process doesn't consider the robot's current state, making it difficult to identify the most effective direction and magnitude that can push the robot to the most unstable but recoverable state. Thus, challenging cases in the buffer are insufficient to optimize robustness. In this paper, we propose to model the robust locomotion learning process as an adversarial interaction between the locomotion policy and a learnable disturbance that is conditioned on the robot state to generate appropriate external forces. To make the joint optimization stable, our novel $H_{\infty}$ constraint mandates the bound of the ratio between the cost and the intensity of the external forces. We verify the robustness of our approach in both simulated environments and real-world deployment, on quadrupedal locomotion tasks and a more challenging task where the quadruped performs locomotion merely on hind legs. Training and deployment code will be made public.

Submitted 12 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: Project Page: https://junfeng-long.github.io/HINF/

arXiv:2404.12702 [pdf, other]

Modeling Multi-Granularity Context Information Flow for Pavement Crack Detection

Authors: Junbiao Pang, Baocheng Xiong, Jiaqi Wu

Abstract: Crack detection has become an indispensable, interesting yet challenging task in the computer vision community. Specifically, pavement cracks have a highly complex spatial structure, a low-contrast background and a weak spatial continuity, posing a significant challenge to an effective crack detection method. In this paper, we address these problems from a view that utilizes contexts of the cracks and propose an end-to-end deep learning method to model the context information flow. To precisely localize cracks in an image, it is critical to effectively extract and aggregate multi-granularity context, including the fine-grained local context around the cracks (at the spatial level) and the coarse-grained semantics (at the segment level). Concretely, in a Convolutional Neural Network (CNN), low-level features extracted by the shallow layers represent the local information, while the deep layers extract the semantic features. Additionally, a second main insight in this work is that the semantic context should guide the local context features. Based on these insights, the proposed method first applies the dilated convolution as the backbone feature extractor to model local context, then builds a context guidance module to leverage semantic context to guide local feature extraction at multiple stages. To handle label alignment between stages, we apply the Multiple Instance Learning (MIL) strategy to align the high-level features to the low-level ones in the stage-wise context flow.
In addition, we release the Bitumen Pavement Crack (BPC) dataset, which is, to the best of our knowledge, the largest, most complex and most challenging among public crack datasets. The experimental results on the three crack datasets demonstrate that the proposed method performs well and outperforms the current state-of-the-art methods.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.11844 [pdf, ps, other]

Finding A Taxi with Illegal Driver Substitution Activity via Behavior Modelings

Authors: Junbiao Pang, Muhammad Ayub Sabir, Zhuyun Wang, An**g Hu, Xue Yang, Haitao Yu, Qingming Huang

Abstract: In our urban life, Illegal Driver Substitution (IDS) activity for a taxi is a grave unlawful activity in the taxi industry, possibly causing severe traffic accidents and painful social repercussions. Currently, the IDS activity is manually supervised by law enforcers, i.e., law enforcers empirically choose a taxi and inspect it. The pressing problem of this scheme is the dilemma between the limited number of law enforcers and the large volume of taxis. In this paper, motivated by this problem, we propose a computational method that helps law enforcers efficiently find the taxis which tend to have the IDS activity. Firstly, our method converts the identification of the IDS activity to a supervised learning task. Secondly, two kinds of taxi driver behaviors, i.e., the Sleeping Time and Location (STL) behavior and the Pick-Up (PU) behavior, are proposed. Thirdly, multiple scale pooling on self-similarity is proposed to encode the individual behaviors into universal features for all taxis. Finally, a Multiple Component-Multiple Instance Learning (MC-MIL) method is proposed to handle the deficiency of the behavior features and to align the behavior features simultaneously. Extensive experiments on a real-world data set show that the proposed behavior features have a good generalization ability across different classifiers, and the proposed MC-MIL method surpasses the baseline methods.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.10985 [pdf, ps, other]

Pixel-Wise Symbol Spotting via Progressive Points Location for Parsing CAD Images

Authors: Junbiao Pang, Zailin Dong, Jiaxin Deng, Mengyuan Zhu, Yunwei Zhang

Abstract: Parsing Computer-Aided Design (CAD) drawings is a fundamental step for CAD revision, semantic-based management, and the generation of 3D prototypes in both the architecture and engineering industries. Labeling symbols from a CAD drawing is a challenging yet notorious task from a practical point of view. In this work, we propose to label and spot symbols from CAD images that are converted from CAD drawings. The advantage of spotting symbols from CAD images lies in the low requirement of labelers and the low-cost annotation. However, pixel-wise spotting of symbols from CAD images is challenging. We propose a pixel-wise point location via Progressive Gaussian Kernels (PGK) to balance between training efficiency and location accuracy. Besides, we introduce a local offset to the heatmap-based point location method. Based on the keypoints detection, we propose a symbol grouping method to redraw the rectangle symbols in CAD images. We have released a dataset containing CAD images of equipment rooms from telecommunication industrial CAD drawings. Extensive experiments on this real-world dataset show that the proposed method has good generalization ability.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 10 pages, 10 figures, 6 tables

arXiv:2404.09460 [pdf, other]

Optimal Real-time Bidding Strategy For EV Aggregators in Wholesale Electricity Markets

Authors: Shihan Huang, Dongkun Han, John Zhen Fu Pang, Yue Chen

Abstract: With the rapid growth of electric vehicles (EVs), EV aggregators have been playing an increasingly vital role in power systems by not merely providing charging management but also participating in wholesale electricity markets. This work studies the optimal real-time bidding strategy for an EV aggregator. Since the charging process of EVs is time-coupled, it is necessary for EV aggregators to consider future operational conditions (e.g., future EV arrivals) when deciding the current bidding strategy. However, accurately forecasting future operational conditions is challenging under the inherent uncertainties. Hence, there is a need for a real-time bidding strategy based solely on the up-to-date information, which is the main goal of this work. We start by developing an online optimal EV charging management algorithm for the EV aggregator via Lyapunov optimization. Based on this, an optimal real-time bidding strategy (bidding cost curve and bounds) for the aggregator is derived. Then, an efficient yet practical algorithm is proposed to obtain the bidding strategy. It is shown that with the proposed bidding strategy, the aggregator's profit is nearly offline optimal. Moreover, the wholesale electricity market clearing result aligns with the individual aggregator's optimal charging strategy given the prices. Case studies against several benchmarks are conducted to evaluate the performance of the proposed method.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 13 pages, 6 figures

arXiv:2404.09294 [pdf, other]

doi 10.1103/PhysRevA.109.043324

Miscibility of Binary Bose-Einstein Condensates with $p$-wave Interaction

Authors: Min Deng, Ming Xue, **ghan Pang, Hui Luo, Zhiguo Wang, **bin Li, Dayou Yang

Abstract: We investigate the ground-state phase diagram of a binary mixture of Bose-Einstein condensates (BECs) with competing interspecies $s$- and $p$-wave interactions. Exploiting a pseudopotential model for the $l=1$ partial wave, we derive an extended Gross-Pitaevskii (GP) equation for the BEC mixture that incorporates both $s$- and $p$-wave interactions. Based on it, we study the miscible-immiscible transition of a binary BEC mixture in the presence of interspecies $p$-wave interaction, by combining numerical solution of the GP equation and Gaussian variational analysis. Our study uncovers a dual effect of positive interspecies $p$-wave interaction, which can either enhance or reduce miscibility and can be precisely controlled by adjusting relevant experimental parameters. By completely characterizing the miscibility phase diagram, we establish a promising avenue towards experimental control of the miscibility of binary BEC mixtures via high partial-wave interactions.

Submitted 14 April, 2024; originally announced April 2024.

Comments: 10+3 pages, 6 figures, Phys. Rev. A (2024)

Journal ref: Phys. Rev. A 109, 043324 (2024)

arXiv:2404.09248 [pdf, other]

Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

Authors: **g-Cheng Pang, Si-Hang Yang, Kaiyuan Li, Jiaji Zhang, Xiong-Hui Chen, Nan Tang, Yang Yu

Abstract: Reinforcement learning (RL) trains agents to accomplish complex tasks through environmental interaction data, but its capacity is also limited by the scope of the available data. To obtain a knowledgeable agent, a promising approach is to leverage the knowledge from large language models (LLMs). Despite previous studies combining LLMs with RL, seamless integration of the two components remains challenging due to their semantic gap. This paper introduces a novel method, Knowledgeable Agents from Language Model Rollouts (KALM), which extracts knowledge from LLMs in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods. The primary challenge of KALM lies in LLM grounding, as LLMs are inherently limited to textual data, whereas environmental data often comprise numerical vectors unseen to LLMs. To address this, KALM fine-tunes the LLM to perform various tasks based on environmental data, including bidirectional translation between natural language descriptions of skills and their corresponding rollout data. This grounding process enhances the LLM's comprehension of environmental dynamics, enabling it to generate diverse and meaningful imaginary rollouts that reflect novel skills. Initial empirical evaluations on the CLEVR-Robot environment demonstrate that KALM enables agents to complete complex rephrasings of task goals and extend their capabilities to novel tasks requiring unprecedented optimal behaviors.
KALM achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods. Furthermore, KALM effectively enables the LLM to comprehend environmental dynamics, resulting in the generation of meaningful imaginary rollouts that reflect novel skills and demonstrate the seamless integration of large language models and reinforcement learning.

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2404.00409 [pdf, other]

3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting

Authors: Xiaoyang Lyu, Yang-Tian Sun, Yi-Hua Huang, Xiuzhe Wu, Ziyi Yang, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

Abstract: In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR, that allows for accurate 3D reconstruction with intricate details while inheriting the high efficiency and rendering quality of 3DGS. The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized. First, we introduce a differentiable SDF-to-opacity transformation function that converts SDF values into corresponding Gaussians' opacities. This function connects the SDF and 3D Gaussians, allowing for unified optimization and enforcing surface constraints on the 3D Gaussians. During learning, optimizing the 3D Gaussians provides supervisory signals for SDF learning, enabling the reconstruction of intricate details. However, this only provides sparse supervisory signals to the SDF at locations occupied by Gaussians, which is insufficient for learning a continuous SDF. Second, to address this limitation, we incorporate volumetric rendering and align the rendered geometric attributes (depth, normal) with those derived from 3D Gaussians. This consistency regularization introduces supervisory signals to locations not covered by discrete 3D Gaussians, effectively eliminating redundant surfaces outside the Gaussian sampling range. Our extensive experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
Besides, our method competes favorably with leading surface reconstruction techniques while offering a more efficient learning process and much better rendering quality. The code will be available at https://github.com/CVMI-Lab/3DGSR.

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.19289 [pdf, other]

Uplift Modeling Under Limited Supervision

Authors: George Panagopoulos, Daniele Malitesta, Fragkiskos D. Malliaros, Jun Pang

Abstract: Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. The framework is flexible since each step can be used separately with other models or treatment policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random, underlining the need for models that can generalize with limited supervision to reduce experimental risks.

Submitted 7 June, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.18407 [pdf, other]

A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification

Authors: Jiaqi Wu, Junbiao Pang, Baochang Zhang, Qingming Huang

Abstract: Semi-supervised learning (SSL) is a practical challenge in computer vision. Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, obtain State-Of-The-Art (SOTA) performance in SSL. These approaches employ a threshold-to-pseudo-label (T2L) process to generate PLs by truncating the confidence scores of unlabeled data predicted by the self-training method. However, self-trained models typically yield biased and high-variance predictions, especially in scenarios where little labeled data is supplied. To address this issue, we propose a lightweight channel-based ensemble method to effectively consolidate multiple inferior PLs into a theoretically guaranteed unbiased and low-variance one. Importantly, our approach can be readily extended to any SSL framework, such as FixMatch or FreeMatch. Experimental results demonstrate that our method significantly outperforms state-of-the-art techniques on CIFAR10/100 in terms of effectiveness and efficiency.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.18259 [pdf, other]

RoboKeyGen: Robot Pose and Joint Angles Estimation via Diffusion-based 3D Keypoint Generation

Authors: Yang Tian, Jiyao Zhang, Guowei Huang, Bin Wang, ** Wang, Jiangmiao Pang, Hao Dong

Abstract: Estimating robot pose and joint angles is significant in advanced robotics, enabling applications like robot collaboration and online hand-eye calibration. However, the introduction of unknown joint angles makes prediction more complex than simple robot pose estimation, due to its higher dimensionality. Previous methods either regress 3D keypoints directly or utilise a render&compare strategy. These approaches often falter in terms of performance or efficiency and grapple with the cross-camera gap problem. This paper presents a novel framework that bifurcates the high-dimensional prediction task into two manageable subtasks: 2D keypoints detection and lifting 2D keypoints to 3D. This separation promises enhanced performance without sacrificing the efficiency innate to keypoint-based techniques. A vital component of our method is the lifting of 2D keypoints to 3D keypoints. Common deterministic regression methods may falter when faced with uncertainties from 2D detection errors or self-occlusions. Leveraging the robust modeling potential of diffusion models, we reframe this issue as a conditional 3D keypoints generation task.
To bolster cross-camera adaptability, we introduce the Normalised Camera Coordinate Space (NCCS), ensuring alignment of estimated 2D keypoints across varying camera intrinsics. Experimental results demonstrate that the proposed method outperforms the state-of-the-art render&compare method and achieves higher inference speed. Furthermore, the tests accentuate our method's robust cross-camera generalisation capabilities. We intend to release both the dataset and code at https://nimolty.github.io/Robokeygen/

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted by ICRA 2024

arXiv:2403.17367 [pdf, other]

RoboDuet: A Framework Affording Mobile-Manipulation and Cross-Embodiment

Authors: Guo** Pan, Qingwei Ben, Zhecheng Yuan, Guangqi Jiang, Yandong Ji, Jiangmiao Pang, Houde Liu, Huazhe Xu

Abstract: Combining the mobility of legged robots with the manipulation skills of arms has the potential to significantly expand the operational range and enhance the capabilities of robotic systems in performing various mobile manipulation tasks. Existing approaches are confined to imprecise six degrees of freedom (DoF) manipulation and possess a limited arm workspace. In this paper, we propose a novel framework, RoboDuet, which employs two collaborative policies to realize locomotion and manipulation simultaneously, achieving whole-body control through interactions between each other. Surprisingly, going beyond the large-range pose tracking, we find that the two-policy framework may enable cross-embodiment deployment such as using different quadrupedal robots or other arms. Our experiments demonstrate that the policies trained through RoboDuet can accomplish stable gaits, agile 6D end-effector pose tracking, and zero-shot exchange of legged robots, and can be deployed in the real world to perform various mobile manipulation tasks. Our project page with demo videos is at https://locomanip-duet.github.io

Submitted 13 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.08821 [pdf, other]

Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness aware Minimization

Authors: Jiaxin Deng, Junbiao Pang, Baochang Zhang, Tian Wang

Abstract: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieved state-of-the-art accuracies comparable to SAM on diverse network architectures.

Submitted 24 February, 2024; originally announced March 2024.

arXiv:2402.18445 [pdf, other]

HyperFedNet: Communication-Efficient Personalized Federated Learning Via Hypernetwork

Authors: Xingyun Chen, Yan Huang, Zhenzhen Xie, Junjie Pang

Abstract: In response to the challenges posed by non-independent and identically distributed (non-IID) data and the escalating threat of privacy attacks in Federated Learning (FL), we introduce HyperFedNet (HFN), a novel architecture that incorporates hypernetworks to revolutionize parameter aggregation and transmission in FL. Traditional FL approaches, characterized by the transmission of extensive parameters, not only incur significant communication overhead but also present vulnerabilities to privacy breaches through gradient analysis. HFN addresses these issues by transmitting a concise set of hypernetwork parameters, thereby reducing communication costs and enhancing privacy protection. Upon deployment, the HFN algorithm enables the dynamic generation of parameters for the basic layer of the FL main network, utilizing local database features quantified by embedding vectors as input. Through extensive experimentation, HFN demonstrates superior performance in reducing communication overhead and improving model accuracy compared to conventional FL methods. By integrating the HFN algorithm into the FL framework, HFN offers a solution to the challenges of non-IID data and privacy threats.

Submitted 2 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.16174 [pdf, other]

GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction

Authors: Xiao Chen, Quanyi Li, Tai Wang, Tianfan Xue, Jiangmiao Pang

Abstract: While recent advances in neural radiance fields enable realistic digitization for large-scale scenes, the image-capturing process is still time-consuming and labor-intensive. Previous works attempt to automate this process using the Next-Best-View (NBV) policy for active 3D reconstruction. However, the existing NBV policies heavily rely on hand-crafted criteria, limited action space, or per-scene optimized representations. These constraints limit their cross-dataset generalizability. To overcome them, we propose GenNBV, an end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning (RL)-based framework and extends typical limited action space to 5D free space. It empowers our agent drone to scan from any viewpoint, and even interact with unseen geometries during training. To boost the cross-dataset generalizability, we also propose a novel multi-source state embedding, including geometric, semantic, and action representations. We establish a benchmark using the Isaac Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV policy. Experiments demonstrate that our policy achieves a 98.26% and 97.12% coverage ratio on unseen building-scale objects from these datasets, respectively, outperforming prior solutions.

Submitted 15 June, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: CVPR 2024. Project page: http://gennbv.github.io/

arXiv:2402.15895 [pdf, other]

Multi-Object Tracking by Hierarchical Visual Representations

Authors: **kun Cao, Jiangmiao Pang, Kris Kitani

Abstract: We propose a new visual hierarchical representation paradigm for multi-object tracking. It is more effective to discriminate between objects by attending to objects' compositional visual regions and contrasting with the background contextual information instead of sticking to only the semantic visual cue such as bounding boxes. This compositional-semantic-contextual hierarchy is flexible enough to be integrated into different appearance-based multi-object tracking methods. We also propose an attention-based visual feature module to fuse the hierarchical visual representations. The proposed method achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.

Submitted 24 February, 2024; originally announced February 2024.

Comments: 6 pages, 3 figures, 10 tables, accepted by ICRA 2024

arXiv:2402.13497 [pdf, other]

Push Quantization-Aware Training Toward Full Precision Performances via Consistency Regularization

Authors: Junbiao Pang, Tianyang Cai, Baochang Zhang, Jiaqi Wu, Ye Tao

Abstract: Existing Quantization-Aware Training (QAT) methods depend heavily on the complete labeled dataset or knowledge distillation to guarantee performance close to Full Precision (FP) accuracy. However, empirical results show that QAT still has inferior results compared to its FP counterpart. One question is how to push QAT toward, or even beyond, FP performance. In this paper, we address this issue from a new perspective by injecting the vicinal data distribution information to improve the generalization performance of QAT effectively. We present a simple, novel, yet powerful method introducing a Consistency Regularization (CR) for QAT. Concretely, CR assumes that augmented samples should be consistent in the latent feature space. Our method generalizes well to different network architectures and various QAT methods. Extensive experiments demonstrate that our approach significantly outperforms the current state-of-the-art QAT methods and even FP counterparts.

Submitted 20 February, 2024; originally announced February 2024.

Comments: 11 pages, 5 figures

arXiv:2402.12985 [pdf, other]

Lüscher equation with long-range forces

Authors: Rishabh Bubna, Hans-Werner Hammer, Fabian Müller, **-Yi Pang, Akaki Rusetsky, Jia-Jun Wu

Abstract: We derive the modified Lüscher equation in the presence of the long-range force caused by the exchange of a light particle. It is shown that the use of this equation enables one to circumvent the problems related to the strong partial-wave mixing and the t-channel sub-threshold singularities. It is also demonstrated that the present method is intrinsically linked to the so-called modified effective-range expansion (MERE) in the infinite volume. A detailed comparison with the two recently proposed alternative approaches is provided.

Submitted 24 April, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 28 pages, 4 figures

arXiv:2402.12789 [pdf, other]

Fairness Without Harm: An Influence-Guided Active Sampling Approach

Authors: **long Pang, Jialu Wang, Zhaowei Zhu, Yuanshun Yao, Chen Qian, Yang Liu

Abstract: The pursuit of fairness in machine learning (ML), ensuring that the models do not exhibit biases toward protected demographic groups, typically results in a compromise scenario. This compromise can be explained by a Pareto frontier where given certain resources (e.g., data), reducing the fairness violations often comes at the cost of lowering the model accuracy. In this work, we aim to train models that mitigate group fairness disparity without causing harm to model accuracy. Intuitively, acquiring more data is a natural and promising approach to achieve this goal by reaching a better Pareto frontier of the fairness-accuracy tradeoff. The current data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes. However, these sensitive attribute annotations should be protected due to privacy and safety concerns. In this paper, we propose a tractable active data sampling algorithm that does not rely on training group annotations, instead only requiring group annotations on a small validation set. Specifically, the algorithm first scores each new example by its influence on fairness and accuracy evaluated on the validation dataset, and then selects a certain number of examples for training. We theoretically analyze how acquiring more data can improve fairness without causing harm, and validate the possibility of our sampling approach in the context of risk disparity. We also provide the upper bound of generalization error and risk disparity as well as the corresponding connections.
Extensive experiments on real-world data demonstrate the effectiveness of our proposed algorithm.

Submitted 31 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.12238 [pdf, other]

Mixed Gaussian Flow for Diverse Trajectory Prediction

Authors: Jiahe Chen, Jinkun Cao, Dahua Lin, Kris Kitani, Jiangmiao Pang

Abstract: Existing trajectory prediction studies intensively leverage generative models. Normalizing flow is one of the genres with the advantage of being invertible to derive the probability density of predicted trajectories. However, mapping from a standard Gaussian by a flow-based model hurts the capacity to capture complicated patterns of trajectories, ignoring the under-represented motion intentions in the training data. To solve the problem, we propose a flow-based model to transform a mixed Gaussian prior into the future trajectory manifold. The model shows a better capacity for generating diverse trajectory patterns. Also, by associating each sub-Gaussian with a certain subspace of trajectories, we can generate future trajectories with controllable motion intentions. In such a fashion, the flow-based model is not encouraged to simply seek the most likelihood of the intended manifold anymore but a family of controlled manifolds with explicit interpretability. Our proposed method is demonstrated to show state-of-the-art performance in the quantitative evaluation of sampling well-aligned trajectories in top-M generated candidates. We also demonstrate that it can generate diverse, controllable, and out-of-distribution trajectories. Code is available at https://github.com/mulplue/MGF.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.07616 [pdf, other]

Anchor-based Large Language Models

Authors: Jianhui Pang, Fanghua Ye, Derek Fai Wong, Xin He, Wanshun Chen, Longyue Wang

Abstract: Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which utilize an innovative anchor-based self-attention network (AnSAN) and also an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and enhancing inference efficiency. Experiments on question-answering benchmarks reveal that AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, the substantial enhancements of AnLLMs employing the AnSAN technique in resource utilization and computational efficiency underscore their potential for practical LLM applications.

Submitted 1 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: The paper has been accepted by the ACL 2024 conference. Work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab.

arXiv:2402.07243 [pdf, other]

PIVOT-Net: Heterogeneous Point-Voxel-Tree-based Framework for Point Cloud Compression

Authors: Jiahao Pang, Kevin Bui, Dong Tian

Abstract: The universality of the point cloud format enables many 3D applications, making the compression of point clouds a critical phase in practice. Sampled as discrete 3D points, a point cloud approximates 2D surface(s) embedded in 3D with a finite bit-depth. However, the point distribution of a practical point cloud changes drastically as its bit-depth increases, requiring different methodologies for effective consumption/analysis. In this regard, a heterogeneous point cloud compression (PCC) framework is proposed. We unify typical point cloud representations -- point-based, voxel-based, and tree-based representations -- and their associated backbones under a learning-based framework to compress an input point cloud at different bit-depth levels. Having recognized the importance of voxel-domain processing, we augment the framework with a proposed context-aware upsampling for decoding and an enhanced voxel transformer for feature aggregation. Extensive experimentation demonstrates the state-of-the-art performance of our proposal on a wide range of point clouds.

Submitted 11 February, 2024; originally announced February 2024.

Comments: Accepted at 3DV 2024

arXiv:2402.03719 [pdf, other]

Empowering Language Models with Active Inquiry for Deeper Understanding

Authors: Jing-Cheng Pang, Heng-Bo Fan, Pengyuan Wang, Jia-Hao Xiao, Nan Tang, Si-Hang Yang, Chengxing Jia, Sheng-Jun Huang, Yang Yu

Abstract: The rise of large language models (LLMs) has revolutionized the way that we interact with artificial intelligence systems through natural language. However, LLMs often misinterpret user queries because of their uncertain intention, leading to less helpful responses. In natural human interactions, clarification is sought through targeted questioning to uncover obscure information. Thus, in this paper, we introduce LaMAI (Language Model with Active Inquiry), designed to endow LLMs with this same level of interactive engagement. LaMAI leverages active learning techniques to raise the most informative questions, fostering a dynamic bidirectional dialogue. This approach not only narrows the contextual gap but also refines the output of the LLMs, aligning it more closely with user expectations. Our empirical studies, across a variety of complex datasets where LLMs have limited conversational context, demonstrate the effectiveness of LaMAI. The method improves answer accuracy from 31.9% to 50.9%, outperforming other leading question-answering frameworks. Moreover, in scenarios involving human participants, LaMAI consistently generates responses that are superior or comparable to baseline methods in more than 82% of the cases. The applicability of LaMAI is further evidenced by its successful integration with various LLMs, highlighting its potential for the future of interactive language models.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.03672 [pdf, other]

The spin alignment of rho mesons in a pion gas

Authors: Yi-Liang Yin, Wen-Bo Dong, Jin-Yi Pang, Shi Pu, Qun Wang

Abstract: We study the spin alignment of neutral rho mesons in a pion gas using spin kinetic or Boltzmann equations. The $ρππ$ coupling is given by the chiral effective theory. The collision terms at the leading and next-to-leading order in spin Boltzmann equations are derived. The evolution of the spin density matrix of the neutral rho meson is simulated with different initial conditions. The numerical results show that the interaction of pions and neutral rho mesons creates very small spin alignment in the central rapidity region if there is no rho meson in the system at the initial time. Such a small spin alignment in the central rapidity region will decay rapidly toward zero in later time. If there are rho mesons with a sizable spin alignment at the initial time the spin alignment will also decrease rapidly. We also considered the effect on $ρ_{00}$ from the elliptic flow of pions in the blast wave model. With vanishing spin alignment at the initial time, the deviation of $ρ_{00}$ from 1/3 is positive but very small.

Submitted 5 February, 2024; originally announced February 2024.

Comments: RevTex 4, 17 pages, 12 figures

arXiv:2401.12794 [pdf, other]

Benchmarking LLMs via Uncertainty Quantification

Authors: Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu

Abstract: The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves eight LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.

Submitted 25 April, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: 25 pages, preprints

arXiv:2401.11386 [pdf]

Exploring Intrinsic Magnetic Topological Insulators: The Case of EuIn$_2$As$_2$

Authors: Hao Liu, Qi-Yi Wu, Chen Zhang, Jie Pang, Bo Chen, Jiao-Jiao Song, Yu-Xia Duan, Ya-Hua Yuan, Hai-Yun Liu, Chuan-Cun Shu, Yuan-Feng Xu, You-Guo Shi, Jian-Qiao Meng

Abstract: In this study, ultrafast optical spectroscopy was employed to elucidate the intricate topological features of EuIn$_2$As$_2$, a promising candidate for a magnetic topological-crystalline axion insulator. Our investigation, focusing on the real-time evolution of topological states, unveiled a narrow surface magnetic gap (2$Δ_0$ $\simeq$ 8.2 meV) emerging at the antiferromagnetic transition temperature ($T_N$ $\approx$ 16 K). Below $T_N$, two extremely low-energy collective modes, $ω_1$ and $ω_2$, with frequencies of $\sim$9.9 and 21.6 GHz at $T$ = 4 K, respectively, were observed, exhibiting strong temperature dependence. $ω_1$ correlates with an acoustic phonon, while $ω_2$ is associated with a magnon. The results suggest that EuIn$_2$As$_2$ has the potential to manifest a magnetic topological-crystalline axion insulator, presenting a small magnetic energy gap on the (001) surface. The findings further our understanding of the interplay between magnetism and topology in this material, showcasing its potential for applications in quantum information processing and spintronics.

Submitted 20 January, 2024; originally announced January 2024.

Comments: 6 pages, 4 figures

arXiv:2401.08350 [pdf, other]

Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models

Authors: Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, Zhaopeng Tu

Abstract: The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings indicate that LLMs effectively lessen the reliance on parallel data for major languages in the pretraining phase. Additionally, the LLM-based translation system significantly enhances the translation of long sentences that contain approximately 80 words and shows the capability to translate documents of up to 512 words. However, despite these significant improvements, the challenges of domain mismatch and prediction of rare words persist. While the challenges of word alignment and beam search, specifically associated with NMT, may not apply to LLMs, we identify three new challenges for LLMs in translation tasks: inference efficiency, translation of low-resource languages in the pretraining phase, and human-aligned evaluation. The datasets and models are released at https://github.com/pangjh3/LLM4MT.

Submitted 17 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: 17 pages. Longyue Wang is the Corresponding Author

arXiv:2401.01565 [pdf, ps, other]

Classification and Treatment Learning with Constraints via Composite Heaviside Optimization: a Progressive MIP Method

Authors: Yue Fang, Junyi Liu, Jong-Shi Pang

Abstract: This paper proposes a Heaviside composite optimization approach and presents a progressive (mixed) integer programming (PIP) method for solving multi-class classification and multi-action treatment problems with constraints. A Heaviside composite function is a composite of a Heaviside function (i.e., the indicator function of either the open $( \, 0,\infty )$ or closed $[ \, 0,\infty \, )$ interval) with a possibly nondifferentiable function. Modeling-wise, we show how Heaviside composite optimization provides a unified formulation for learning the optimal multi-class classification and multi-action treatment rules, subject to rule-dependent constraints stipulating a variety of domain restrictions. A Heaviside composite function has an equivalent discrete formulation, and the resulting optimization problem can in principle be solved by integer programming (IP) methods. Nevertheless, for constrained learning problems with large data sets, a straightforward application of off-the-shelf IP solvers is usually ineffective in achieving global optimality. To alleviate such a computational burden, our major contribution is the proposal of the PIP method by leveraging the effectiveness of state-of-the-art IP solvers for problems of modest sizes. We provide the theoretical advantage of the PIP method with the connection to continuous optimization and show that the computed solution is locally optimal for a broad class of Heaviside composite optimization problems. The numerical performance of the PIP method is demonstrated by extensive computational experimentation.

Submitted 4 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

ACM Class: G.1.6

arXiv:2312.16170 [pdf, other]

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Authors: Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang

Abstract: In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

Submitted 26 December, 2023; originally announced December 2023.

Comments: A multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. Project page: http://tai-wang.github.io/embodiedscan

arXiv:2312.11460 [pdf, other]

Hybrid Internal Model: Learning Agile Legged Locomotion with Simulated Robot Response

Authors: Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang

Abstract: Robust locomotion control depends on accurate state estimations. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain frictions and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introduce Hybrid Internal Model (HIM) to estimate them according to the response of the robot. The response, which we refer to as the hybrid internal embedding, contains the robot's explicit velocity and implicit stability representation, corresponding to two primary goals for locomotion tasks: explicitly tracking velocity and implicitly maintaining stability. We use contrastive learning to optimize the embedding to be close to the robot's successor state, in which the response is naturally embedded. HIM has several appealing benefits: It only needs the robot's proprioceptions, i.e., those from joint encoders and IMU as observations. It innovatively maintains consistent observations between simulation reference and reality that avoids information loss in mimicking learning. It exploits batch-level information that is more robust to noises and keeps better sample efficiency. It only requires 1 hour of training on an RTX 4090 to enable a quadruped robot to traverse any terrain under any disturbances. A wealth of real-world experiments demonstrates its agility, even in high-difficulty tasks and cases never occurred during the training process, revealing remarkable open-world generalizability.

Submitted 1 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Use 1 hour to train a quadruped robot capable of traversing any terrain under any disturbances in the open world, Project Page: https://github.com/OpenRobotLab/HIMLoco

arXiv:2312.04391 [pdf, other]

Lellouch-Lüscher factor for the $K\to 3π$ decays

Authors: Jin-Yi Pang, Rishabh Bubna, Fabian Müller, Akaki Rusetsky, Jia-Jun Wu

Abstract: We derive an explicit expression for the Lellouch-Lüscher (LL) factor in the $K\to 3π$ decays at leading order (without derivative couplings). Several important technical details are addressed, like a proper decomposition into the isospin amplitudes, the choice of a minimal set of effective couplings and the renormalization, as well as the algorithm for the solution of the pertinent Faddeev equations in the infinite volume which is based on the contour deformation method. Most importantly, our numerical results demonstrate that the three-body force contributes very little to the LL factor. This result paves the way for the study of the $K\to 3π$ decays on the lattice.

Submitted 26 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: 41 pages, 14 figures

arXiv:2312.00423 [pdf]

Unravelling spontaneous Bloch-type skyrmion in centrosymmetric two-dimensional magnets

Authors: Jingman Pang, Xiaohang Niu, Hong Jian Zhao, Yun Zhang, Laurent Bellaiche

Abstract: The realization of magnetic skyrmions in two-dimensional (2D) magnets holds great promise for both fundamental research and device applications. Despite recent progress, two-dimensional skyrmion hosts are still limited, due to the fact that most 2D magnets are centrosymmetric and thus lack Dzyaloshinskii-Moriya interaction (DMI). We show here, using a general analysis based on symmetry, that Bloch-type skyrmions can, in fact, be stabilized in 2D magnets, due to the interplay between in-plane component (dx) of second nearest-neighbor DMI and magnetic anisotropy. Its validity is demonstrated in the Cr2Ge2Te6 monolayer, which is also verified by recent experiments. Our work gives a clear direction for experimental studies of 2D magnetic materials to stabilize skyrmions and should greatly enrich the research on magnetic skyrmions in 2D lattices.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00335 [pdf, other]

Learning Anatomically Consistent Embedding for Chest Radiography

Authors: Ziyu Zhou, Haozhe Luo, Jiaxuan Pang, Xiaowei Ding, Michael Gotway, Jianming Liang

Abstract: Self-supervised learning (SSL) approaches have recently shown substantial success in learning visual representations from unannotated images. Compared with photographic images, medical images acquired with the same imaging protocol exhibit high consistency in anatomy. To exploit this anatomical consistency, this paper introduces a novel SSL approach, called PEAC (patch embedding of anatomical consistency), for medical image analysis. Specifically, in this paper, we propose to learn global and local consistencies via stable grid-based matching, transfer pre-trained PEAC models to diverse downstream tasks, and extensively demonstrate that (1) PEAC achieves significantly better performance than the existing state-of-the-art fully/self-supervised methods, and (2) PEAC captures the anatomical structure consistency across views of the same patient and across patients of different genders, weights, and healthy statuses, which enhances the interpretability of our method for medical image analysis.

Submitted 11 June, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

Comments: BMVC 2023, oral

arXiv:2311.01782 [pdf, other]

Generating Unbiased Pseudo-labels via a Theoretically Guaranteed Chebyshev Constraint to Unify Semi-supervised Classification and Regression

Authors: Jiaqi Wu, Junbiao Pang, Qingming Huang

Abstract: Both semi-supervised classification and regression are practically challenging tasks for computer vision. However, semi-supervised classification methods are barely applied to regression tasks. Because the threshold-to-pseudo label process (T2L) in classification uses confidence to determine the quality of label. It is successful for classification tasks but inefficient for regression tasks. In nature, regression also requires unbiased methods to generate high-quality labels. On the other hand, T2L for classification often fails if the confidence is generated by a biased method. To address this issue, in this paper, we propose a theoretically guaranteed constraint for generating unbiased labels based on Chebyshev's inequality, combining multiple predictions to generate superior quality labels from several inferior ones. In terms of high-quality labels, the unbiased method naturally avoids the drawback of T2L. Specially, we propose an Unbiased Pseudo-labels network (UBPL network) with multiple branches to combine multiple predictions as pseudo-labels, where a Feature Decorrelation loss (FD loss) is proposed based on Chebyshev constraint. In principle, our method can be used for both classification and regression and can be easily extended to any semi-supervised framework, e.g. Mean Teacher, FixMatch, DualPose. Our approach achieves superior performance over SOTAs on the pose estimation datasets Mouse, FLIC and LSP, as well as the classification datasets CIFAR10/100 and SVHN.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2311.01770 [pdf, other]

Modeling the Uncertainty with Maximum Discrepant Students for Semi-supervised 2D Pose Estimation

Authors: Jiaqi Wu, Junbiao Pang, Qingming Huang

Abstract: Semi-supervised pose estimation is a practically challenging task for computer vision. Although numerous excellent semi-supervised classification methods have emerged, these methods typically use confidence to evaluate the quality of pseudo-labels, which is difficult to achieve in pose estimation tasks. For example, in pose estimation, confidence represents only the possibility that a position of the heatmap is a keypoint, not the quality of that prediction. In this paper, we propose a simple yet efficient framework to estimate the quality of pseudo-labels in semi-supervised pose estimation tasks from the perspective of modeling the uncertainty of the pseudo-labels. Concretely, under the dual mean-teacher framework, we construct the two maximum discrepant students (MDSs) to effectively push two teachers to generate different decision boundaries for the same sample. Moreover, we create multiple uncertainties to assess the quality of the pseudo-labels. Experimental results demonstrate that our method improves the performance of semi-supervised pose estimation on three datasets.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2310.09507 [pdf, other]

doi: 10.1007/978-3-031-43907-0_62

Foundation Ark: Accruing and Reusing Knowledge for Superior and Robust Performance

Authors: DongAo Ma, Jiaxuan Pang, Michael B. Gotway, Jianming Liang

Abstract: Deep learning nowadays offers expert-level and sometimes even super-expert-level performance, but achieving such performance demands massive annotated data for training (e.g., Google's proprietary CXR Foundation Model (CXR-FM) was trained on 821,544 labeled and mostly private chest X-rays (CXRs)). Numerous datasets are publicly available in medical imaging but individually small and heterogeneous in expert labels. We envision a powerful and robust foundation model that can be trained by aggregating numerous small public datasets. To realize this vision, we have developed Ark, a framework that accrues and reuses knowledge from heterogeneous expert annotations in various datasets. As a proof of concept, we have trained two Ark models on 335,484 and 704,363 CXRs, respectively, by merging several datasets including ChestX-ray14, CheXpert, MIMIC-II, and VinDr-CXR, evaluated them on a wide range of imaging tasks covering both classification and segmentation via fine-tuning, linear-probing, and gender-bias analysis, and demonstrated our Ark's superior and robust performance over the SOTA fully/self-supervised baselines and Google's proprietary CXR-FM. This enhanced performance is attributed to our simple yet powerful observation that aggregating numerous public datasets diversifies patient populations and accrues knowledge from diverse experts, yielding unprecedented performance yet saving annotation cost. With all codes and pretrained models released at GitHub.com/JLiangLab/Ark, we hope that Ark exerts an important impact on open science, as accruing and reusing knowledge from expert annotations in public datasets can potentially surpass the performance of proprietary models trained on unusually large data, inspiring many more researchers worldwide to share codes and datasets to build open foundation models, accelerate open science, and democratize deep learning for medical imaging.

Submitted 14 October, 2023; originally announced October 2023.

Comments: Best Paper Award Runner-Up at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2023

Showing 1–50 of 296 results for author: Pang, J