Search | arXiv e-print repository

Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming

Authors: Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu

Abstract: Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. Howev… ▽ More Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer significantly impacts the efficiency of MILP solvers as well. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, (2) and a lower-level module -- that formulates the cut selection as a sequence/set to sequence learning problem -- to learn policies selecting an ordered subset with the cardinality determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that well tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two Huawei's real problems. △ Less

Submitted 19 April, 2024; originally announced April 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2302.00244

arXiv:2404.09814 [pdf, other]

A Novel HARQ-CC Assisted SCMA Scheme

Authors: Man Wang, Zheng Shi, Yunfei Li, Xianda Wu, Weiqiang Tan, Xinrong Ye

Abstract: This letter proposes a novel hybrid automatic repeat request with chase combining assisted sparse code multiple access (HARQ-CC-SCMA) scheme. Depending on whether the same superimposed packet are retransmitted, synchronous and asynchronous modes are considered for retransmissions. Moreover, factor graph aggregation (FGA) and Log-likelihood ratio combination (LLRC) are proposed for multi-user detec… ▽ More This letter proposes a novel hybrid automatic repeat request with chase combining assisted sparse code multiple access (HARQ-CC-SCMA) scheme. Depending on whether the same superimposed packet are retransmitted, synchronous and asynchronous modes are considered for retransmissions. Moreover, factor graph aggregation (FGA) and Log-likelihood ratio combination (LLRC) are proposed for multi-user detection. Regarding FGA, a large-scale factor graph is constructed by combining all the received superimposed signals and message passing algorithm (MAP) is applied to calculate log-likelihood ratio (LLR). Whereas, owing to the same unsuccessful messages required to be retransmitted, LLRC adds up LLRs of erroneously received packets in previous HARQ rounds together with currently received packets for channel decoding and saves the LLRs for failed users. Finally, Monte Carlo simulations are preformed to show that FGA surpasses LLRC and HARQ with incremental redundancy (HARQ-IR) in synchronous mode. However, LLRC performs better than FGA at low signal-to-noise ratio (SNR) in asynchronous mode. This is because failed messages after the maximum allowable HARQ rounds in this mode can yield significant error propagation in low SNR regime. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.08211 [pdf, other]

Fractal spectrum in twisted bilayer optical lattice

Authors: Xu-Tao Wan, Chao Gao, Zhe-Yu Shi

Abstract: The translation symmetry of a lattice is greatly modified when subjected to a perpendicular magnetic field [Zak, Phys. Rev. \textbf{134}, A1602 (1964)]. This change in symmetry can lead to magnetic unit cells that are substantially larger than the original ones. Similarly, the translation properties of a double-layered lattice alters drastically while two monolayers are relatively twisted by a sma… ▽ More The translation symmetry of a lattice is greatly modified when subjected to a perpendicular magnetic field [Zak, Phys. Rev. \textbf{134}, A1602 (1964)]. This change in symmetry can lead to magnetic unit cells that are substantially larger than the original ones. Similarly, the translation properties of a double-layered lattice alters drastically while two monolayers are relatively twisted by a small angle, resulting in large-scale moiré unit cells. Intrigued by the resemblance, we calculate the complete band structures of a twisted bilayer optical lattice and show that the geometric moiré effect can induce fractal band structures. The fractals are controlled by the twist angle between two monolayers and are closely connected to the celebrated butterfly spectrum of two-dimensional Bloch electrons in a magnetic field [Hofstadter, Phys. Rev. B \textbf{14}, 2239 (1976)]. We demonstrate this by proving that the twisted bilayer optical lattice can be mapped to a generalized Hofstadter's model with long-range hop**. Furthermore, we provide numerical evidence on the infinite recursive structures of the spectrum and give an algorithm for computing these structures. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 6 pages, 3 figures. Comments are welcome!

arXiv:2404.07956 [pdf, other]

Lyapunov-stable Neural Control for State and Output Feedback: A Novel Formulation

Authors: Lujie Yang, Hongkai Dai, Zhouxing Shi, Cho-Jui Hsieh, Russ Tedrake, Huan Zhang

Abstract: Learning-based neural network (NN) control policies have shown impressive empirical performance in a wide range of tasks in robotics and control. However, formal (Lyapunov) stability guarantees over the region-of-attraction (ROA) for NN controllers with nonlinear dynamical systems are challenging to obtain, and most existing approaches rely on expensive solvers such as sums-of-squares (SOS), mixed… ▽ More Learning-based neural network (NN) control policies have shown impressive empirical performance in a wide range of tasks in robotics and control. However, formal (Lyapunov) stability guarantees over the region-of-attraction (ROA) for NN controllers with nonlinear dynamical systems are challenging to obtain, and most existing approaches rely on expensive solvers such as sums-of-squares (SOS), mixed-integer programming (MIP), or satisfiability modulo theories (SMT). In this paper, we demonstrate a new framework for learning NN controllers together with Lyapunov certificates using fast empirical falsification and strategic regularizations. We propose a novel formulation that defines a larger verifiable region-of-attraction (ROA) than shown in the literature, and refines the conventional restrictive constraints on Lyapunov derivatives to focus only on certifiable ROAs. The Lyapunov condition is rigorously verified post-hoc using branch-and-bound with scalable linear bound propagation-based NN verification techniques. The approach is efficient and flexible, and the full training and verification procedure is accelerated on GPUs without relying on expensive solvers for SOS, MIP, nor SMT. The flexibility and efficiency of our framework allow us to demonstrate Lyapunov-stable output feedback control with synthesized NN-based controllers and NN-based observers with formal stability guarantees, for the first time in literature. Source code at https://github.com/Verified-Intelligence/Lyapunov_Stable_NN_Controllers △ Less

Submitted 4 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

Comments: Paper accepted by ICML 2024

arXiv:2404.04155 [pdf, other]

MarsSeg: Mars Surface Semantic Segmentation with Multi-level Extractor and Connector

Authors: Junbo Li, Keyan Chen, Gengju Tian, Lu Li, Zhenwei Shi

Abstract: The segmentation and interpretation of the Martian surface play a pivotal role in Mars exploration, providing essential data for the trajectory planning and obstacle avoidance of rovers. However, the complex topography, similar surface features, and the lack of extensive annotated data pose significant challenges to the high-precision semantic segmentation of the Martian surface. To address these… ▽ More The segmentation and interpretation of the Martian surface play a pivotal role in Mars exploration, providing essential data for the trajectory planning and obstacle avoidance of rovers. However, the complex topography, similar surface features, and the lack of extensive annotated data pose significant challenges to the high-precision semantic segmentation of the Martian surface. To address these challenges, we propose a novel encoder-decoder based Mars segmentation network, termed MarsSeg. Specifically, we employ an encoder-decoder structure with a minimized number of down-sampling layers to preserve local details. To facilitate a high-level semantic understanding across the shadow multi-level feature maps, we introduce a feature enhancement connection layer situated between the encoder and decoder. This layer incorporates Mini Atrous Spatial Pyramid Pooling (Mini-ASPP), Polarized Self-Attention (PSA), and Strip Pyramid Pooling Module (SPPM). The Mini-ASPP and PSA are specifically designed for shadow feature enhancement, thereby enabling the expression of local details and small objects. Conversely, the SPPM is employed for deep feature enhancement, facilitating the extraction of high-level semantic category-related information. Experimental results derived from the Mars-Seg and AI4Mars datasets substantiate that the proposed MarsSeg outperforms other state-of-the-art methods in segmentation performance, validating the efficacy of each proposed component. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.04102 [pdf, other]

ROPO: Robust Preference Optimization for Large Language Models

Authors: Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jie** Ye

Abstract: Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts for this problem either marginally alleviate the impact of noise without the ability to actually reduce its presence, or rely on costly teacher LLMs… ▽ More Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts for this problem either marginally alleviate the impact of noise without the ability to actually reduce its presence, or rely on costly teacher LLMs prone to reward misgeneralization. To address these challenges, we propose the RObust Preference Optimization (ROPO) framework, an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models. Specifically, ROPO iteratively solves a constrained optimization problem, where we dynamically assign a quality-aware weight for each sample and constrain the sum of the weights to the number of samples we intend to retain. For noise-tolerant training and effective noise identification, we derive a robust loss by suppressing the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is critical for distinguishing noisy samples from clean ones. Furthermore, inspired by our derived loss, we propose a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods, with its superiority growing as the noise rate increases. △ Less

Submitted 28 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.01082 [pdf, other]

The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023

Authors: Jun Lyu, Chen Qin, Shuo Wang, Fanwen Wang, Yan Li, Zi Wang, Kunyuan Guo, Cheng Ouyang, Michael Tänzer, Meng Liu, Longyu Sun, Mengting Sun, Qin Li, Zhang Shi, Sha Hua, Hao Li, Zhensen Chen, Zhenlin Zhang, Bingyu Xin, Dimitris N. Metaxas, George Yiasemis, Jonas Teuwen, Li** Zhang, Weitian Chen, Yidong Zhao , et al. (25 additional authors not shown)

Abstract: Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation p… ▽ More Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation platform hinder the development of data-driven reconstruction algorithms. To address this issue, we organized the Cardiac MRI Reconstruction Challenge (CMRxRecon) in 2023, in collaboration with the 26th International Conference on MICCAI. CMRxRecon presented an extensive k-space dataset comprising cine and map** raw data, accompanied by detailed annotations of cardiac anatomical structures. With overwhelming participation, the challenge attracted more than 285 teams and over 600 participants. Among them, 22 teams successfully submitted Docker containers for the testing phase, with 7 teams submitted for both cine and map** tasks. All teams use deep learning based approaches, indicating that deep learning has predominately become a promising solution for the problem. The first-place winner of both tasks utilizes the E2E-VarNet architecture as backbones. In contrast, U-Net is still the most popular backbone for both multi-coil and single-coil reconstructions. This paper provides a comprehensive overview of the challenge design, presents a summary of the submitted results, reviews the employed methods, and offers an in-depth discussion that aims to inspire future advancements in cardiac MRI reconstruction models. The summary emphasizes the effective strategies observed in Cardiac MRI reconstruction, including backbone architecture, loss function, pre-processing techniques, physical modeling, and model complexity, thereby providing valuable insights for further developments in this field. △ Less

Submitted 16 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 25 pages, 17 figures

arXiv:2404.00938 [pdf, ps, other]

How Can Large Language Models Enable Better Socially Assistive Human-Robot Interaction: A Brief Survey

Authors: Zhonghao Shi, Ellen Landrum, Amy O' Connell, Mina Kian, Leticia Pinto-Alva, Kaleen Shrestha, Xiaoyuan Zhu, Maja J Matarić

Abstract: Socially assistive robots (SARs) have shown great success in providing personalized cognitive-affective support for user populations with special needs such as older adults, children with autism spectrum disorder (ASD), and individuals with mental health challenges. The large body of work on SAR demonstrates its potential to provide at-home support that complements clinic-based interventions deliv… ▽ More Socially assistive robots (SARs) have shown great success in providing personalized cognitive-affective support for user populations with special needs such as older adults, children with autism spectrum disorder (ASD), and individuals with mental health challenges. The large body of work on SAR demonstrates its potential to provide at-home support that complements clinic-based interventions delivered by mental health professionals, making these interventions more effective and accessible. However, there are still several major technical challenges that hinder SAR-mediated interactions and interventions from reaching human-level social intelligence and efficacy. With the recent advances in large language models (LLMs), there is an increased potential for novel applications within the field of SAR that can significantly expand the current capabilities of SARs. However, incorporating LLMs introduces new risks and ethical concerns that have not yet been encountered, and must be carefully be addressed to safely deploy these more advanced systems. In this work, we aim to conduct a brief survey on the use of LLMs in SAR technologies, and discuss the potentials and risks of applying LLMs to the following three major technical challenges of SAR: 1) natural language dialog; 2) multimodal understanding; 3) LLMs as robot policies. △ Less

Submitted 5 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 2 pages, accepted to the Proceedings of the AAAI Symposium Series, 2024

arXiv:2404.00863 [pdf, other]

Voice Conversion Augmentation for Speaker Recognition on Defective Datasets

Authors: Ruijie Tao, Zhan Shi, Yidi Jiang, Tianchi Liu, Haizhou Li

Abstract: Modern speaker recognition system relies on abundant and balanced datasets for classification training. However, diverse defective datasets, such as partially-labelled, small-scale, and imbalanced datasets, are common in real-world applications. Previous works usually studied specific solutions for each scenario from the algorithm perspective. However, the root cause of these problems lies in data… ▽ More Modern speaker recognition system relies on abundant and balanced datasets for classification training. However, diverse defective datasets, such as partially-labelled, small-scale, and imbalanced datasets, are common in real-world applications. Previous works usually studied specific solutions for each scenario from the algorithm perspective. However, the root cause of these problems lies in dataset imperfections. To address these challenges with a unified solution, we propose the Voice Conversion Augmentation (VCA) strategy to obtain pseudo speech from the training set. Furthermore, to guarantee generation quality, we designed the VCA-NN~(nearest neighbours) strategy to select source speech from utterances that are close to the target speech in the representation space. Our experimental results on three created datasets demonstrated that VCA-NN effectively mitigates these dataset problems, which provides a new direction for handling the speaker recognition problems from the data aspect. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: 5 pages

arXiv:2404.00299 [pdf, other]

HOI-M3:Capture Multiple Humans and Objects Interaction within Contextual Environment

Authors: Juze Zhang, **gyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, **gyi Yu, Lan Xu, **gya Wang

Abstract: Humans naturally interact with both others and the surrounding multiple objects, engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multi… ▽ More Humans naturally interact with both others and the surrounding multiple objects, engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research about multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding codes, and pre-trained models will be disseminated to the community for future research. △ Less

Submitted 2 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.20198 [pdf, other]

Minimizing End-to-End Latency for Joint Source-Channel Coding Systems

Authors: Kaiyi Chi, Qianqian Yang, Yuanchao Shu, Zhaohui Yang, Zhiguo Shi

Abstract: While existing studies have highlighted the advantages of deep learning (DL)-based joint source-channel coding (JSCC) schemes in enhancing transmission efficiency, they often overlook the crucial aspect of resource management during the deployment phase. In this paper, we propose an approach to minimize the transmission latency in an uplink JSCC-based system. We first analyze the correlation betwe… ▽ More While existing studies have highlighted the advantages of deep learning (DL)-based joint source-channel coding (JSCC) schemes in enhancing transmission efficiency, they often overlook the crucial aspect of resource management during the deployment phase. In this paper, we propose an approach to minimize the transmission latency in an uplink JSCC-based system. We first analyze the correlation between end-to-end latency and task performance, based on which the end-to-end delay model for each device is established. Then, we formulate a non-convex optimization problem aiming at minimizing the maximum end-to-end latency across all devices, which is proved to be NP-hard. We then transform the original problem into a more tractable one, from which we derive the closed form solution on the optimal compression ratio, truncation threshold selection policy, and resource allocation strategy. We further introduce a heuristic algorithm with low complexity, leveraging insights from the structure of the optimal solution. Simulation results demonstrate that both the proposed optimal algorithm and the heuristic algorithm significantly reduce end-to-end latency. Notably, the proposed heuristic algorithm achieves nearly the same performance to the optimal solution but with considerably lower computational complexity. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: 7 Pages, 5 Figures, accepted by 2024 IEEE ICC Workshop

arXiv:2403.19654 [pdf, other]

RSMamba: Remote Sensing Image Classification with State Space Model

Authors: Keyan Chen, Bowen Chen, Chenyang Liu, Wenyuan Li, Zhengxia Zou, Zhenwei Shi

Abstract: Remote sensing image classification forms the foundation of various understanding tasks, serving a crucial function in remote sensing image interpretation. The recent advancements of Convolutional Neural Networks (CNNs) and Transformers have markedly enhanced classification accuracy. Nonetheless, remote sensing scene classification remains a significant challenge, especially given the complexity a… ▽ More Remote sensing image classification forms the foundation of various understanding tasks, serving a crucial function in remote sensing image interpretation. The recent advancements of Convolutional Neural Networks (CNNs) and Transformers have markedly enhanced classification accuracy. Nonetheless, remote sensing scene classification remains a significant challenge, especially given the complexity and diversity of remote sensing scenarios and the variability of spatiotemporal resolutions. The capacity for whole-image understanding can provide more precise semantic cues for scene discrimination. In this paper, we introduce RSMamba, a novel architecture for remote sensing image classification. RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as the Mamba. It integrates the advantages of both a global receptive field and linear modeling complexity. To overcome the limitation of the vanilla Mamba, which can only model causal sequences and is not adaptable to two-dimensional image data, we propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-causal data. Notably, RSMamba maintains the inherent modeling mechanism of the vanilla Mamba, yet exhibits superior performance across multiple remote sensing image classification datasets. This indicates that RSMamba holds significant potential to function as the backbone of future visual foundation models. The code will be available at \url{https://github.com/KyanChen/RSMamba}. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.19646 [pdf, other]

Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis

Authors: Chenyang Liu, Keyan Chen, Haotian Zhang, Zipeng Qi, Zhengxia Zou, Zhenwei Shi

Abstract: Monitoring changes in the Earth's surface is crucial for understanding natural processes and human impacts, necessitating precise and comprehensive interpretation methodologies. Remote sensing satellite imagery offers a unique perspective for monitoring these changes, leading to the emergence of remote sensing image change interpretation (RSICI) as a significant research focus. Current RSICI techn… ▽ More Monitoring changes in the Earth's surface is crucial for understanding natural processes and human impacts, necessitating precise and comprehensive interpretation methodologies. Remote sensing satellite imagery offers a unique perspective for monitoring these changes, leading to the emergence of remote sensing image change interpretation (RSICI) as a significant research focus. Current RSICI technology encompasses change detection and change captioning, each with its limitations in providing comprehensive interpretation. To address this, we propose an interactive Change-Agent, which can follow user instructions to achieve comprehensive change interpretation and insightful analysis according to user instructions, such as change detection and change captioning, change object counting, change cause analysis, etc. The Change-Agent integrates a multi-level change interpretation (MCI) model as the eyes and a large language model (LLM) as the brain. The MCI model contains two branches of pixel-level change detection and semantic-level change captioning, in which multiple BI-temporal Iterative Interaction (BI3) layers utilize Local Perception Enhancement (LPE) and the Global Difference Fusion Attention (GDFA) modules to enhance the model's discriminative feature representation capabilities. To support the training of the MCI model, we build the LEVIR-MCI dataset with a large number of change masks and captions of changes. Extensive experiments demonstrate the effectiveness of the proposed MCI model and highlight the promising potential of our Change-Agent in facilitating comprehensive and intelligent interpretation of surface changes. To facilitate future research, we will make our dataset and codebase of the MCI model and Change-Agent publicly available at https://github.com/Chen-Yang-Liu/Change-Agent △ Less

Submitted 1 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.19622 [pdf, other]

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, **g Shao, Yu Qiao, Cewu Lu, Lu Sheng

Abstract: The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans, mak… ▽ More The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans, making it possible to generalize on novel robotic tasks in a composable manner. Despite the promising future, the community is not yet adequately prepared for composable generalization agents, particularly due to the lack of primitive-level real-world robotic datasets. In this paper, we propose a primitive-level robotic dataset, namely RH20T-P, which contains about 33000 video clips covering 44 diverse and complicated robotic tasks. Each clip is manually annotated according to a set of meticulously designed primitive skills, facilitating the future development of composable generalization agents. To validate the effectiveness of RH20T-P, we also construct a potential and scalable agent based on RH20T-P, called RA-P. Equipped with two planners specialized in task decomposition and motion planning, RA-P can adapt to novel physical skills through composable generalization. Our website and videos can be found at https://sites.google.com/view/rh20t-primitive/main. Dataset and code will be made available soon. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: 24 pages, 12 figures, 6 tables

arXiv:2403.19446 [pdf, other]

EDA-Driven Preprocessing for SAT Solving

Authors: Zhengyuan Shi, Tiebing Tang, Sadaf Khan, Hui-Ling Zhen, Mingxuan Yuan, Zhufei Chu, Qiang Xu

Abstract: Effective formulation of problems into Conjunctive Normal Form (CNF) is critical in modern Boolean Satisfiability (SAT) solving for optimizing solver performance. Addressing the limitations of existing methods, our Electronic Design Automation (EDA)-driven preprocessing framework introduces a novel methodology for preparing SAT instances, leveraging both circuit and CNF formats for enhanced flexib… ▽ More Effective formulation of problems into Conjunctive Normal Form (CNF) is critical in modern Boolean Satisfiability (SAT) solving for optimizing solver performance. Addressing the limitations of existing methods, our Electronic Design Automation (EDA)-driven preprocessing framework introduces a novel methodology for preparing SAT instances, leveraging both circuit and CNF formats for enhanced flexibility and efficiency. Central to our approach is the integration of a new logic synthesis technique, guided by a reinforcement learning agent, and a novel cost-customized LUT map** strategy, enabling efficient handling of diverse SAT challenges. By transforming the SAT competition benchmarks into circuit instances, our framework demonstrates substantial performance improvements, as evidenced by a 52.42% reduction on average compared to solving directly. Moreover, our framework achieves a remarkable 96.14% runtime reduction on average for a set of logic equivalence checking problems that exhibit inherent circuit structures. These results highlight the effectiveness and versatility of our approach in handling both CNF and circuit instances. The code is available at https://github.com/cure-lab/EDA4SAT. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.18134 [pdf, other]

Integrative Graph-Transformer Framework for Histopathology Whole Slide Image Representation and Classification

Authors: Zhan Shi, **gwei Zhang, Jun Kong, Fusheng Wang

Abstract: In digital pathology, the multiple instance learning (MIL) strategy is widely used in the weakly supervised histopathology whole slide image (WSI) classification task where giga-pixel WSIs are only labeled at the slide level. However, existing attention-based MIL approaches often overlook contextual information and intrinsic spatial relationships between neighboring tissue tiles, while graph-based… ▽ More In digital pathology, the multiple instance learning (MIL) strategy is widely used in the weakly supervised histopathology whole slide image (WSI) classification task where giga-pixel WSIs are only labeled at the slide level. However, existing attention-based MIL approaches often overlook contextual information and intrinsic spatial relationships between neighboring tissue tiles, while graph-based MIL frameworks have limited power to recognize the long-range dependencies. In this paper, we introduce the integrative graph-transformer framework that simultaneously captures the context-aware relational features and global WSI representations through a novel Graph Transformer Integration (GTI) block. Specifically, each GTI block consists of a Graph Convolutional Network (GCN) layer modeling neighboring relations at the local instance level and an efficient global attention model capturing comprehensive global information from extensive feature embeddings. Extensive experiments on three publicly available WSI datasets: TCGA-NSCLC, TCGA-RCC and BRIGHT, demonstrate the superiority of our approach over current state-of-the-art MIL methods, achieving an improvement of 1.0% to 2.6% in accuracy and 0.7%-1.6% in AUROC. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.17830 [pdf, other]

Assessment of Multimodal Large Language Models in Alignment with Human Values

Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, **g Shao

Abstract: Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining h… ▽ More Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty in collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2311.02692

arXiv:2403.15726 [pdf, other]

Convection-Diffusion Equation: A Theoretically Certified Framework for Neural Networks

Authors: Tangjun Wang, Chenglong Bao, Zuoqiang Shi

Abstract: In this paper, we study the partial differential equation models of neural networks. Neural network can be viewed as a map from a simple base model to a complicate function. Based on solid analysis, we show that this map can be formulated by a convection-diffusion equation. This theoretically certified framework gives mathematical foundation and more understanding of neural networks. Moreover, bas… ▽ More In this paper, we study the partial differential equation models of neural networks. Neural network can be viewed as a map from a simple base model to a complicate function. Based on solid analysis, we show that this map can be formulated by a convection-diffusion equation. This theoretically certified framework gives mathematical foundation and more understanding of neural networks. Moreover, based on the convection-diffusion equation model, we design a novel network structure, which incorporates diffusion mechanism into network architecture. Extensive experiments on both benchmark datasets and real-world applications validate the performance of the proposed model. △ Less

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2403.14621 [pdf, other]

GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation

Authors: Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, Gordon Wetzstein

Abstract: We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer… ▽ More We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives regarding both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. Our project website is at: https://justimyhxu.github.io/projects/grm/. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: Project page: https://justimyhxu.github.io/projects/grm/ Code: https://github.com/justimyhxu/GRM

arXiv:2403.13562 [pdf, other]

Augmented Labeled Random Finite Sets and Its Application to Group Target Tracking

Authors: Chaoqun Yang, Mengdie Xu, Xiaowei Liang, Zhiguo Shi, Heng Zhang, Xianghui Cao

Abstract: This paper addresses the problem of group target tracking (GTT), wherein multiple closely spaced targets within a group pose a coordinated motion. To improve the tracking performance, the labeled random finite sets (LRFSs) theory is adopted, and this paper develops a new kind of LRFSs, i.e., augmented LRFSs, which introduces group information into the definition of LRFSs. Specifically, for each el… ▽ More This paper addresses the problem of group target tracking (GTT), wherein multiple closely spaced targets within a group pose a coordinated motion. To improve the tracking performance, the labeled random finite sets (LRFSs) theory is adopted, and this paper develops a new kind of LRFSs, i.e., augmented LRFSs, which introduces group information into the definition of LRFSs. Specifically, for each element in an LRFS, the kinetic states, track label, and the corresponding group information of its represented target are incorporated. Furthermore, by means of the labeled multi-Bernoulli (LMB) filter with the proposed augmented LRFSs, the group structure is iteratively propagated and updated during the tracking process, which achieves the simultaneously estimation of the kinetic states, track label, and the corresponding group information of multiple group targets, and further improves the GTT tracking performance. Finally, simulation experiments are provided, which well demonstrates the effectiveness of the labeled multi-Bernoulli filter with the proposed augmented LRFSs for GTT tracking. △ Less

Submitted 16 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.11751 [pdf, other]

Relational Representation Learning Network for Cross-Spectral Image Patch Matching

Authors: Chuang Yu, Yunpeng Liu, **miao Zhao, Dou Quan, Zelin Shi

Abstract: Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing related research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. Therefore, an innovative relational representation learning idea is proposed for the first time, which… ▽ More Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing related research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. Therefore, an innovative relational representation learning idea is proposed for the first time, which simultaneously focuses on sufficiently mining the intrinsic features of individual image patches and the relations between image patch features. Based on this, we construct a lightweight Relational Representation Learning Network (RRL-Net). Specifically, we innovatively construct an autoencoder to fully characterize the individual intrinsic features, and introduce a Feature Interaction Learning (FIL) module to extract deep-level feature relations. To further fully mine individual intrinsic features, a lightweight Multi-dimensional Global-to-Local Attention (MGLA) module is constructed to enhance the global feature extraction of individual image patches and capture local dependencies within global features. By combining the MGLA module, we further explore the feature extraction network and construct an Attention-based Lightweight Feature Extraction (ALFE) network. In addition, we propose a Multi-Loss Post-Pruning (MLPP) optimization strategy, which greatly promotes network optimization while avoiding increases in parameters and inference time. Extensive experiments demonstrate that our RRL-Net achieves state-of-the-art (SOTA) performance on multiple public datasets. Our code will be made public later. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11465 [pdf]

Ultra-Long Homochiral Graphene Nanoribbons Grown Within h-BN Stacks for High-Performance Electronics

Authors: Bosai Lyu, Jiajun Chen, Sen Wang, Shuo Lou, Peiyue Shen, **gxu Xie, Lu Qiu, Izaac Mitchell, Can Li, Cheng Hu, Xianliang Zhou, Kenji Watanabe, Takashi Taniguchi, Xiaoqun Wang, **feng Jia, Qi Liang, Guorui Chen, Tingxin Li, Shiyong Wang, Wengen Ouyang, Oded Hod, Feng Ding, Michael Urbakh, Zhiwen Shi

Abstract: Van der Waals encapsulation of two-dimensional materials within hexagonal boron nitride (h-BN) stacks has proven to be a promising way to create ultrahigh-performance electronic devices. However, contemporary approaches for achieving van der Waals encapsulation, which involve artificial layer stacking using mechanical transfer techniques, are difficult to control, prone to contamination, and unsca… ▽ More Van der Waals encapsulation of two-dimensional materials within hexagonal boron nitride (h-BN) stacks has proven to be a promising way to create ultrahigh-performance electronic devices. However, contemporary approaches for achieving van der Waals encapsulation, which involve artificial layer stacking using mechanical transfer techniques, are difficult to control, prone to contamination, and unscalable. Here, we report on the transfer-free direct growth of high-quality graphene nanoribbons (GNRs) within h-BN stacks. The as-grown embedded GNRs exhibit highly desirable features being ultralong (up to 0.25 mm), ultranarrow ( < 5 nm), and homochiral with zigzag edges. Our atomistic simulations reveal that the mechanism underlying the embedded growth involves ultralow GNR friction when sliding between AA'-stacked h-BN layers. Using the grown structures, we demonstrate the transfer-free fabrication of embedded GNR field-effect devices that exhibit excellent performance at room temperature with mobilities of up to 4,600 $cm^{2} V^{-1} s^{-1}$ and on-off ratios of up to $10^{6}$. This paves the way to the bottom-up fabrication of high-performance electronic devices based on embedded layered materials. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.09960 [pdf, other]

Multivariate Gaussian Approximation for Random Forest via Region-based Stabilization

Authors: Zhaoyang Shi, Chinmoy Bhattacharjee, Krishnakumar Balasubramanian, Wolfgang Polonik

Abstract: We derive Gaussian approximation bounds for random forest predictions based on a set of training points given by a Poisson process, under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that the random forest predictions satisfy a certain geometric property called region-based stabilization. In the process of develo** our results fo… ▽ More We derive Gaussian approximation bounds for random forest predictions based on a set of training points given by a Poisson process, under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that the random forest predictions satisfy a certain geometric property called region-based stabilization. In the process of develo** our results for the random forest, we also establish a probabilistic result, which might be of independent interest, on multivariate Gaussian approximation bounds for general functionals of Poisson process that are region-based stabilizing. This general result makes use of the Malliavin-Stein method, and is potentially applicable to various related statistical problems. △ Less

Submitted 25 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.07712 [pdf, ps, other]

Nonlocal Stokes equation with relaxation on the divergence free equation

Authors: Yajie Zhang, Qiang Du, Zuoqiang Shi

Abstract: In this paper, we consider a new nonlocal approximation to the linear Stokes system with periodic boundary conditions in two and three dimensional spaces . A relaxation term is added to the equation of nonlocal divergence free equation, which is reminiscent to the relaxation of local Stokes equation with small artificial compressibility. Our analysis shows that the well-posedness of the nonlocal s… ▽ More In this paper, we consider a new nonlocal approximation to the linear Stokes system with periodic boundary conditions in two and three dimensional spaces . A relaxation term is added to the equation of nonlocal divergence free equation, which is reminiscent to the relaxation of local Stokes equation with small artificial compressibility. Our analysis shows that the well-posedness of the nonlocal system can be established under some mild assumptions on the kernel of nonlocal interactions. Furthermore, the new nonlocal system converges to the conventional, local Stokes system in second order as the horizon parameter of the nonlocal interaction goes to zero. The study provides more theoretical understanding to some numerical methods, such as smoothed particle hydrodynamics, for simulating incompressible viscous flows. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07436 [pdf, other]

JSTR: Joint Spatio-Temporal Reasoning for Event-based Moving Object Detection

Authors: Hanyu Zhou, Zhiwei Shi, Hao Dong, Shihan Peng, Yi Chang, Luxin Yan

Abstract: Event-based moving object detection is a challenging task, where static background and moving object are mixed together. Typically, existing methods mainly align the background events to the same spatial coordinate system via motion compensation to distinguish the moving object. However, they neglect the potential spatial tailing effect of moving object events caused by excessive motion, which may… ▽ More Event-based moving object detection is a challenging task, where static background and moving object are mixed together. Typically, existing methods mainly align the background events to the same spatial coordinate system via motion compensation to distinguish the moving object. However, they neglect the potential spatial tailing effect of moving object events caused by excessive motion, which may affect the structure integrity of the extracted moving object. We discover that the moving object has a complete columnar structure in the point cloud composed of motion-compensated events along the timestamp. Motivated by this, we propose a novel joint spatio-temporal reasoning method for event-based moving object detection. Specifically, we first compensate the motion of background events using inertial measurement unit. In spatial reasoning stage, we project the compensated events into the same image coordinate, discretize the timestamp of events to obtain a time image that can reflect the motion confidence, and further segment the moving object through adaptive threshold on the time image. In temporal reasoning stage, we construct the events into a point cloud along timestamp, and use RANSAC algorithm to extract the columnar shape in the cloud for peeling off the background. Finally, we fuse the results from the two reasoning stages to extract the final moving object region. This joint spatio-temporal reasoning framework can effectively detect the moving object from motion confidence and geometric structure. Moreover, we conduct extensive experiments on various datasets to verify that the proposed method can improve the moving object detection accuracy by 13\%. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07432 [pdf, other]

Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow

Authors: Hanyu Zhou, Yi Chang, Zhiwei Shi, Luxin Yan

Abstract: Single RGB or LiDAR is the mainstream sensor for the challenging scene flow, which relies heavily on visual features to match motion features. Compared with single modality, existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space. However, these direct fusion methods may suffer the modality gap due to the visual intrinsic heterogeneous natu… ▽ More Single RGB or LiDAR is the mainstream sensor for the challenging scene flow, which relies heavily on visual features to match motion features. Compared with single modality, existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space. However, these direct fusion methods may suffer the modality gap due to the visual intrinsic heterogeneous nature between RGB and LiDAR, thus deteriorating motion features. We discover that event has the homogeneous nature with RGB and LiDAR in both visual and motion spaces. In this work, we bring the event as a bridge between RGB and LiDAR, and propose a novel hierarchical visual-motion fusion framework for scene flow, which explores a homogeneous space to fuse the cross-modal complementary knowledge for physical interpretation. In visual fusion, we discover that event has a complementarity (relative v.s. absolute) in luminance space with RGB for high dynamic imaging, and has a complementarity (local boundary v.s. global shape) in scene structure space with LiDAR for structure integrity. In motion fusion, we figure out that RGB, event and LiDAR are complementary (spatial-dense, temporal-dense v.s. spatiotemporal-sparse) to each other in correlation space, which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion can explicitly fuse the multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments have been performed to verify the superiority of the proposed method. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07257 [pdf, other]

The Dawn of AI-Native EDA: Opportunities and Challenges of Large Circuit Models

Authors: Lei Chen, Yiqi Chen, Zhufei Chu, Wenji Fang, Tsung-Yi Ho, Ru Huang, Yu Huang, Sadaf Khan, Min Li, Xingquan Li, Yu Li, Yun Liang, **wei Liu, Yi Liu, Yibo Lin, Guojie Luo, Zhengyuan Shi, Guangyu Sun, Dimitrios Tsaras, Runsheng Wang, Ziyi Wang, Xinming Wei, Zhiyao Xie, Qiang Xu, Chenhao Xue , et al. (14 additional authors not shown)

Abstract: Within the Electronic Design Automation (EDA) domain, AI-driven solutions have emerged as formidable tools, yet they typically augment rather than redefine existing methodologies. These solutions often repurpose deep learning models from other domains, such as vision, text, and graph analytics, applying them to circuit design without tailoring to the unique complexities of electronic circuits. Suc… ▽ More Within the Electronic Design Automation (EDA) domain, AI-driven solutions have emerged as formidable tools, yet they typically augment rather than redefine existing methodologies. These solutions often repurpose deep learning models from other domains, such as vision, text, and graph analytics, applying them to circuit design without tailoring to the unique complexities of electronic circuits. Such an AI4EDA approach falls short of achieving a holistic design synthesis and understanding, overlooking the intricate interplay of electrical, logical, and physical facets of circuit data. This paper argues for a paradigm shift from AI4EDA towards AI-native EDA, integrating AI at the core of the design process. Pivotal to this vision is the development of a multimodal circuit representation learning technique, poised to provide a comprehensive understanding by harmonizing and extracting insights from varied data sources, such as functional specifications, RTL designs, circuit netlists, and physical layouts. We champion the creation of large circuit models (LCMs) that are inherently multimodal, crafted to decode and express the rich semantics and structures of circuit data, thus fostering more resilient, efficient, and inventive design methodologies. Embracing this AI-native philosophy, we foresee a trajectory that transcends the current innovation plateau in EDA, igniting a profound shift-left in electronic design methodology. The envisioned advancements herald not just an evolution of existing EDA tools but a revolution, giving rise to novel instruments of design tools that promise to radically enhance design productivity and inaugurate a new epoch where the optimization of circuit performance, power, and area (PPA) is achieved not incrementally, but through leaps that redefine the benchmarks of electronic systems' capabilities. △ Less

Submitted 1 May, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: The authors are ordered alphabetically. Contact: qxu@cse[dot]cuhk[dot]edu[dot]hk, gluo@pku[dot]edu[dot]cn, yuan.mingxuan@huawei[dot]com

arXiv:2403.05888 [pdf, ps, other]

A Second-Order Nonlocal Approximation to Manifold Poisson Models with Neumann Boundary

Authors: Yajie Zhang, Zuoqiang Shi

Abstract: In this paper, we propose a class of nonlocal models to approximate the Poisson model on manifolds with homogeneous Neumann boundary condition, where the manifolds are assumed to be embedded in high dimensional Euclid spaces. In comparison to the existing nonlocal approximation of Poisson models with Neumann boundary, we optimize the truncation error of model by adding an augmented term along the… ▽ More In this paper, we propose a class of nonlocal models to approximate the Poisson model on manifolds with homogeneous Neumann boundary condition, where the manifolds are assumed to be embedded in high dimensional Euclid spaces. In comparison to the existing nonlocal approximation of Poisson models with Neumann boundary, we optimize the truncation error of model by adding an augmented term along the $2δ$ layer of boundary, with $2δ$ be the nonlocal interaction horizon. Such term is formulated by the integration of the second order normal derivative of solution through the boundary, while the second order normal derivative is expressed as the difference between the interior Laplacian and the boundary Laplacian. The concentration of our paper is on the construction of nonlocal model, the well-posedness of model, and its second-order convergence rate to its local counterpart. The localization rate of our nonlocal model is currently optimal among all related works even for the case of high dimensional Euclid spaces. △ Less

Submitted 9 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2203.02120

arXiv:2403.05430 [pdf, other]

MambaLithium: Selective state space model for remaining-useful-life, state-of-health, and state-of-charge estimation of lithium-ion batteries

Authors: Zhuangwei Shi

Abstract: Recently, lithium-ion batteries occupy a pivotal position in the realm of electric vehicles and the burgeoning new energy industry. Their performance is heavily dependent on three core states: remaining-useful-life (RUL), state-of-health (SOH), and state-of-charge (SOC). Given the remarkable success of Mamba (Structured state space sequence models with selection mechanism and scan module, S6) in s… ▽ More Recently, lithium-ion batteries occupy a pivotal position in the realm of electric vehicles and the burgeoning new energy industry. Their performance is heavily dependent on three core states: remaining-useful-life (RUL), state-of-health (SOH), and state-of-charge (SOC). Given the remarkable success of Mamba (Structured state space sequence models with selection mechanism and scan module, S6) in sequence modeling tasks, this paper introduces MambaLithium, a selective state space model tailored for precise estimation of these critical battery states. Leveraging Mamba algorithms, MambaLithium adeptly captures the intricate aging and charging dynamics of lithium-ion batteries. By focusing on pivotal states within the battery's operational envelope, MambaLithium not only enhances estimation accuracy but also maintains computational robustness. Experiments conducted using real-world battery data have validated the model's superiority in predicting battery health and performance metrics, surpassing current methods. The proposed MambaLithium framework is potential for applications in advancing battery management systems and fostering sustainable energy storage solutions. Source code is available at https://github.com/zshicode/MambaLithium. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2402.18959; text overlap with arXiv:2204.02623

arXiv:2403.04194 [pdf, other]

SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising

Authors: Tao Zhou, Wenhan Luo, Qi Ye, Zhiguo Shi, Jiming Chen

Abstract: Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt inputs, such as imprecise bounding boxes. In this paper, we explore the potential of applying SAM to track and segment objects in videos where we recognize the tr… ▽ More Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt inputs, such as imprecise bounding boxes. In this paper, we explore the potential of applying SAM to track and segment objects in videos where we recognize the tracking task as a prompt denoising task. Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy where we provide multiple jittered and scaled box prompts for each object and preserve the mask prediction with the highest semantic similarity to the template mask. We also introduce a point-based refinement stage to handle occlusions and reduce cumulative errors. Without involving tracking modules, our approach demonstrates comparable performance in video object/instance segmentation tasks on three datasets: DAVIS2017, YouTubeVOS2018, and UVO, serving as a concise baseline and endowing SAM-based downstream applications with tracking capabilities. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.03031 [pdf, other]

Learning to Use Tools via Cooperative and Interactive Agents

Authors: Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Pengjie Ren, Suzan Verberne, Zhaochun Ren

Abstract: Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the… ▽ More Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the pre-defined pipeline with restricted flexibility to calibrate incorrect actions, and (2) the struggle to adapt a general LLM-based agent to perform a variety of specialized actions. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. ConAgents introduces two communication protocols to enable the flexible cooperation of agents. To effectively generalize the ConAgents into open-source models, we also propose specialized action distillation, enhancing their ability to perform specialized actions in our framework. Our extensive experiments on three datasets show that the LLMs, when equipped with the ConAgents, outperform baselines with substantial improvement (i.e., up to 14% higher success rate). △ Less

Submitted 22 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: working in process

arXiv:2403.02714 [pdf, other]

DomainVerse: A Benchmark Towards Real-World Distribution Shifts For Tuning-Free Adaptive Domain Generalization

Authors: Feng Hou, ** Yuan, Ying Yang, Yang Liu, Yang Zhang, Cheng Zhong, Zhongchao Shi, Jian** Fan, Yong Rui, Zhiqiang He

Abstract: Traditional cross-domain tasks, including domain adaptation and domain generalization, rely heavily on training model by source domain data. With the recent advance of vision-language models (VLMs), viewed as natural source models, the cross-domain task changes to directly adapt the pre-trained source model to arbitrary target domains equipped with prior domain knowledge, and we name this task Ada… ▽ More Traditional cross-domain tasks, including domain adaptation and domain generalization, rely heavily on training model by source domain data. With the recent advance of vision-language models (VLMs), viewed as natural source models, the cross-domain task changes to directly adapt the pre-trained source model to arbitrary target domains equipped with prior domain knowledge, and we name this task Adaptive Domain Generalization (ADG). However, current cross-domain datasets have many limitations, such as unrealistic domains, unclear domain definitions, and the inability to fine-grained domain decomposition, which drives us to establish a novel dataset DomainVerse for ADG. Benefiting from the introduced hierarchical definition of domain shifts, DomainVerse consists of about 0.5 million images from 390 fine-grained realistic domains. With the help of the constructed DomainVerse and VLMs, we propose two methods called Domain CLIP and Domain++ CLIP for tuning-free adaptive domain generalization. Extensive and comprehensive experiments demonstrate the significance of the dataset and the effectiveness of the proposed methods. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: Currently in review for ICML 2024

arXiv:2403.01093 [pdf, other]

Variational Bayesian Learning Based Localization and Channel Reconstruction in RIS-aided Systems

Authors: Yunfei Li, Yiting Luo, Xianda Wu, Zheng Shi, Shaodan Ma, Guanghua Yang

Abstract: The emerging immersive and autonomous services have posed stringent requirements on both communications and localization. By considering the great potential of reconfigurable intelligent surface (RIS), this paper focuses on the joint channel estimation and localization for RIS-aided wireless systems. As opposed to existing works that treat channel estimation and localization independently, this pa… ▽ More The emerging immersive and autonomous services have posed stringent requirements on both communications and localization. By considering the great potential of reconfigurable intelligent surface (RIS), this paper focuses on the joint channel estimation and localization for RIS-aided wireless systems. As opposed to existing works that treat channel estimation and localization independently, this paper exploits the intrinsic coupling and nonlinear relationships between the channel parameters and user location for enhancement of both localization and channel reconstruction. By noticing the non-convex, nonlinear objective function and the sparser angle pattern, a variational Bayesian learning-based framework is developed to jointly estimate the channel parameters and user location through leveraging an effective approximation of the posterior distribution. The proposed framework is capable of unifying near-field and far-field scenarios owing to exploitation of sparsity of the angular domain. Since the joint channel and location estimation problem has a closed-form solution in each iteration, our proposed iterative algorithm performs better than the conventional particle swarm optimization (PSO) and maximum likelihood (ML) based ones in terms of computational complexity. Simulations demonstrate that the proposed algorithm almost reaches the Bayesian Cramer-Rao bound (BCRB) and achieves a superior estimation accuracy by comparing to the PSO and the ML algorithms. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2403.00496 [pdf]

Benchmarking reconstructive spectrometer with multi-resonant cavities

Authors: Chunhui Yao, Kangning Xu, Tianhua Lin, Jie Ma, Chumeng Yao, Peng Bao, Zhitian Shi, Richard Penty, Qixiang Cheng

Abstract: Recent years have seen the rapid development of miniaturized reconstructive spectrometers (RSs), yet they still confront a range of technical challenges, such as bandwidth/resolution ratio, sensing speed, and/or power efficiency. Reported RS designs often suffer from insufficient decorrelation between sampling channels, which results in limited compressive sampling efficiency, in essence, due to i… ▽ More Recent years have seen the rapid development of miniaturized reconstructive spectrometers (RSs), yet they still confront a range of technical challenges, such as bandwidth/resolution ratio, sensing speed, and/or power efficiency. Reported RS designs often suffer from insufficient decorrelation between sampling channels, which results in limited compressive sampling efficiency, in essence, due to inadequate engineering of sampling responses. This in turn leads to poor spectral-pixel-to-channel ratios (SPCRs), typically restricted at single digits. So far, there lacks a general guideline for manipulating RS sampling responses for the effectiveness of spectral information acquisition. In this study, we shed light on a fundamental parameter from the compressive sensing theory - the average mutual correlation coefficient v - and provide insight into how it serves as a critical benchmark in RS design with regards to the SPCR and reconstruction accuracy. To this end, we propose a novel RS design with multi-resonant cavities, consisting of a series of partial reflective interfaces. Such multi-cavity configuration offers an expansive parameter space, facilitating the superlative optimization of sampling matrices with minimized v. As a proof-of-concept demonstration, a single-shot, dual-band RS is implemented on a SiN platform, tailored for capturing signature spectral shapes across different wavelength regions, with customized photonic crystal nanobeam mirrors. Experimentally, the device demonstrates an overall operation bandwidth of 270 nm and a <0.5 nm resolution with only 15 sampling channels per band, leading to a record high SPCR of 18.0. Moreover, the proposed multi-cavity design can be readily adapted to various photonic platforms. For instance, we showcase that by employing multi-layer coatings, an ultra-broadband RS can be optimized to exhibit a 700 nm bandwidth with an SPCR of over 100. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.18959 [pdf, other]

MambaStock: Selective state space model for stock prediction

Authors: Zhuangwei Shi

Abstract: The stock market plays a pivotal role in economic development, yet its intricate volatility poses challenges for investors. Consequently, research and accurate predictions of stock price movements are crucial for mitigating risks. Traditional time series models fall short in capturing nonlinearity, leading to unsatisfactory stock predictions. This limitation has spurred the widespread adoption of… ▽ More The stock market plays a pivotal role in economic development, yet its intricate volatility poses challenges for investors. Consequently, research and accurate predictions of stock price movements are crucial for mitigating risks. Traditional time series models fall short in capturing nonlinearity, leading to unsatisfactory stock predictions. This limitation has spurred the widespread adoption of neural networks for stock prediction, owing to their robust nonlinear generalization capabilities. Recently, Mamba, a structured state space sequence model with a selection mechanism and scan module (S6), has emerged as a powerful tool in sequence modeling tasks. Leveraging this framework, this paper proposes a novel Mamba-based model for stock price prediction, named MambaStock. The proposed MambaStock model effectively mines historical stock market data to predict future stock prices without handcrafted features or extensive preprocessing procedures. Empirical studies on several stocks indicate that the MambaStock model outperforms previous methods, delivering highly accurate predictions. This enhanced accuracy can assist investors and institutions in making informed decisions, aiming to maximize returns while minimizing risks. This work underscores the value of Mamba in time-series forecasting. Source code is available at https://github.com/zshicode/MambaStock. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2204.02623

arXiv:2402.16459 [pdf, other]

Defending LLMs against Jailbreaking Attacks via Backtranslation

Authors: Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh

Abstract: Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input pro… ▽ More Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits on its effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, in the cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at \url{https://github.com/YihanWang617/llm-jailbreaking-defense}, and the code for reproducing our experiments is available at \url{https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation}. △ Less

Submitted 6 June, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.15048 [pdf, other]

Unlocking the Power of Large Language Models for Entity Alignment

Authors: Xuhui Jiang, Yinghan Shen, Zhichao Shi, Cheng** Xu, Wei Li, Zixuan Li, Jian Guo, Huawei Shen, Yuanzhuo Wang

Abstract: Entity Alignment (EA) is vital for integrating diverse knowledge graph (KG) data, playing a crucial role in data-driven AI applications. Traditional EA methods primarily rely on comparing entity embeddings, but their effectiveness is constrained by the limited input KG data and the capabilities of the representation learning techniques. Against this backdrop, we introduce ChatEA, an innovative fra… ▽ More Entity Alignment (EA) is vital for integrating diverse knowledge graph (KG) data, playing a crucial role in data-driven AI applications. Traditional EA methods primarily rely on comparing entity embeddings, but their effectiveness is constrained by the limited input KG data and the capabilities of the representation learning techniques. Against this backdrop, we introduce ChatEA, an innovative framework that incorporates large language models (LLMs) to improve EA. To address the constraints of limited input KG data, ChatEA introduces a KG-code translation module that translates KG structures into a format understandable by LLMs, thereby allowing LLMs to utilize their extensive background knowledge to improve EA accuracy. To overcome the over-reliance on entity embedding comparisons, ChatEA implements a two-stage EA strategy that capitalizes on LLMs' capability for multi-step reasoning in a dialogue format, thereby enhancing accuracy while preserving efficiency. Our experimental results affirm ChatEA's superior performance, highlighting LLMs' potential in facilitating EA tasks. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.15017 [pdf, other]

Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning

Authors: Zhuoyan Xu, Zhenmei Shi, Junyi Wei, Fangzhou Mu, Yin Li, Yingyu Liang

Abstract: Foundation models have emerged as a powerful tool for many AI problems. Despite the tremendous success of foundation models, effective adaptation to new tasks, particularly those with limited labels, remains an open question and lacks theoretical understanding. An emerging solution with recent success in vision and NLP involves finetuning a foundation model on a selection of relevant tasks, before… ▽ More Foundation models have emerged as a powerful tool for many AI problems. Despite the tremendous success of foundation models, effective adaptation to new tasks, particularly those with limited labels, remains an open question and lacks theoretical understanding. An emerging solution with recent success in vision and NLP involves finetuning a foundation model on a selection of relevant tasks, before its adaptation to a target task with limited labeled samples. In this paper, we study the theoretical justification of this multitask finetuning approach. Our theoretical analysis reveals that with a diverse set of related tasks, this multitask finetuning leads to reduced error in the target task, in comparison to directly adapting the same pretrained model. We quantify the relationship between finetuning tasks and target tasks by diversity and consistency metrics, and further propose a practical task selection algorithm. We substantiate our theoretical claims with extensive empirical evidence. Further, we present results affirming our task selection algorithm adeptly chooses related finetuning tasks, providing advantages to the model performance on target tasks. We believe our study shed new light on the effective adaptation of foundation models to new tasks that lack abundant labels. Our code is available at https://github.com/OliverXUZY/Foudation-Model_Multitask. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: Published at ICLR 2024. 54 pages

arXiv:2402.14985 [pdf, other]

Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps

Authors: Zhaoyang Shi, Krishnakumar Balasubramanian, Wolfgang Polonik

Abstract: We develop nonparametric regression methods for the case when the true regression function is not necessarily smooth. More specifically, our approach is using the fractional Laplacian and is designed to handle the case when the true regression function lies in an $L_2$-fractional Sobolev space with order $s\in (0,1)$. This function class is a Hilbert space lying between the space of square-integra… ▽ More We develop nonparametric regression methods for the case when the true regression function is not necessarily smooth. More specifically, our approach is using the fractional Laplacian and is designed to handle the case when the true regression function lies in an $L_2$-fractional Sobolev space with order $s\in (0,1)$. This function class is a Hilbert space lying between the space of square-integrable functions and the first-order Sobolev space consisting of differentiable functions. It contains fractional power functions, piecewise constant or polynomial functions and bump function as canonical examples. For the proposed approach, we prove upper bounds on the in-sample mean-squared estimation error of order $n^{-\frac{2s}{2s+d}}$, where $d$ is the dimension, $s$ is the aforementioned order parameter and $n$ is the number of observations. We also provide preliminary empirical results validating the practical performance of the developed estimators. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.14000 [pdf, other]

Real-time 3D-aware Portrait Editing from a Single Image

Authors: Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen

Abstract: This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two comp… ▽ More This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our system achieves real-time editing with a feedforward network (i.e., ~0.04s per image), over 100x faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified customized types of editing during inference (e.g., with ~5min fine-tuning per style). The code, the model, and the interface will be made publicly available to facilitate future research. △ Less

Submitted 2 April, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.12436 [pdf, other]

Excitonic quantum criticality: from bilayer graphene to narrow Chern bands

Authors: Zhengyan Darius Shi, Hart Goldman, Zhihuan Dong, T. Senthil

Abstract: We study a family of excitonic quantum phase transitions describing the evolution of a bilayer metallic state to an inter-layer coherent state where excitons condense. We argue that such transitions can be continuous and exhibit a non-Fermi liquid counterflow response ${ρ_{\mathrm{counterflow}}(ω)\simω^{2/z}}$ that directly encodes the dynamical critical exponent $z$. Our calculations are performe… ▽ More We study a family of excitonic quantum phase transitions describing the evolution of a bilayer metallic state to an inter-layer coherent state where excitons condense. We argue that such transitions can be continuous and exhibit a non-Fermi liquid counterflow response ${ρ_{\mathrm{counterflow}}(ω)\simω^{2/z}}$ that directly encodes the dynamical critical exponent $z$. Our calculations are performed within a controlled expansion around $z = 2$. This physics is relevant to any system with spin, valley, or layer degrees of freedom. We consider two contexts for excitonic quantum criticality: (1) a weakly interacting graphene bilayer, and (2) a system of two narrow, half-filled Chern bands at zero external magnetic field, with total Chern number $C_{\mathrm{tot}}=0$, which may soon be realizable in moiré materials. The latter system hosts a time-reversed pair of composite Fermi liquid states, and the condensation of excitons of the composite fermions leads to an exotic exciton insulator* state with a charge neutral Fermi surface. Our work sheds new light on the physics of inter-layer coherence transitions in 2D materials. △ Less

Submitted 26 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: 30 pages + 26 pages appendices, 8 figures. (v2) Includes the effect of momentum dependence in the fermion self energy, which was neglected in (v1). Appendix A has been expanded to clarify the relationship between this work and the existing literature on ferromagnetic transitions in metals

arXiv:2402.11988 [pdf]

Photonic Chiplet Interconnection via 3D-Nanoprinted Interposer

Authors: Huiyu Huang, Zhitian Shi, Giuseppe Talli, Maxim Kuschnerov, Richard Penty, Qixiang Cheng

Abstract: Photonic integrated circuits utilize various waveguide materials, each excelling in specific metrics like efficient light emission, low propagation loss, high electro-optic efficiency, and potential for mass production. Inherent shortcomings in each platform push exploration of hybrid and heterogeneous integration, which demands specialized designs and extra fabrication processes for each material… ▽ More Photonic integrated circuits utilize various waveguide materials, each excelling in specific metrics like efficient light emission, low propagation loss, high electro-optic efficiency, and potential for mass production. Inherent shortcomings in each platform push exploration of hybrid and heterogeneous integration, which demands specialized designs and extra fabrication processes for each material combination. Our work introduces a novel hybrid integration scheme employing a 3D-nanoprinted interposer for a photonic chiplet interconnection system. This method represents a generic solution that can readily couple between chips of any material system, with each fabricated on its own technology platform with no change in the established process flow for the individual chips. Mode-size engineering is enhanced by the off-chip parabolic micro-reflectors. The 3D-nanoprinted chip-coupling frame and fiber-guiding funnel enable low-loss, fully passive assembly with a fast-printing process achieving sub-micron accuracy. Mode-field-dimension conversion ratio of 5:2 from fiber to chip is demonstrated with <0.5dB excess loss on top of the 1.7dB inherent coupling loss, marking the largest mode size conversion using non-waveguided components. Additionally, our system demonstrates a 2.5dB die-to-die coupling loss between silicon and InP chips over a 140nm wavelength range (1480nm to 1620nm), showcasing the potential for extensive cross-platform integration by bridging different waveguide materials. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.11818 [pdf, other]

Where It Really Matters: Few-Shot Environmental Conservation Media Monitoring for Low-Resource Languages

Authors: Sameer Jain, Sedrick Scott Keh, Shova Chettri, Karun Dewan, Pablo Izquierdo, Johanna Prussman, Pooja Shreshtha, Cesar Suarez, Zheyuan Ryan Shi, Lei Li, Fei Fang

Abstract: Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most… ▽ More Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the global south where news of interest is mainly in local low-resource languages, and far fewer experts are available to annotate datasets sustainably. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration example news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and achieves comparable performance with models fully fine-tuned using thousands of examples. The World Wide Fund for Nature (WWF) has deployed NewsSerow for media monitoring in Nepal, significantly reducing their operational burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed for countries with other languages like Colombia. △ Less

Submitted 18 February, 2024; originally announced February 2024.

Comments: AAAI 2024: AI for Social Impact Track

arXiv:2402.09469 [pdf, other]

Fourier Circuits in Neural Networks: Unlocking the Potential of Large Language Models in Mathematical Reasoning and Modular Arithmetic

Authors: Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Tianyi Zhou

Abstract: In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focu… ▽ More In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs and $m$ denote the network width. We demonstrate that a neuron count of $ m \geq 2^{2k-2} \cdot (p-1) $, these networks attain a maximum $ L_{2,k+1} $-margin on the dataset $ D_p $. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension of the intrinsic computational mechanisms of neural networks. Furthermore, we observe similar computational mechanisms in the attention matrix of the one-layer Transformer. This research stands as a significant stride in unraveling their operation complexities, particularly in the realm of complex algebraic tasks. △ Less

Submitted 24 May, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: Update Section 5.3; clean up problem setup

arXiv:2402.05940 [pdf]

Causal Relationship Network of Risk Factors Impacting Workday Loss in Underground Coal Mines

Authors: Shangsi Ren, Cameron A. Beeche, Zhiyi Shi, Maria Acevedo Garcia, Katherine Zychowski, Shuguang Leng, Pedram Roghanchi, Jiantao Pu

Abstract: This study aims to establish the causal relationship network between various factors leading to workday loss in underground coal mines using a novel causal artificial intelligence (AI) method. The analysis utilizes data obtained from the National Institute for Occupational Safety and Health (NIOSH). A total of 101,010 injury records from 3,982 unique underground coal mines spanning the years from… ▽ More This study aims to establish the causal relationship network between various factors leading to workday loss in underground coal mines using a novel causal artificial intelligence (AI) method. The analysis utilizes data obtained from the National Institute for Occupational Safety and Health (NIOSH). A total of 101,010 injury records from 3,982 unique underground coal mines spanning the years from 1990 to 2020 were extracted from the NIOSH database. Causal relationships were analyzed and visualized using a novel causal AI method called Grouped Greedy Equivalence Search (GGES). The impact of each variable on workday loss was assessed through intervention do-calculus adjustment (IDA) scores. Model training and validation were performed using the 10-fold cross-validation technique. Performance metrics, including adjacency precision (AP), adjacency recall (AR), arrowhead precision (AHP), and arrowhead recall (AHR), were utilized to evaluate the models. Findings revealed that after 2006, key direct causes of workday loss among mining employees included total mining experience, mean office employees, mean underground employees, county, and total mining experience (years). Total mining experience emerged as the most influential factor, whereas mean employees per mine exhibited the least influence. The analyses emphasized the significant role of total mining experience in determining workday loss. The models achieved optimal performance, with AP, AR, AHP, and AHR values measuring 0.694, 0.653, 0.386, and 0.345, respectively. This study demonstrates the feasibility of utilizing the new GGES method to clarify the causal factors behind the workday loss by analyzing employment demographics and injury records and establish their causal relationship network. △ Less

Submitted 24 January, 2024; originally announced February 2024.

Comments: 5 figures 5 tables

arXiv:2402.03569 [pdf, other]

The Invisible Game on the Internet: A Case Study of Decoding Deceptive Patterns

Authors: Zewei Shi, Ruoxi Sun, Jieshan Chen, Jiamou Sun, Minhui Xue

Abstract: Deceptive patterns are design practices embedded in digital platforms to manipulate users, representing a widespread and long-standing issue in the web and mobile software development industry. Legislative actions highlight the urgency of globally regulating deceptive patterns. However, despite advancements in detection tools, a significant gap exists in assessing deceptive pattern risks. In this… ▽ More Deceptive patterns are design practices embedded in digital platforms to manipulate users, representing a widespread and long-standing issue in the web and mobile software development industry. Legislative actions highlight the urgency of globally regulating deceptive patterns. However, despite advancements in detection tools, a significant gap exists in assessing deceptive pattern risks. In this study, we introduce a comprehensive approach involving the interactions between the Adversary, Watchdog (e.g., detection tools), and Challengers (e.g., users) to formalize and decode deceptive pattern threats. Based on this, we propose a quantitative risk assessment system. Representative cases are analyzed to showcase the practicability of the proposed risk scoring system, emphasizing the importance of involving human factors in deceptive pattern risk assessment. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.01647 [pdf, other]

Build Your Own Robot Friend: An Open-Source Learning Module for Accessible and Engaging AI Education

Authors: Zhonghao Shi, Allison O'Connell, Zongjian Li, Siqi Liu, Jennifer Ayissi, Guy Hoffman, Mohammad Soleymani, Maja J. Matarić

Abstract: As artificial intelligence (AI) is playing an increasingly important role in our society and global economy, AI education and literacy have become necessary components in college and K-12 education to prepare students for an AI-powered society. However, current AI curricula have not yet been made accessible and engaging enough for students and schools from all socio-economic backgrounds with diffe… ▽ More As artificial intelligence (AI) is playing an increasingly important role in our society and global economy, AI education and literacy have become necessary components in college and K-12 education to prepare students for an AI-powered society. However, current AI curricula have not yet been made accessible and engaging enough for students and schools from all socio-economic backgrounds with different educational goals. In this work, we developed an open-source learning module for college and high school students, which allows students to build their own robot companion from the ground up. This open platform can be used to provide hands-on experience and introductory knowledge about various aspects of AI, including robotics, machine learning (ML), software engineering, and mechanical engineering. Because of the social and personal nature of a socially assistive robot companion, this module also puts a special emphasis on human-centered AI, enabling students to develop a better understanding of human-AI interaction and AI ethics through hands-on learning activities. With open-source documentation, assembling manuals and affordable materials, students from different socio-economic backgrounds can personalize their learning experience based on their individual educational goals. To evaluate the student-perceived quality of our module, we conducted a usability testing workshop with 15 college students recruited from a minority-serving institution. Our results indicate that our AI module is effective, easy-to-follow, and engaging, and it increases student interest in studying AI/ML and robotics in the future. We hope that this work will contribute toward accessible and engaging AI education in human-AI interaction for college and high school students. △ Less

Submitted 6 January, 2024; originally announced February 2024.

Comments: Accepted to the Proceedings of the AAAI Conference on Artificial Intelligence (2024)

arXiv:2402.00379 [pdf, other]

Error-Tolerant Amplification and Simulation of the Ultrastrong-Coupling Quantum Rabi Model

Authors: Ye-Hong Chen, Zhi-Cheng Shi, Franco Nori, Yan Xia

Abstract: Cat-state qubits formed by photonic cat states show great promise for hardware-efficient universal quantum computing. We demonstrate that cat-state qubits are also promising for error-tolerant simulations of the quantum Rabi model (and its varieties) to enhance the coupling strength between the cat-state qubit and a cavity, to reach the ultrastrong-coupling regime. This allows us to explore severa… ▽ More Cat-state qubits formed by photonic cat states show great promise for hardware-efficient universal quantum computing. We demonstrate that cat-state qubits are also promising for error-tolerant simulations of the quantum Rabi model (and its varieties) to enhance the coupling strength between the cat-state qubit and a cavity, to reach the ultrastrong-coupling regime. This allows us to explore several fascinating quantum phenomena relying on the counter-rotating interaction. A benefit from biased-noise cat qubits is that the two main error channels (frequency and amplitude mismatches) are both exponentially suppressed. Therefore, the simulation protocols are robust against parameter errors of the parametric drive which determines the projection subspace. We analyze three examples: (i) collapse and revivals of quantum states; (ii) hidden symmetry and tunneling dynamics; and (iii) pair-cat-code computation. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 7 pages, 7 figures, comments are welcome

arXiv:2401.17642 [pdf, other]

Exploring the Common Appearance-Boundary Adaptation for Nighttime Optical Flow

Authors: Hanyu Zhou, Yi Chang, Haoyue Liu, Wending Yan, Yuxing Duan, Zhiwei Shi, Luxin Yan

Abstract: We investigate a challenging task of nighttime optical flow, which suffers from weakened texture and amplified noise. These degradations weaken discriminative visual features, thus causing invalid motion feature matching. Typically, existing methods employ domain adaptation to transfer knowledge from auxiliary domain to nighttime domain in either input visual space or output motion space. However,… ▽ More We investigate a challenging task of nighttime optical flow, which suffers from weakened texture and amplified noise. These degradations weaken discriminative visual features, thus causing invalid motion feature matching. Typically, existing methods employ domain adaptation to transfer knowledge from auxiliary domain to nighttime domain in either input visual space or output motion space. However, this direct adaptation is ineffective, since there exists a large domain gap due to the intrinsic heterogeneous nature of the feature representations between auxiliary and nighttime domains. To overcome this issue, we explore a common-latent space as the intermediate bridge to reinforce the feature alignment between auxiliary and nighttime domains. In this work, we exploit two auxiliary daytime and event domains, and propose a novel common appearance-boundary adaptation framework for nighttime optical flow. In appearance adaptation, we employ the intrinsic image decomposition to embed the auxiliary daytime image and the nighttime image into a reflectance-aligned common space. We discover that motion distributions of the two reflectance maps are very similar, benefiting us to consistently transfer motion appearance knowledge from daytime to nighttime domain. In boundary adaptation, we theoretically derive the motion correlation formula between nighttime image and accumulated events within a spatiotemporal gradient-aligned common space. We figure out that the correlation of the two spatiotemporal gradient maps shares significant discrepancy, benefitting us to contrastively transfer boundary knowledge from event to nighttime domain. Moreover, appearance adaptation and boundary adaptation are complementary to each other, since they could jointly transfer global motion and local boundary knowledge to the nighttime domain. △ Less

Submitted 31 January, 2024; originally announced January 2024.

Journal ref: International Conference on Learning Representations (ICLR), 2024

arXiv:2401.15619 [pdf, ps, other]

A semidefinite programming approach for robust elliptic localization

Authors: Wenxin Xiong, Jiajun He, Zhang-Lei Shi, Keyuan Hu, Hing Cheung So, Chi-Sing Leung

Abstract: This short communication addresses the problem of elliptic localization with outlier measurements, whose occurrences are prevalent in various location-enabled applications and can significantly compromise the positioning performance if not adequately handled. In contrast to the reliance on $M$-estimation adopted in the majority of existing solutions, we take a different path, specifically explorin… ▽ More This short communication addresses the problem of elliptic localization with outlier measurements, whose occurrences are prevalent in various location-enabled applications and can significantly compromise the positioning performance if not adequately handled. In contrast to the reliance on $M$-estimation adopted in the majority of existing solutions, we take a different path, specifically exploring the worst-case robust approximation criterion, to bolster resistance of the elliptic location estimator against outliers. From a geometric standpoint, our method boils down to pinpointing the Chebyshev center of the feasible set determined by the available bistatic ranges with bounded measurement errors. For a practical approach to the associated min-max problem, we convert it into the well-established convex optimization framework of semidefinite programming (SDP). Numerical simulations confirm that our SDP-based technique can outperform a number of existing elliptic localization schemes in terms of positioning accuracy in Gaussian mixture noise, a common type of impulsive interference in the context of range-based localization. △ Less

Submitted 28 January, 2024; originally announced January 2024.

Showing 51–100 of 1,127 results for author: Shi, Z