Search | arXiv e-print repository

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

Authors: Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, Yansheng Li

Abstract: Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs' complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at https://github.com/Luo-Z13/SkySenseGPT

Submitted 14 June, 2024; originally announced June 2024.

Comments: 30 pages, 5 figures, 19 tables, dataset and code see https://github.com/Luo-Z13/SkySenseGPT

arXiv:2406.08476 [pdf, other]

RMem: Restricted Memory Banks Improve Video Object Segmentation

Authors: Junbao Zhou, Ziqi Pang, Yu-Xiong Wang

Abstract: With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally, our insights are embodied in "RMem" ("R" for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (on the VOST dataset) and long videos (on the Long Videos dataset). Our code and demo are available at https://restricted-memory.github.io/.

Submitted 12 June, 2024; originally announced June 2024.

Comments: CVPR 2024, Project Page: https://restricted-memory.github.io/

arXiv:2403.09993 [pdf, other]

TRG-Net: An Interpretable and Controllable Rain Generator

Authors: Zhiqiang Pang, Hong Wang, Qi Xie, Deyu Meng, Zongben Xu

Abstract: Exploring and modeling the rain generation mechanism is critical for augmenting paired data to ease the training of rainy image processing models. For this task, this study proposes a novel deep learning based rain generator, which fully takes the physical generation mechanism underlying rain into consideration and explicitly encodes the learning of the fundamental rain factors (i.e., shape, orientation, length, width and sparsity) into the deep network. Its significance lies in that the generator not only elaborately designs the essential elements of rain to simulate expected rain, like conventional artificial strategies, but also finely adapts to complicated and diverse practical rainy images, like deep learning methods. By rationally adopting the filter parameterization technique, we achieve, for the first time, a deep network that is finely controllable with respect to rain factors and able to learn the distribution of these factors purely from data. Our unpaired generation experiments demonstrate that the rain generated by the proposed rain generator is not only of higher quality, but also more effective for deraining and downstream tasks compared to current state-of-the-art rain generation methods. In addition, the paired data augmentation experiments, including both in-distribution and out-of-distribution (OOD) settings, further validate the diversity of samples generated by our model for in-distribution deraining and OOD generalization tasks.

Submitted 29 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.17486 [pdf, other]

MGE: A Training-Free and Efficient Model Generation and Enhancement Scheme

Authors: Xuan Wang, Zeshan Pang, Yuliang Lu, Xuehu Yan

Abstract: To provide a foundation for the research of deep learning models, the construction of a model pool is an essential step. This paper proposes a Training-Free and Efficient Model Generation and Enhancement Scheme (MGE). This scheme primarily considers two aspects during the model generation process: the distribution of model parameters and model performance. Experimental results show that generated models are comparable to models obtained through normal training, and even superior in some cases. Moreover, the time consumed in generating models accounts for only 1% of the time required for normal model training. More importantly, with the enhancement of Evolution-MGE, generated models exhibit competitive generalization ability in few-shot tasks. In addition, the behavioral dissimilarity of generated models has potential for adversarial defense.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.12770 [pdf, other]

Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara

Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing the Japanese EmpatheticDialogues dataset - a text-based dialogue dataset consisting of 8 emotional categories from Plutchik's wheel of emotions - the Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both the random baseline and ChatGPT, in terms of F1-score, in all modules. Our model's efficacy is further confirmed by its application to the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, where it surpasses both the random baseline and ChatGPT. This consistent performance across both text-based and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication.

Submitted 20 February, 2024; originally announced February 2024.

Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

arXiv:2311.03194 [pdf]

Few-shot Learning using Data Augmentation and Time-Frequency Transformation for Time Series Classification

Authors: Hao Zhang, Zhendong Pang, Jiangpeng Wang, Teng Li

Abstract: Deep neural networks (DNNs) that tackle the time series classification (TSC) task have provided a promising framework in signal processing. In real-world applications, as data-driven models, DNNs suffer from insufficient data. Few-shot learning has been studied to deal with this limitation. In this paper, we propose a novel few-shot learning framework through data augmentation, which involves transformation through the time-frequency domain and the generation of synthetic images through random erasing. Additionally, we develop a sequence-spectrogram neural network (SSNN). This neural network model is composed of two sub-networks: one utilizes 1D residual blocks to extract features from the input sequence, while the other employs 2D residual blocks to extract features from the spectrogram representation. In the experiments, comparison studies of different existing DNN models with/without data augmentation are conducted on an amyotrophic lateral sclerosis (ALS) dataset and a wind turbine fault (WTF) dataset. The experimental results show that our proposed method achieves 93.75% F1 score and 93.33% accuracy on the ALS dataset, and 95.48% F1 score and 95.59% accuracy on the WTF dataset. Our methodology demonstrates its applicability to addressing few-shot problems in time series classification.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.12973 [pdf, other]

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

Authors: Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

Abstract: This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding.

Submitted 6 May, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: ICLR 2024 Spotlight. 23 pages, 13 figures. Code at https://github.com/ziqipang/LM4VisualEncoding

arXiv:2310.07405 [pdf, ps, other]

IRS Assisted Federated Learning: A Broadband Over-the-Air Aggregation Approach

Authors: Deyou Zhang, Ming Xiao, Zhibo Pang, Lihui Wang, H. Vincent Poor

Abstract: We consider a broadband over-the-air computation empowered model aggregation approach for wireless federated learning (FL) systems and propose to leverage an intelligent reflecting surface (IRS) to combat wireless fading and noise. We first investigate the conventional node-selection based framework, where a few edge nodes are dropped in model aggregation to control the aggregation error. We analyze the performance of this node-selection based framework and derive an upper bound on its performance loss, which is shown to be related to the selected edge nodes. Then, we seek to minimize the mean-squared error (MSE) between the desired global gradient parameters and the actually received ones by optimizing the selected edge nodes, their transmit equalization coefficients, the IRS phase shifts, and the receive factors of the cloud server. By resorting to the matrix lifting technique and difference-of-convex programming, we successfully transform the formulated optimization problem into a convex one and solve it using off-the-shelf solvers. To improve learning performance, we further propose a weight-selection based FL framework. In such a framework, we assign each edge node a proper weight coefficient in model aggregation instead of discarding any of them to reduce the aggregation error, i.e., amplitude alignment of the received local gradient parameters from different edge nodes is not required. We also analyze the performance of this weight-selection based framework and derive an upper bound on its performance loss, followed by minimizing the MSE via optimizing the weight coefficients of the edge nodes, their transmit equalization coefficients, the IRS phase shifts, and the receive factors of the cloud server. Furthermore, we use the MNIST dataset for simulations to evaluate the performance of both node-selection and weight-selection based FL frameworks.

Submitted 11 October, 2023; originally announced October 2023.

Comments: This paper has been accepted by IEEE Transactions on Wireless Communications

arXiv:2310.01351 [pdf, other]

Streaming Motion Forecasting for Autonomous Driving

Authors: Ziqi Pang, Deva Ramanan, Mengtian Li, Yu-Xiong Wang

Abstract: Trajectory forecasting is a widely-studied problem for autonomous navigation. However, existing benchmarks evaluate forecasting based on independent snapshots of trajectories, which are not representative of real-world applications that operate on a continuous stream of data. To bridge this gap, we introduce a benchmark that continuously queries future trajectories on streaming data and we refer to it as "streaming forecasting." Our benchmark inherently captures the disappearance and re-appearance of agents, presenting the emergent challenge of forecasting for occluded agents, which is a safety-critical problem yet overlooked by snapshot-based benchmarks. Moreover, forecasting in the context of continuous timestamps naturally asks for temporal coherence between predictions from adjacent timestamps. Based on this benchmark, we further provide solutions and analysis for streaming forecasting. We propose a plug-and-play meta-algorithm called "Predictive Streamer" that can adapt any snapshot-based forecaster into a streaming forecaster. Our algorithm estimates the states of occluded agents by propagating their positions with multi-modal trajectories, and leverages differentiable filters to ensure temporal consistency. Both occlusion reasoning and temporal coherence strategies significantly improve forecasting quality, resulting in 25% smaller endpoint errors for occluded agents and 10-20% smaller fluctuations of trajectories. Our work is intended to generate interest within the community by highlighting the importance of addressing motion forecasting in its intrinsic streaming setting. Code is available at https://github.com/ziqipang/StreamingForecasting.

Submitted 2 October, 2023; originally announced October 2023.

Comments: IROS 2023, 8 pages, 9 figures

arXiv:2310.00033 [pdf]

OriWheelBot: An origami-wheeled robot

Authors: Jie Liu, Zufeng Pang, Zhiyong Li, Guilin Wen, Zhoucheng Su, Junfeng He, Kaiyue Liu, Dezheng Jiang, Zenan Li, Shouyan Chen, Yang Tian, Yi Min Xie, Zhenpei Wang, Zhuangjian Liu

Abstract: Origami-inspired robots with multiple advantages, such as being lightweight, requiring less assembly, and exhibiting exceptional deformability, have received substantial and sustained attention. However, the existing origami-inspired robots are usually of limited functionalities and developing feature-rich robots is very challenging. Here, we report an origami-wheeled robot (OriWheelBot) with variable width and outstanding sand walking versatility. The OriWheelBot's ability to adjust wheel width over obstacles is achieved by origami wheels made of Miura origami. An improved version, called iOriWheelBot, is also developed to automatically judge the width of the obstacles. Three actions, namely direct pass, variable width pass, and direct return, will be carried out depending on the width of the channel between the obstacles. We have identified two motion mechanisms, i.e., sand-digging and sand-pushing, with the latter being more conducive to walking on the sand. We have systematically examined numerous sand walking characteristics, including carrying loads, climbing a slope, walking on a slope, and navigating sand pits, small rocks, and sand traps. The OriWheelBot can change its width by 40%, has a loading-carrying ratio of 66.7% on flat sand and can climb a 17-degree sand incline. The OriWheelBot can be useful for planetary subsurface exploration and disaster area rescue.

Submitted 29 September, 2023; originally announced October 2023.

Comments: 23 pages, 7 figures

arXiv:2307.15984 [pdf, other]

VATP360: Viewport Adaptive 360-Degree Video Streaming based on Tile Priority

Authors: Zhiyu Pang

Abstract: 360-degree video is becoming increasingly popular among users. Under current network bandwidth, serving high-resolution 360-degree video to users is quite difficult. Most existing work has been devoted to the prediction of user viewports or to tile-based adaptive algorithms. However, it is difficult to predict user viewports more accurately using only information such as the user's historical viewports or video saliency maps. In this paper, we propose a viewport adaptive 360-degree video streaming method based on tile priority (VATP360), which tries to balance performance and overhead. The proposed VATP360 consists of three main modules: viewport prediction, tile priority classification and bitrate allocation. In the viewport prediction module, the object motion trajectory and the predicted user's region-of-interest (ROI) are used to achieve accurate prediction of the user's future viewport. Then, the predicted viewport, along with the object motion trajectory, is fed into the proposed tile priority classification algorithm to assign different priorities to tiles, which reduces the computational complexity of the bitrate allocation module. Finally, in the bitrate allocation stage, we adaptively assign bitrates to tiles of different priority by reinforcement learning. Experimental results on publicly available datasets have demonstrated the effectiveness of the proposed method.

Submitted 27 August, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

arXiv:2306.11011 [pdf, other]

virtCCA: Virtualized Arm Confidential Compute Architecture with TrustZone

Authors: Xiangyi Xu, Wenhao Wang, Yongzheng Wu, Chenyu Wang, Huifeng Zhu, Haocheng Ma, Zhennan Min, Zixuan Pang, Rui Hou, Yier Jin

Abstract: ARM recently introduced the Confidential Compute Architecture (CCA) as part of the upcoming ARMv9-A architecture. CCA enables the support of confidential virtual machines (cVMs) within a separate world called the Realm world, providing protection from the untrusted normal world. While CCA offers a promising future for confidential computing, the widespread availability of CCA hardware is not expected in the near future, according to ARM's roadmap. To address this gap, we present virtCCA, an architecture that facilitates virtualized CCA using TrustZone, a mature hardware feature available on existing ARM platforms. Notably, virtCCA can be implemented on platforms equipped with the Secure EL2 (S-EL2) extension available from ARMv8.4 onwards, as well as on earlier platforms that lack S-EL2 support. virtCCA is fully compatible with the CCA specifications at the API level. We have developed the entire CCA software and firmware stack on top of virtCCA, including the enhancements to the normal world's KVM to support cVMs, and the TrustZone Management Monitor (TMM) that enforces isolation among cVMs and provides cVM life-cycle management. We have implemented virtCCA on real ARM servers, with and without S-EL2 support. Our evaluation, conducted on micro-benchmarks and macro-benchmarks, demonstrates that the overhead of running cVMs is acceptable compared to running normal-world VMs. Specifically, in a set of real-world workloads, the overhead of virtCCA-SEL2 is less than 29.5% for I/O intensive workloads, while virtCCA-EL3 outperforms the baseline in most cases.

Submitted 17 February, 2024; v1 submitted 19 June, 2023; originally announced June 2023.

arXiv:2305.08851 [pdf, other]

MV-Map: Offboard HD-Map Generation with Multi-view Consistency

Authors: Ziyang Xie, Ziqi Pang, Yu-Xiong Wang

Abstract: While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from different viewpoints. This is because BEV perception is typically set up in an 'onboard' manner, which restricts the computation and consequently prevents algorithms from reasoning over multiple views simultaneously. This paper overcomes these limitations and advocates a more practical 'offboard' HD-Map generation setup that removes the computation constraints, based on the fact that HD-Maps are commonly reusable infrastructures built offline in data centers. To this end, we propose a novel offboard pipeline called MV-Map that capitalizes on multi-view consistency and can handle an arbitrary number of frames with the key design of a 'region-centric' framework. In MV-Map, the target HD-Maps are created by aggregating all the frames of onboard predictions, weighted by the confidence scores assigned by an 'uncertainty network'. To further enhance multi-view consistency, we augment the uncertainty network with the global 3D structure optimized by a voxelized neural radiance field (Voxel-NeRF). Extensive experiments on nuScenes show that our MV-Map significantly improves the quality of HD-Maps, further highlighting the importance of offboard methods for HD-Map generation.

Submitted 8 October, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: ICCV 2023

arXiv:2302.03802 [pdf, other]

Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

Authors: Ziqi Pang, Jie Li, Pavel Tokmakov, Dian Chen, Sergey Zagoruyko, Yu-Xiong Wang

Abstract: This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at https://github.com/TRI-ML/PF-Track.

Submitted 3 April, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: CVPR 2023 Camera Ready, 15 pages, 8 figures

arXiv:2212.00998 [pdf, other]

Credit Assignment for Trained Neural Networks Based on Koopman Operator Theory

Authors: Zhen Liang, Changyuan Zhao, Wanwei Liu, Bai Xue, Wenjing Yang, Zhengbin Pang

Abstract: Credit assignment problem of neural networks refers to evaluating the credit of each network component to the final outputs. For an untrained neural network, approaches to tackling it have made great contributions to parameter update and model revolution during the training phase. This problem on trained neural networks receives rare attention, nevertheless, it plays an increasingly important role in neural network patch, specification and verification. Based on Koopman operator theory, this paper presents an alternative perspective of linear dynamics on dealing with the credit assignment problem for trained neural networks. Regarding a neural network as the composition of sub-dynamics series, we utilize step-delay embedding to capture snapshots of each component, characterizing the established mapping as exactly as possible. To circumvent the dimension-difference problem encountered during the embedding, a composition and decomposition of an auxiliary linear layer, termed minimal linear dimension alignment, is carefully designed with rigorous formal guarantee. Afterwards, each component is approximated by a Koopman operator and we derive the Jacobian matrix and its corresponding determinant, similar to backward propagation. Then, we can define a metric with algebraic interpretability for the credit assignment of each network component. Moreover, experiments conducted on typical neural networks demonstrate the effectiveness of the proposed method.

Submitted 2 December, 2022; originally announced December 2022.

Comments: 9 pages, 4 figures

MSC Class: 68T01 ACM Class: I.2.0

arXiv:2211.10056 [pdf, other]

Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Authors: Zongshang Pang, Yuta Nakashima, Mayu Otani, Hajime Nagahara

Abstract: Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods usually rely on heuristic training objectives such as diversity and representativeness. However, such methods need to bootstrap the online-generated summaries to compute the objectives for importance score regression. We consider such a pipeline inefficient and seek to directly quantify the frame-level importance with the help of contrastive losses in the representation learning literature. Leveraging the contrastive losses, we propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on the image classification task, the metrics can already yield high-quality importance scores, demonstrating competitive or better performance than past heavily-trained methods. We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved, and the model can also leverage a large number of random videos and generalize to test videos with decent performance. Code available at https://github.com/pangzss/pytorch-CTVSUM.

Submitted 18 November, 2022; originally announced November 2022.

Comments: To appear in WACV2023

arXiv:2209.11553 [pdf, other]

On Efficient Reinforcement Learning for Full-length Game of StarCraft II

Authors: Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, Tong Lu

Abstract: StarCraft II (SC2) poses a grand challenge for reinforcement learning (RL), of which the main difficulties include huge state space, varying action space, and a long time horizon. In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We investigate a hierarchical RL approach involving extracted macro-actions and a hierarchical architecture of neural networks. We investigate a curriculum transfer training procedure and train the agent on a single machine with 4 GPUs and 48 CPU threads. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating level built-in AI (level-7). In this extended version of the paper, we improve our architecture to train the agent against the cheating level AIs and achieve win rates of 96%, 97%, and 94% against the level-8, level-9, and level-10 AIs, respectively. Our codes are at https://github.com/liuruoze/HierNet-SC2. To provide a baseline referring to AlphaStar for our work as well as the research and open-source community, we reproduce a scaled-down version of it, mini-AlphaStar (mAS). The latest version of mAS is 1.07, which can be trained on the raw action space, which has 564 actions. It is designed to run training on a single common machine, by making the hyper-parameters adjustable. We then compare our work with mAS using the same resources and show that our method is more effective. The codes of mini-AlphaStar are at https://github.com/liuruoze/mini-AlphaStar. We hope our study could shed some light on the future research of efficient reinforcement learning on SC2 and other large-scale games.

Submitted 23 September, 2022; originally announced September 2022.

Comments: 48 pages, 21 figures

Journal ref: JAIR, 75 (2022), 213-260

arXiv:2207.06718 [pdf]

doi 10.1109/IECON49645.2022.9968471

Hardware-in-the-Loop Simulation for Evaluating Communication Impacts on the Wireless-Network-Controlled Robots

Authors: Honghao Lv, Zhibo Pang, Ming Xiao, Geng Yang

Abstract: More and more robot automation applications have changed to wireless communication, and network performance has a growing impact on robotic systems. This study proposes a hardware-in-the-loop (HiL) simulation methodology for connecting the simulated robot platform to real network devices. This project seeks to provide robotic engineers and researchers with the capability to experiment without heavily modifying the original controller and get more realistic test results that correlate with actual network conditions. We deployed this HiL simulation system in two common cases for wireless-network-controlled robotic applications: (1) safe multi-robot coordination for mobile robots, and (2) human-motion-based teleoperation for manipulators. The HiL simulation system is deployed and tested under various network conditions in all circumstances. The experiment results are analyzed and compared with the previous simulation methods, demonstrating that the proposed HiL simulation methodology can identify a more reliable communication impact on robot systems.

Submitted 28 September, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

Comments: 6 pages, 11 figures, to appear in 48th Annual Conference of the Industrial Electronics Society IECON 2022 Conference

arXiv:2207.05267 [pdf]

doi 10.1364/OE.470529

Indoor optical fiber eavesdropping approach and its avoidance

Authors: Haiqing Hao, Zhongwang Pang, Guan Wang, Bo Wang

Abstract: The optical fiber network has become a worldwide infrastructure. In addition to the basic functions in telecommunication, its sensing ability has attracted more and more attention. In this paper, we discuss the risk of household fiber being used for eavesdropping and demonstrate its performance in the lab. Using a 3-meter tail fiber in front of the household optical modem, voices of normal human speech can be eavesdropped by a laser interferometer and recovered 1.1 km away. The detection distance limit and system noise are analyzed quantitatively. We also give some practical ways to prevent eavesdropping through household fiber.

Submitted 3 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

Comments: 8 pages, 4 figures, submitted to Optics Express

arXiv:2203.00770 [pdf]

Short-Packet Interleaver against Impulse Interference in Practical Industrial Environments

Authors: Ming Zhan, Zhibo Pang, Dacfey Dzung, Kan Yu, Ming Xiao

Abstract: The most common cause of transmission failure in Wireless High Performance (WirelessHP) target industry environments is impulse interference. As interleavers are commonly used to improve the reliability on the Orthogonal Frequency Division Multiplexing (OFDM) symbol level for long packet transmission, this paper considers the feasibility of applying short-packet bit interleaving to enhance the impulse/burst interference resisting capability on both OFDM symbol and frame level. Using the Universal Software Radio Peripherals (USRP) and PC hardware platform, the Packet Error Rate (PER) performance of interleaved coded short-packet transmission with Convolutional Codes (CC), Reed-Solomon codes (RS) and RS+CC concatenated codes are tested and analyzed. Applying the IEEE 1613 standard for impulse interference generation, extensive PER tests of CC(1/2) and RS(31, 21)+CC(1/2) concatenated codes are performed. With practical experiments, we prove the effectiveness of bit-interleaved coded short-packet transmission in real factory environments. We also investigate how PER performance depends on the interleavers, codes and impulse interference power and frequency.

Submitted 1 March, 2022; originally announced March 2022.

Comments: 14 pages, 12 figures, submitted to IEEE Transactions on Wireless Communications
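
The abstract above relies on bit interleaving to spread a burst of impulse interference across several codewords, so that each codeword sees only a few errors. The following is a minimal sketch of that general idea using a textbook block interleaver; the function names and the 4x4 block size are illustrative assumptions and are not taken from the paper, which may use a different interleaver design.

```python
def interleave(bits, rows, cols):
    """Block interleaver: write row by row into a rows x cols grid,
    then read the grid out column by column."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(bits, rows, cols):
    """Inverse operation: write column by column, read row by row."""
    assert len(bits) == rows * cols
    return [bits[c * rows + r] for r in range(rows) for c in range(cols)]


def burst_positions_after_deinterleaving(start, length, rows, cols):
    """Return the original bit indices hit by a channel burst that corrupts
    `length` consecutive transmitted bits beginning at position `start`."""
    n = rows * cols
    # sent[i] is the original index of the bit transmitted at position i.
    sent = interleave(list(range(n)), rows, cols)
    return sorted(sent[start:start + length])


if __name__ == "__main__":
    # Four 4-bit codewords stored as the rows of a 4x4 block (hypothetical sizes).
    print(burst_positions_after_deinterleaving(start=4, length=4, rows=4, cols=4))
    # -> [1, 5, 9, 13]: one corrupted bit in each codeword instead of four in one
```

In this toy setting a burst of four consecutive channel errors lands on one bit of each row, which is the property that lets a code correcting a single error per codeword survive the burst.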

arXiv:2112.06375 [pdf, other]

Embracing Single Stride 3D Object Detector with Sparse Transformer

Authors: Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, Zhaoxiang Zhang

Abstract: In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to input scene size is significantly smaller compared to 2D detection cases. Overlooking this difference, many 3D detectors directly follow the common practice of 2D detectors, which downsample the feature maps even after quantizing the point clouds. In this paper, we start by rethinking how such multi-stride stereotype affects the LiDAR-based 3D object detectors. Our experiments point out that the downsampling operations bring few advantages, and lead to inevitable information loss. To remedy this issue, we propose Single-stride Sparse Transformer (SST) to maintain the original resolution from the beginning to the end of the network. Armed with transformers, our method addresses the problem of insufficient receptive field in single-stride architectures. It also cooperates well with the sparsity of point clouds and naturally avoids expensive computation. Eventually, our SST achieves state-of-the-art results on the large scale Waymo Open Dataset. It is worth mentioning that our method can achieve exciting performance (83.8 LEVEL 1 AP on validation split) on small object (pedestrian) detection due to the characteristic of single stride. Codes will be released at https://github.com/TuSimple/SST

Submitted 12 December, 2021; originally announced December 2021.

arXiv:2111.13672 [pdf, other]

Immortal Tracker: Tracklet Never Dies

Authors: Qitai Wang, Yuntao Chen, Ziqi Pang, Naiyan Wang, Zhaoxiang Zhang

Abstract: Previous online 3D Multi-Object Tracking (3DMOT) methods terminate a tracklet when it is not associated with new detections for a few frames. But if an object just goes dark, like being temporarily occluded by other objects or simply getting out of FOV, terminating a tracklet prematurely will result in an identity switch. We reveal that premature tracklet termination is the main cause of identity switches in modern 3DMOT systems. To address this, we propose Immortal Tracker, a simple tracking system that utilizes trajectory prediction to maintain tracklets for objects gone dark. We employ a simple Kalman filter for trajectory prediction and preserve the tracklet by prediction when the target is not visible. With this method, we can avoid 96% of vehicle identity switches resulting from premature tracklet termination. Without any learned parameters, our method achieves a mismatch ratio at the 0.0001 level and competitive MOTA for the vehicle class on the Waymo Open Dataset test set. Our mismatch ratio is tens of times lower than any previously published method. Similar results are reported on nuScenes. We believe the proposed Immortal Tracker can offer a simple yet powerful solution for pushing the limit of 3DMOT. Our code is available at https://github.com/ImmortalTracker/ImmortalTracker.

Submitted 26 November, 2021; originally announced November 2021.
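
The Immortal Tracker abstract above describes keeping a tracklet alive by prediction while its target is not visible, instead of terminating it after a few missed frames. The sketch below illustrates only that life-cycle idea in one dimension. It is not the authors' implementation: the `Tracklet` class, the `step` function, the gate value, and the fixed-gain (alpha-beta) update used here as a stand-in for the Kalman filter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tracklet:
    track_id: int
    x: float          # position along one axis
    v: float = 0.0    # velocity in position units per frame
    missed: int = 0   # consecutive frames without an associated detection

    def predict(self) -> None:
        # Constant-velocity motion model: advance one frame.
        self.x += self.v

    def update(self, z: float, alpha: float = 0.85, beta: float = 0.5) -> None:
        # Fixed-gain correction (alpha-beta filter), a simplified stand-in
        # for the Kalman update mentioned in the abstract.
        residual = z - self.x
        self.x += alpha * residual
        self.v += beta * residual
        self.missed = 0


def step(track: Tracklet, detection: Optional[float], gate: float = 2.0) -> Tracklet:
    """Advance one frame. If no detection falls inside the gate, the tracklet
    coasts on its prediction and is never deleted."""
    track.predict()
    if detection is not None and abs(detection - track.x) <= gate:
        track.update(detection)
    else:
        track.missed += 1
    return track


if __name__ == "__main__":
    track = Tracklet(track_id=7, x=0.0, v=1.0)
    # The object is occluded for three frames (None), then reappears.
    for z in [1.0, 2.0, None, None, None, 6.0]:
        step(track, z)
    print(track.track_id, track.x, track.missed)
```

Because the tracklet keeps predicting through the occlusion, the detection at 6.0 falls inside the gate and is matched to the same `track_id`; a tracker that deleted the tracklet after a few misses would have started a new identity instead.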

arXiv:2111.10586 [pdf, other]

Satellite Based Computing Networks with Federated Learning

Authors: Hao Chen, Ming Xiao, Zhibo Pang

Abstract: Driven by the ever-increasing penetration and proliferation of data-driven applications, a new generation of wireless communication, the sixth-generation (6G) mobile system enhanced by artificial intelligence (AI), has attracted substantial research interests. Among various candidate technologies of 6G, low earth orbit (LEO) satellites have appealing characteristics of ubiquitous wireless access. However, the costs of satellite communication (SatCom) are still high, relative to counterparts of ground mobile networks. To support massively interconnected devices with intelligent adaptive learning and reduce expensive traffic in SatCom, we propose federated learning (FL) in LEO-based satellite communication networks. We first review the state-of-the-art LEO-based SatCom and related machine learning (ML) techniques, and then analyze four possible ways of combining ML with satellite networks. The learning performance of the proposed strategies is evaluated by simulation and results reveal that FL-based computing networks improve the performance of communication overheads and latency. Finally, we discuss future research topics along this research direction.

Submitted 20 November, 2021; originally announced November 2021.

arXiv:2111.09621 [pdf, other]

SimpleTrack: Understanding and Rethinking 3D Multi-object Tracking

Authors: Ziqi Pang, Zhichao Li, Naiyan Wang

Abstract: 3D multi-object tracking (MOT) has witnessed numerous novel benchmarks and approaches in recent years, especially those under the "tracking-by-detection" paradigm. Despite their progress and usefulness, an in-depth analysis of their strengths and weaknesses is not yet available. In this paper, we summarize current 3D MOT methods into a unified framework by decomposing them into four constituent parts: pre-processing of detection, association, motion model, and life cycle management. We then ascribe the failure cases of existing algorithms to each component and investigate them in detail. Based on the analyses, we propose corresponding improvements which lead to a strong yet simple baseline: SimpleTrack. Comprehensive experimental results on Waymo Open Dataset and nuScenes demonstrate that our final method could achieve new state-of-the-art results with minor modifications. Furthermore, we take additional steps and rethink whether current benchmarks authentically reflect the ability of algorithms for real-world challenges. We delve into the details of existing benchmarks and find some intriguing facts. Finally, we analyze the distribution and causes of remaining failures in SimpleTrack and propose future directions for 3D MOT. Our code is available at https://github.com/TuSimple/SimpleTrack.

Submitted 18 November, 2021; originally announced November 2021.

arXiv:2111.00695 [pdf]

Noise Error Pattern Generation Based on Successive Addition-Subtraction for Guessing Decoding

Authors: Ming Zhan, Zhibo Pang, Kan Yu, Jing Xu, Fang Wu

Abstract: The guessing random additive noise decoding (GRAND) algorithm has emerged as an excellent decoding strategy that can meet both the high reliability and low latency constraints. This paper proposes a successive addition-subtraction algorithm to generate noise error permutations. A noise error pattern generation scheme is presented by embedding the "1" and "0" bursts alternately. Then detailed procedures of the proposed algorithm are presented, and its correctness is also demonstrated through theoretical derivations. The aim of this work is to provide a preliminary paradigm and reference for future research on GRAND algorithm and hardware implementation.

Submitted 1 November, 2021; originally announced November 2021.

Comments: 6 pages, 7 figures, submitted to IEEE Communications Letters
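
For context on the GRAND abstract above: GRAND guesses candidate noise patterns from most to least likely, subtracts each guess from the received word, and stops at the first result that is a valid codeword. The sketch below shows that generic loop for a binary channel where fewer bit flips are more likely. It does not implement the paper's successive addition-subtraction generator; patterns are enumerated with `itertools.combinations` instead, and the names `noise_patterns`, `grand_decode`, and the repetition-code membership test are assumptions made for illustration.

```python
from itertools import combinations


def noise_patterns(n):
    """Yield length-n binary noise patterns in order of increasing Hamming
    weight, i.e. most likely first on a binary symmetric channel with
    crossover probability below 0.5."""
    for weight in range(n + 1):
        for flipped in combinations(range(n), weight):
            pattern = [0] * n
            for i in flipped:
                pattern[i] = 1
            yield pattern


def grand_decode(received, is_codeword, max_guesses=10_000):
    """Return the first candidate (received XOR noise) accepted by
    is_codeword, or None if the guess budget is exhausted."""
    for guesses, noise in enumerate(noise_patterns(len(received)), start=1):
        if guesses > max_guesses:
            return None
        candidate = [r ^ e for r, e in zip(received, noise)]
        if is_codeword(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Toy code: length-3 repetition code, codewords 000 and 111.
    def in_repetition_code(word):
        return len(set(word)) == 1

    print(grand_decode([1, 0, 1], in_repetition_code))  # -> [1, 1, 1]
```

The decoder only needs a codebook membership test, which is why the order and cost of generating noise patterns, the subject of the abstract, dominate its latency.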

arXiv:2109.05889 [pdf, other]

doi 10.1109/LGRS.2021.3124804

Nonlocal Patch-Based Fully-Connected Tensor Network Decomposition for Remote Sensing Image Inpainting

Authors: Wen-Jie Zheng, Xi-Le Zhao, Yu-Bang Zheng, Zhi-Feng Pang

Abstract: Remote sensing image (RSI) inpainting plays an important role in real applications. Recently, fully-connected tensor network (FCTN) decomposition has shown a remarkable ability to fully characterize the global correlation. Considering the global correlation and the nonlocal self-similarity (NSS) of RSIs, this paper introduces the FCTN decomposition to the whole RSI and its NSS groups, and proposes a novel nonlocal patch-based FCTN (NL-FCTN) decomposition for RSI inpainting. Different from other nonlocal patch-based methods, the NL-FCTN decomposition-based method, which increases tensor order by stacking similar small-sized patches to NSS groups, cleverly leverages the remarkable ability of FCTN decomposition to deal with higher-order tensors. Besides, we propose an efficient proximal alternating minimization-based algorithm to solve the proposed NL-FCTN decomposition-based model with a theoretical convergence guarantee. Extensive experiments on RSIs demonstrate that the proposed method achieves the state-of-the-art inpainting performance among all compared methods.

Submitted 13 September, 2021; originally announced September 2021.

Journal ref: IEEE Geoscience and Remote Sensing Letters, 2021

arXiv:2103.11441 [pdf, other]

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

Authors: Tao Gui, Xiao Wang, Qi Zhang, Qin Liu, Yicheng Zou, Xin Zhou, Rui Zheng, Chong Zhang, Qinzhuo Wu, Jiacheng Ye, Zexiong Pang, Yongxin Zhang, Zhengyan Li, Ruotian Ma, Zichu Fei, Ruijian Cai, Jun Zhao, Xingwu Hu, Zhiheng Yan, Yiding Tan, Yuan Hu, Qiyuan Bian, Zhihua Liu, Bolin Zhu, Shan Qin, et al. (9 additional authors not shown)

Abstract: Various robustness evaluation methodologies from different perspectives have been proposed for different natural language processing (NLP) tasks. These methods have often focused on either universal or task-specific generalization capabilities. In this work, we propose a multilingual robustness evaluation platform for NLP tasks (TextFlint) that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis. TextFlint enables practitioners to automatically evaluate their models from all aspects or to customize their evaluations as desired with just a few lines of code. To guarantee user acceptability, all the text transformations are linguistically based, and we provide a human evaluation for each one. TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness. To validate TextFlint's utility, we performed large-scale empirical evaluations (over 67,000 evaluations) on state-of-the-art deep learning models, classic supervised methods, and real-world systems. Almost all models showed significant performance degradation, including a decline of more than 50% of BERT's prediction accuracy on tasks such as aspect-level sentiment classification, named entity recognition, and natural language inference. Therefore, we call for the robustness to be included in the model evaluation, so as to promote the healthy development of NLP technology.

Submitted 5 May, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

arXiv:2103.06028 [pdf, other]

Model-free Vehicle Tracking and State Estimation in Point Cloud Sequences

Authors: Ziqi Pang, Zhichao Li, Naiyan Wang

Abstract: Estimating the states of surrounding traffic participants stays at the core of autonomous driving. In this paper, we study a novel setting of this problem: model-free single-object tracking (SOT), which takes the object state in the first frame as input, and jointly solves state estimation and tracking in subsequent frames. The main purpose for this new setting is to break the strong limitation of the popular "detection and tracking" scheme in multi-object tracking. Moreover, we notice that shape completion by overlaying the point clouds, which is a by-product of our proposed task, not only improves the performance of state estimation but also has numerous applications. As no benchmark for this task is available so far, we construct a new dataset LiDAR-SOT and corresponding evaluation protocols based on the Waymo Open dataset. We then propose an optimization-based algorithm called SOTracker involving point cloud registration, vehicle shapes, correspondence, and motion priors. Our quantitative and qualitative results prove the effectiveness of our SOTracker and reveal the challenging cases for SOT in point clouds, including the sparsity of LiDAR data, abrupt motion variation, etc. Finally, we also explore how the proposed task and algorithm may benefit other autonomous driving applications, including simulating LiDAR scans, generating motion data, and annotating optical flow. The code and protocols for our benchmark and algorithm are available at https://github.com/TuSimple/LiDAR_SOT/. A video demonstration is at https://www.youtube.com/watch?v=BpHixKs91i8.

Submitted 5 August, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

Comments: Accepted by IROS2021, Camera ready version

arXiv:2102.01955 [pdf, other]

Predictive coding feedback results in perceived illusory contours in a recurrent neural network

Authors: Zhaoyang Pang, Callum Biggs O'May, Bhavin Choksi, Rufin VanRullen

Abstract: Modern feedforward convolutional neural networks (CNNs) can now solve some computer vision tasks at super-human levels. However, these networks only roughly mimic human visual perception. One difference from human vision is that they do not appear to perceive illusory contours (e.g. Kanizsa squares) in the same way humans do. Physiological evidence from visual cortex suggests that the perception of illusory contours could involve feedback connections. Would recurrent feedback neural networks perceive illusory contours like humans? In this work we equip a deep feedforward convolutional network with brain-inspired recurrent dynamics. The network was first pretrained with an unsupervised reconstruction objective on a natural image dataset, to expose it to natural object contour statistics. Then, a classification decision layer was added and the model was finetuned on a form discrimination task: squares vs. randomly oriented inducer shapes (no illusory contour). Finally, the model was tested with the unfamiliar "illusory contour" configuration: inducer shapes oriented to form an illusory square. Compared with feedforward baselines, the iterative "predictive coding" feedback resulted in more illusory contours being classified as physical squares. The perception of the illusory contour was measurable in the luminance profile of the image reconstructions produced by the model, demonstrating that the model really "sees" the illusion. Ablation studies revealed that natural image pretraining and feedback error correction are both critical to the perception of the illusion. Finally we validated our conclusions in a deeper network (VGG): adding the same predictive coding feedback dynamics again leads to the perception of illusory contours.

Submitted 16 June, 2021; v1 submitted 3 February, 2021; originally announced February 2021.

Comments: Manuscript under review

arXiv:2012.07748 [pdf]

Investigation of the Impacts of COVID-19 on the Electricity Consumption of a University Dormitory Using Weather Normalization

Authors: Zhihong Pang, Fan Feng, Zheng O'Neill

Abstract: This study investigated the impacts of the COVID-19 pandemic on the electricity consumption of a university dormitory building in the southern U.S. The historical electricity consumption data of this university dormitory building and weather data of an on-campus weather station, which were collected from January 1st, 2017 to July 31st, 2020, were used for analysis. Four inverse data-driven prediction models, i.e., Artificial Neural Network, Long Short-Term Memory Recurrent Neural Network, eXtreme Gradient Boosting, and Light Gradient Boosting Machine, were exploited to account for the influence of the weather conditions. The results suggested that the total electricity consumption of the objective building decreased by nearly 41% (about 276,000 kWh (942 MMBtu)) compared with the prediction value during the campus shutdown due to COVID-19. Besides, the daily load ratio (DLR) varied significantly as well. In general, the DLR decreased gradually from 80% to nearly 40% in the second half of March 2020, remained at a relatively stable level between 30% and 60% in April, May, and June 2020, and then slowly recovered to 80% of the normal capacity in July 2020.

Submitted 4 December, 2020; originally announced December 2020.
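
The figures quoted in the dormitory abstract above can be cross-checked with simple arithmetic. The sketch below converts the reported 276,000 kWh reduction to MMBtu using the standard factor of about 3,412.14 Btu per kWh, and back-computes the weather-normalized prediction implied by a 41% reduction. The function names are illustrative, and because the abstract says "nearly 41%" and "about 276,000 kWh", the implied baseline is only a rough estimate, not a number reported by the authors.

```python
BTU_PER_KWH = 3412.14  # standard energy conversion factor


def kwh_to_mmbtu(kwh):
    """Convert kilowatt-hours to million British thermal units."""
    return kwh * BTU_PER_KWH / 1_000_000


def implied_baseline_kwh(reduction_kwh, reduction_fraction):
    """Back out the predicted consumption from an absolute reduction and the
    fraction of the prediction that the reduction represents."""
    return reduction_kwh / reduction_fraction


if __name__ == "__main__":
    reduction_kwh = 276_000
    print(round(kwh_to_mmbtu(reduction_kwh)))  # -> 942, matching the abstract
    # Rough estimate only: both inputs are rounded in the abstract.
    print(round(implied_baseline_kwh(reduction_kwh, 0.41)))
```

The conversion reproduces the 942 MMBtu stated in the abstract, and the implied predicted consumption over the shutdown period comes out at roughly 673,000 kWh.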

arXiv:2007.07437 [pdf]

ContourRend: A Segmentation Method for Improving Contours by Rendering

Authors: Junwen Chen, Yi Lu, Yaran Chen, Dongbin Zhao, Zhonghua Pang

Abstract: A good object segmentation should contain clear contours and complete regions. However, mask-based segmentation cannot handle contour features well on a coarse prediction grid, thus causing problems of blurry edges. Contour-based segmentation provides contours directly but misses contour details. In order to obtain fine contours, we propose a segmentation method named ContourRend which adopts a contour renderer to refine segmentation contours. We implement our method on a segmentation model based on graph convolutional network (GCN). For the single object segmentation task on cityscapes dataset, the GCN-based segmentation contour is used to generate a contour of a single object, then our contour renderer focuses on the pixels around the contour and predicts the category at high resolution. By rendering the contour result, our method reaches 72.41% mean intersection over union (IoU) and surpasses baseline Polygon-GCN by 1.22%.

Submitted 14 July, 2020; originally announced July 2020.
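
The ContourRend abstract above reports its result as mean intersection over union (IoU). As a reminder of what that metric measures, here is a minimal sketch that computes IoU for binary masks stored as nested lists of 0/1 values and averages it over several objects. The function names, the convention of returning 1.0 for two empty masks, and the averaging over object instances are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks,
    given as lists of rows containing 0 or 1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (prediction, target) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    pred = [[1, 1],
            [0, 0]]
    target = [[1, 0],
              [1, 0]]
    print(iou(pred, target))  # 1 shared pixel out of 3 covered pixels
    print(mean_iou([(pred, target), (target, target)]))
```

Because IoU penalizes every pixel on which the two masks disagree, errors concentrated along object boundaries, the focus of the abstract, lower the score even when the bulk of the region is correct.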

arXiv:2003.11941 [pdf, other]

AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online

Authors: Guangda Huzhang, Zhen-Jia Pang, Yongqing Gao, Yawen Liu, Weijie Shen, Wen-Ji Zhou, Qing Da, An-Xiang Zeng, Han Yu, Yang Yu, Zhi-Hua Zhou

Abstract: Learning-to-rank (LTR) has become a key technology in E-commerce applications. Most existing LTR approaches follow a supervised learning paradigm from offline labeled data collected from the online system. However, it has been noticed that previous LTR models can have a good validation performance over offline validation data but have a poor online performance, and vice versa, which implies a possible large inconsistency between the offline and online evaluation. We investigate and confirm in this paper that such inconsistency exists and can have a significant impact on AliExpress Search. Reasons for the inconsistency include the ignorance of item context during the learning, and the offline data set is insufficient for learning the context. Therefore, this paper proposes an evaluator-generator framework for LTR with item context. The framework consists of an evaluator that generalizes to evaluate recommendations involving the context, and a generator that maximizes the evaluator score by reinforcement learning, and a discriminator that ensures the generalization of the evaluator. Extensive experiments in simulation environments and AliExpress Search online system show that, firstly, the classic data-based metrics on the offline dataset can show significant inconsistency with online performance, and can even be misleading. Secondly, the proposed evaluator score is significantly more consistent with the online performance than common ranking metrics. Finally, as a consequence, our method achieves a significant improvement (>2%) in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.

Submitted 31 December, 2020; v1 submitted 25 March, 2020; originally announced March 2020.

arXiv:1912.07186

Minimizing Age of Information for Real-Time Monitoring in Resource-Constrained Industrial IoT Networks

Authors: Qian Wang, He Chen, Yonghui Li, Zhibo Pang, Branka Vucetic

Abstract: This paper considers an Industrial Internet of Things (IIoT) system with a source monitoring a dynamic process with randomly generated status updates. The status updates are sent to a designated destination in a real-time manner over an unreliable link. The source is subject to a practical constraint of limited average transmission power. Thus, the system should carefully schedule when to transmit a fresh status update or retransmit the stale one. To characterize the performance of timely status updates, we adopt a recent concept, Age of Information (AoI), as the performance metric. We aim to minimize the long-term average AoI under the limited average transmission power at the source, by formulating a constrained Markov Decision Process (CMDP) problem. To address the formulated CMDP, we recast it into an unconstrained Markov Decision Process (MDP) through Lagrangian relaxation. We prove the existence of an optimal stationary policy of the original CMDP, which is a randomized mixture of two deterministic stationary policies of the unconstrained MDP. We also explore the characteristics of the problem to reduce the action space of each state and thereby significantly reduce the computational complexity. We further prove the threshold structure of the optimal deterministic policy for the unconstrained MDP. Simulation results show that the proposed optimal policy achieves lower average AoI compared with a random policy, especially when the system suffers from stricter resource constraints. In addition, the influence of the status generation probability and transmission failure rate on the optimal policy and the resultant average AoI, as well as the impact of average transmission power on the minimal average AoI, are unveiled.

Submitted 15 December, 2019; originally announced December 2019.

arXiv:1911.12911

Unlocking the Full Potential of Small Data with Diverse Supervision

Authors: Ziqi Pang, Zhiyuan Hu, Pavel Tokmakov, Yu-Xiong Wang, Martial Hebert

Abstract: Virtually all of the deep learning literature relies on the assumption of large amounts of available training data. Indeed, even the majority of few-shot learning methods rely on a large set of "base classes" for pretraining. This assumption, however, does not always hold. For some tasks, annotating a large number of classes can be infeasible, and even collecting the images themselves can be a challenge in some scenarios. In this paper, we study this problem and call it the "Small Data" setting, in contrast to "Big Data". To unlock the full potential of small data, we propose to augment the models with annotations for other related tasks, thus increasing their generalization abilities. In particular, we use the richly annotated scene parsing dataset ADE20K to construct our realistic Long-tail Recognition with Diverse Supervision (LRDS) benchmark by splitting the object categories into head and tail based on their distribution. Following the standard few-shot learning protocol, we use the head classes for representation learning and the tail classes for evaluation. Moreover, we further subsample the head categories and images to generate two novel settings which we call "Scarce-Class" and "Scarce-Image", respectively corresponding to the shortage of samples for rare classes and training images. Finally, we analyze the effect of applying various additional supervision sources under the proposed settings. Our experiments demonstrate that densely labeling a small set of images can indeed largely remedy the small data constraints.

Submitted 26 April, 2021; v1 submitted 28 November, 2019; originally announced November 2019.

Comments: Learning from Limited and Imperfect Data (L2ID) Workshop @ CVPR 2021

arXiv:1903.00715

Efficient Reinforcement Learning for StarCraft by Abstract Forward Models and Transfer Learning

Authors: Ruo-Ze Liu, Haifeng Guo, Xiaozhong Ji, Yang Yu, Zhen-Jia Pang, Zitai Xiao, Yuzhou Wu, Tong Lu

Abstract: Injecting human knowledge is an effective way to accelerate reinforcement learning (RL). However, these methods are underexplored. This paper presents our discovery that an abstract forward model (thought-game (TG)) combined with transfer learning (TL) is an effective way. We take StarCraft II as our study environment. With the help of a designed TG, the agent can learn a 99% win-rate on a 64x64 map against the Level-7 built-in AI, using only 1.08 hours on a single commercial machine. We also show that the TG method is not as restrictive as it was thought to be. It can work with roughly designed TGs, and can also be useful when the environment changes. Compared with previous model-based RL, we show TG is more effective. We also present a TG hypothesis that gives the influence of different fidelity levels of TG. For real games that have unequal state and action spaces, we propose a novel XfrNet, whose usefulness is validated by achieving a 90% win-rate against the cheating Level-10 AI. We argue that the TG method might shed light on further studies of efficient RL with human knowledge.

Submitted 2 November, 2021; v1 submitted 2 March, 2019; originally announced March 2019.

arXiv:1811.06166

Tiyuntsong: A Self-Play Reinforcement Learning Approach for ABR Video Streaming

Authors: Tianchi Huang, Xin Yao, Chenglei Wu, Rui-Xiao Zhang, Zhangyuan Pang, Lifeng Sun

Abstract: Existing reinforcement learning (RL)-based adaptive bitrate (ABR) approaches outperform the previous fixed control rules based methods by improving the Quality of Experience (QoE) score, as the QoE metric can hardly provide clear guidance for optimization, finally resulting in unexpected strategies. In this paper, we propose Tiyuntsong, a self-play reinforcement learning approach with a generative adversarial network (GAN)-based method for ABR video streaming. Tiyuntsong learns strategies automatically by training two agents that compete against each other. Note that the competition results are determined by a set of rules rather than a numerical QoE score, which allows clearer optimization objectives. Meanwhile, we propose a GAN Enhancement Module to extract hidden features from the past status, preserving the information without the limitations of sequence lengths. Using testbed experiments, we show that the utilization of GAN significantly improves Tiyuntsong's performance. By comparing the performance of ABRs, we observe that Tiyuntsong also betters existing ABR algorithms in the underlying metrics.

Submitted 2 May, 2019; v1 submitted 14 November, 2018; originally announced November 2018.

Comments: Published in ICME 2019

arXiv:1809.09095

On Reinforcement Learning for Full-length Game of StarCraft

Authors: Zhen-Jia Pang, Ruo-Ze Liu, Zhou-Yu Meng, Yi Zhang, Yang Yu, Tong Lu

Abstract: StarCraft II poses a grand challenge for reinforcement learning. Its main difficulties include a huge state and action space and a long time horizon. In this paper, we investigate a hierarchical reinforcement learning approach for StarCraft II. The hierarchy involves two levels of abstraction. One is the macro-action automatically extracted from expert's trajectories, which reduces the action space by an order of magnitude yet remains effective. The other is a two-layer hierarchical architecture which is modular and easy to scale, enabling a curriculum transferring from simpler tasks to more complex tasks. The reinforcement training algorithm for this architecture is also investigated. On a 64x64 map and using restrictive units, we achieve a winning rate of more than 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat model, we can achieve over 93% winning rate of Protoss against the most difficult non-cheating built-in AI (level-7) of Terran, training within two days using a single machine with only 48 CPU cores and 8 K40 GPUs. It also shows strong generalization performance when tested against never seen opponents, including cheating levels built-in AI and all levels of Zerg and Protoss built-in AI. We hope this study could shed some light on the future research of large-scale reinforcement learning.

Submitted 3 February, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

Comments: Appeared in AAAI 2019

arXiv:1808.02079

Low-latency Networking: Where Latency Lurks and How to Tame It

Authors: Xiaolin Jiang, Hossein S. Ghadikolaei, Gabor Fodor, Eytan Modiano, Zhibo Pang, Michele Zorzi, Carlo Fischione

Abstract: While the current generation of mobile and fixed communication networks has been standardized for mobile broadband services, the next generation is driven by the vision of the Internet of Things and mission critical communication services requiring latency in the order of milliseconds or sub-milliseconds. However, these new stringent requirements have a large technical impact on the design of all layers of the communication protocol stack. The cross layer interactions are complex due to the multiple design principles and technologies that contribute to the layers' design and fundamental performance limitations. We will be able to develop low-latency networks only if we address the problem of these complex interactions from the new point of view of sub-milliseconds latency. In this article, we propose a holistic analysis and classification of the main design principles and enabling technologies that will make it possible to deploy low-latency wireless communication networks. We argue that these design principles and enabling technologies must be carefully orchestrated to meet the stringent requirements and to manage the inherent trade-offs between low latency and traditional performance metrics. We also review currently ongoing standardization activities in prominent standards associations, and discuss open problems for future research.

Submitted 6 August, 2018; originally announced August 2018.

arXiv:1702.06700

Task-driven Visual Saliency and Attention-based Visual Question Answering

Authors: Yuetan Lin, Zhangyang Pang, Donghui Wang, Yueting Zhuang

Abstract: Visual question answering (VQA) has witnessed great progress since May 2015 as a classic problem unifying visual and textual data into a system. Many enlightening VQA works explore deep into the image and question encodings and fusing methods, of which attention is the most effective and infusive mechanism. Current attention based methods focus on adequate fusion of visual and textual features, but lack the attention to where people focus to ask questions about the image. Traditional attention based methods attach a single value to the feature at each spatial location, which loses much useful information. To remedy these problems, we propose a general method to perform saliency-like pre-selection on overlapped region features by the interrelation of bidirectional LSTM (BiLSTM), and use a novel element-wise multiplication based attention method to capture more competent correlation information between visual and textual features. We conduct experiments on the large-scale COCO-VQA dataset and analyze the effectiveness of our model demonstrated by strong empirical results.

Submitted 22 February, 2017; originally announced February 2017.

Comments: 8 pages, 3 figures

arXiv:1605.09116

Image segmentation based on the hybrid total variation model and the K-means clustering strategy

Authors: Baoli Shi, Zhi-Feng Pang, **g Xu

Abstract: The performance of image segmentation relies heavily on the original input image. When the image is contaminated by noise or blur, we cannot obtain an efficient segmentation result by using direct segmentation methods. In order to efficiently segment the contaminated image, this paper proposes a two-step method based on the hybrid total variation model with a box constraint and the K-means clustering method. In the first step, the hybrid model is based on the weighted convex combination between the total variation functional and the high-order total variation as the regularization term to obtain the original clustering data. In order to deal with the non-smooth regularization term, we solve this model by employing the alternating split Bregman method. Then, in the second step, the segmentation can be obtained by thresholding this clustering data into different phases, where the thresholds can be given by using the K-means clustering method. Numerical comparisons show that our proposed model can provide more efficient segmentation results when dealing with noisy and blurred images.

Submitted 30 May, 2016; originally announced May 2016.

arXiv:1509.07211

Noise-Robust ASR for the third 'CHiME' Challenge Exploiting Time-Frequency Masking based Multi-Channel Speech Enhancement and Recurrent Neural Network

Authors: Zaihu Pang, Fengyun Zhu

Abstract: In this paper, the Lingban entry to the third 'CHiME' speech separation and recognition challenge is presented. A time-frequency masking based speech enhancement front-end is proposed to suppress the environmental noise utilizing multi-channel coherence and spatial cues. The state-of-the-art speech recognition techniques, namely recurrent neural network based acoustic and language modeling, state space minimum Bayes risk based discriminative acoustic modeling, and i-vector based acoustic condition modeling, are carefully integrated into the speech recognition back-end. To further improve the system performance by fully exploiting the advantages of different technologies, the final recognition results are obtained by lattice combination and rescoring. Evaluations carried out on the official dataset prove the effectiveness of the proposed systems. Compared with the best baseline result, the proposed system obtains consistent improvements with over 57% relative word error rate reduction on the real-data test set.

Submitted 23 September, 2015; originally announced September 2015.

Comments: The 3rd 'CHiME' Speech Separation and Recognition Challenge, 5 pages, 1 figure

arXiv:1110.1804

The proximal point method for a hybrid model in image restoration

Authors: Zhi-Feng Pang, Li-Lian Wang, Yu-Fei Yang

Abstract: Models including two $L^1$-norm terms have been widely used in image restoration. In this paper we first propose the alternating direction method of multipliers (ADMM) to solve this class of models. Based on ADMM, we then propose the proximal point method (PPM), which is more efficient than ADMM. Following the operator theory, we also give the convergence analysis of the proposed methods. Furthermore, we use the proposed methods to solve a class of hybrid models combining the ROF model with the LLT model. Some numerical results demonstrate the viability and efficiency of the proposed methods.

Submitted 25 August, 2012; v1 submitted 9 October, 2011; originally announced October 2011.

Comments: Since we find that there are some unsuitable errors, I withdraw this paper from this website!

Showing 1–42 of 42 results for author: Pang, Z