Search | arXiv e-print repository

doi 10.1109/ICME55011.2023.00094

Generative Iris Prior Embedded Transformer for Iris Restoration

Authors: Yubo Huang, Jia Wang, Peipei Li, Liuyu Xiang, Peigang Li, Zhaofeng He

Abstract: Iris restoration from complexly degraded iris images, aiming to improve iris recognition performance, is a challenging problem. Due to the complex degradation, directly training a convolutional neural network (CNN) without prior cannot yield satisfactory results. In this work, we propose a generative iris prior embedded Transformer model (Gformer), in which we build a hierarchical encoder-decoder network employing Transformer block and generative iris prior. First, we tame Transformer blocks to model long-range dependencies in target images. Second, we pretrain an iris generative adversarial network (GAN) to obtain the rich iris prior, and incorporate it into the iris restoration process with our iris feature modulator. Our experiments demonstrate that the proposed Gformer outperforms state-of-the-art methods. Besides, iris recognition performance has been significantly improved after applying Gformer.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Our code is available at https://github.com/sawyercharlton/Gformer

Journal ref: 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 2023, pp. 510-515

arXiv:2406.14903 [pdf, other]

GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models

Authors: Leyan Wang, Yonggang Jin, Tianhao Shen, Tianyu Zheng, Xinrun Du, Chenchen Zhang, Wenhao Huang, Jiaheng Liu, Shi Wang, Ge Zhang, Liuyu Xiang, Zhaofeng He

Abstract: As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals' group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at https://github.com/GIEBench/GIEBench.

Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.10305 [pdf]

Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models

Authors: Jie Chen, Xintian Han, Yu Ma, Xun Zhou, Liang Xiang

Abstract: Automatic code generation has been a longstanding research topic. With the advancement of general-purpose large language models (LLMs), the ability to code stands out as one important measure to the model's reasoning performance. Usually, a two-stage training paradigm is implemented to obtain a Code LLM, namely the pretraining and the fine-tuning. Within the fine-tuning, supervised fine-tuning (SFT), and reinforcement learning (RL) are often used to improve the model's zero-shot ability. A large number of work has been conducted to improve the model's performance on code-related benchmarks with either modifications to the algorithm or refinement of the dataset. However, we still lack a deep insight into the correlation between SFT and RL. For instance, what kind of dataset should be used to ensure generalization, or what if we abandon the SFT phase in fine-tuning. In this work, we make an attempt to understand the correlation between SFT and RL. To facilitate our research, we manually craft 100 basis python functions, called atomic functions, and then a synthesizing pipeline is deployed to create a large number of synthetic functions on top of the atomic ones. In this manner, we ensure that the train and test sets remain distinct, preventing data contamination. Through comprehensive ablation study, we find: (1) Both atomic and synthetic functions are indispensable for SFT's generalization, and only a handful of synthetic functions are adequate; (2) Through RL, the SFT's generalization to target domain can be greatly enhanced, even with the same training prompts; (3) Training RL from scratch can alleviate the over-fitting issue introduced in the SFT phase.

Submitted 13 June, 2024; originally announced June 2024.
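
The synthesizing step mentioned in the abstract above (synthetic functions built on top of hand-written atomic ones) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the three atomic functions, the `compose` and `synthesize` helpers, and the choice to chain exactly two atomic functions are all hypothetical, since the abstract does not say how synthetic functions are constructed.

```python
from itertools import permutations
from typing import Callable, Dict, List

ListFn = Callable[[List[int]], List[int]]

# Hypothetical "atomic" functions; the abstract mentions 100 hand-crafted ones.
def reverse_list(xs: List[int]) -> List[int]:
    return xs[::-1]

def drop_negatives(xs: List[int]) -> List[int]:
    return [x for x in xs if x >= 0]

def double_each(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]

ATOMIC: Dict[str, ListFn] = {
    f.__name__: f for f in (reverse_list, drop_negatives, double_each)
}

def compose(first: ListFn, second: ListFn) -> ListFn:
    """Build a new function that applies `first`, then `second`."""
    def composed(xs: List[int]) -> List[int]:
        return second(first(xs))
    composed.__name__ = f"{first.__name__}_then_{second.__name__}"
    return composed

def synthesize(atomic: Dict[str, ListFn]) -> Dict[str, ListFn]:
    """Assumed synthesis rule: every ordered pair of distinct atomic functions."""
    synthetic: Dict[str, ListFn] = {}
    for a, b in permutations(atomic.values(), 2):
        g = compose(a, b)
        synthetic[g.__name__] = g
    return synthetic

SYNTHETIC = synthesize(ATOMIC)
```

With three atomic functions this rule yields six synthetic ones. Because each synthetic function is generated rather than collected, a held-out subset of them can serve as a test set that shares no examples with the training set, which is the contamination argument the abstract makes.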

arXiv:2406.04721 [pdf, other]

End-to-End Design of Polar Coded Integrated Data and Energy Networking

Authors: Jie Hu, Jingwen Cui, Kun Yang

Abstract: In order to transmit data and transfer energy to the low-power Internet of Things (IoT) devices, integrated data and energy networking (IDEN) system may be harnessed. In this context, we propose a bitwise end-to-end design for polar coded IDEN systems, where the conventional encoding/decoding, modulation/demodulation, and energy harvesting (EH) modules are replaced by the neural networks (NNs). In this way, the entire system can be treated as an AutoEncoder (AE) and trained in an end-to-end manner, hence achieving global optimization. Additionally, we improve the common NN-based belief propagation (BP) decoder by adding an extra hypernetwork, which generates the corresponding NN weights for the main network under different number of iterations, thus the adaptability of the receiver architecture can be further enhanced. Our numerical results demonstrate that our BP-based end-to-end design is superior to conventional BP-based counterparts in terms of both the BER and power transfer, but it is inferior to the successive cancellation list (SCL)-based conventional IDEN system, which may be due to the inherent performance gap between the BP and SCL decoders.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.20234 [pdf, other]

Context Injection Attacks on Large Language Models

Authors: Cheng'an Wei, Kai Chen, Yue Zhao, Yujia Gong, Lu Xiang, Shenchen Zhu

Abstract: Large Language Models (LLMs) such as ChatGPT and Llama-2 have become prevalent in real-world applications, exhibiting impressive text generation performance. LLMs are fundamentally developed from a scenario where the input data remains static and lacks a clear structure. To behave interactively over time, LLM-based chat systems must integrate additional contextual information (i.e., chat history) into their inputs, following a pre-defined structure. This paper identifies how such integration can expose LLMs to misleading context from untrusted sources and fail to differentiate between system and user inputs, allowing users to inject context. We present a systematic methodology for conducting context injection attacks aimed at eliciting disallowed responses by introducing fabricated context. This could lead to illegal actions, inappropriate content, or technology misuse. Our context fabrication strategies, acceptance elicitation and word anonymization, effectively create misleading contexts that can be structured with attacker-customized prompt templates, achieving injection through malicious user messages. Comprehensive evaluations on real-world LLMs such as ChatGPT and Llama-2 confirm the efficacy of the proposed attack with success rates reaching 97%. We also discuss potential countermeasures that can be adopted for attack detection and developing more secure models. Our findings provide insights into the challenges associated with the real-world deployment of LLMs for interactive and structured data scenarios.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.18551 [pdf, other]

Photorealistic Robotic Simulation using Unreal Engine 5 for Agricultural Applications

Authors: Xingjian Li, Lirong Xiang

Abstract: This work presents a new robotics simulation environment built upon Unreal Engine 5 (UE5) for agricultural image data generation. The simulation utilizes the state-of-the-art real-time rendering engine to provide realistic plant images which are often used in agricultural applications. This study showcases the rendering accuracy of UE5 in comparison to existing tools and assesses its positional accuracy when integrated with Robot Operating Systems (ROS). The results indicate that UE5 achieves an impressive average distance error of 0.021mm when compared to predetermined setpoints in a multi-robot setup involving two UR10 arms.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 3 pages, 4 figures, extended abstract accepted at IROS 2023 Workshop on Agricultural Robotics for a Sustainable Future (WARS_1)

arXiv:2405.16930 [pdf, other]

From Obstacle to Opportunity: Enhancing Semi-supervised Learning with Synthetic Data

Authors: Zerun Wang, Jiafeng Mao, Liuyu Xiang, Toshihiko Yamasaki

Abstract: Semi-supervised learning (SSL) can utilize unlabeled data to enhance model performance. In recent years, with increasingly powerful generative models becoming available, a large number of synthetic images have been uploaded to public image sets. Therefore, when collecting unlabeled data from these sources, the inclusion of synthetic images is inevitable. This prompts us to consider the impact of unlabeled data mixed with real and synthetic images on SSL. In this paper, we set up a new task, Real and Synthetic hybrid SSL (RS-SSL), to investigate this problem. We discover that current SSL methods are unable to fully utilize synthetic data and are sometimes negatively affected. Then, by analyzing the issues caused by synthetic images, we propose a new SSL method, RSMatch, to tackle the RS-SSL problem. Extensive experimental results show that RSMatch can better utilize the synthetic data in unlabeled images to improve the SSL performance. The effectiveness is further verified through ablation studies and visualization.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.15157 [pdf, other]

Rethinking Class-Incremental Learning from a Dynamic Imbalanced Learning Perspective

Authors: Leyuan Wang, Liuyu Xiang, Yunlong Wang, Huijia Wu, Zhaofeng He

Abstract: Deep neural networks suffer from catastrophic forgetting when continually learning new concepts. In this paper, we analyze this problem from a data imbalance point of view. We argue that the imbalance between old task and new task data contributes to forgetting of the old tasks. Moreover, the increasing imbalance ratio during incremental learning further aggravates the problem. To address the dynamic imbalance issue, we propose Uniform Prototype Contrastive Learning (UPCL), where uniform and compact features are learned. Specifically, we generate a set of non-learnable uniform prototypes before each task starts. Then we assign these uniform prototypes to each class and guide the feature learning through prototype contrastive learning. We also dynamically adjust the relative margin between old and new classes so that the feature distribution will be maintained balanced and compact. Finally, we demonstrate through extensive experiments that the proposed method achieves state-of-the-art performance on several benchmark datasets including CIFAR100, ImageNet100 and TinyImageNet.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.15155 [pdf, other]

CLIP model is an Efficient Online Lifelong Learner

Authors: Leyuan Wang, Liuyu Xiang, Yujie Wei, Yunlong Wang, Zhaofeng He

Abstract: Online Lifelong Learning (OLL) addresses the challenge of learning from continuous and non-stationary data streams. Existing online lifelong learning methods based on image classification models often require preset conditions such as the total number of classes or maximum memory capacity, which hinders the realization of real never-ending learning and renders them impractical for real-world scenarios. In this work, we propose that vision-language models, such as Contrastive Language-Image Pretraining (CLIP), are more suitable candidates for online lifelong learning. We discover that maintaining symmetry between image and text is crucial during Parameter-Efficient Tuning (PET) for CLIP model in online lifelong learning. To this end, we introduce the Symmetric Image-Text (SIT) tuning strategy. We conduct extensive experiments on multiple lifelong learning benchmark datasets and elucidate the effectiveness of SIT through gradient analysis. Additionally, we assess the impact of lifelong learning on generalizability of CLIP and found that tuning the image encoder is beneficial for lifelong learning, while tuning the text encoder aids in zero-shot learning.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.07181 [pdf, other]

BAMBOO: a predictive and transferable machine learning force field framework for liquid electrolyte development

Authors: Sheng Gong, Yumin Zhang, Zhenliang Mu, Zhichen Pu, Hongyi Wang, Zhiao Yu, Mengyi Chen, Tianze Zheng, Zhi Wang, Lifei Chen, Xiaojie Wu, Shaochen Shi, Weihao Gao, Wen Yan, Liang Xiang

Abstract: Despite the widespread applications of machine learning force field (MLFF) on solids and small molecules, there is a notable gap in applying MLFF to complex liquid electrolytes. In this work, we introduce BAMBOO (ByteDance AI Molecular Simulation Booster), a novel framework for molecular dynamics (MD) simulations, with a demonstration of its capabilities in the context of liquid electrolytes for lithium batteries. We design a physics-inspired graph equivariant transformer architecture as the backbone of BAMBOO to learn from quantum mechanical simulations. Additionally, we pioneer an ensemble knowledge distillation approach and apply it on MLFFs to improve the stability of MD simulations. Finally, we propose the density alignment algorithm to align BAMBOO with experimental measurements. BAMBOO demonstrates state-of-the-art accuracy in predicting key electrolyte properties such as density, viscosity, and ionic conductivity across various solvents and salt combinations. Our current model, trained on more than 15 chemical species, achieves the average density error of 0.01 g/cm$^3$ on various compositions compared with experimental data. Moreover, our model demonstrates transferability to molecules not included in the quantum mechanical dataset. We envision this work as paving the way to a "universal MLFF" capable of simulating properties of common organic liquids.

Submitted 22 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.07084 [pdf, other]

Dynamic Generation of Personalities with Large Language Models

Authors: Jianzhi Liu, Hexiang Gu, Tianyu Zheng, Liuyu Xiang, Huijia Wu, Jie Fu, Zhaofeng He

Abstract: In the realm of mimicking human deliberation, large language models (LLMs) show promising performance, thereby amplifying the importance of this research area. Deliberation is influenced by both logic and personality. However, previous studies predominantly focused on the logic of LLMs, neglecting the exploration of personality aspects. In this work, we introduce Dynamic Personality Generation (DPG), a dynamic personality generation method based on Hypernetworks. Initially, we embed the Big Five personality theory into GPT-4 to form a personality assessment machine, enabling it to evaluate characters' personality traits from dialogues automatically. We propose a new metric to assess personality generation capability based on this evaluation method. Then, we use this personality assessment machine to evaluate dialogues in script data, resulting in a personality-dialogue dataset. Finally, we fine-tune DPG on the personality-dialogue dataset. Experiments prove that DPG's personality generation capability is stronger after fine-tuning on this dataset than traditional fine-tuning methods, surpassing prompt-based GPT-4.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2403.10978 [pdf, other]

Entity Alignment with Unlabeled Dangling Cases

Authors: Hang Yin, Dong Ding, Liyao Xiang, Yuheng He, Yihan Wu, Xinbing Wang, Chenghu Zhou

Abstract: We investigate the entity alignment problem with unlabeled dangling cases, meaning that there are entities in the source or target graph having no counterparts in the other, and those entities remain unlabeled. The problem arises when the source and target graphs are of different scales, and it is much cheaper to label the matchable pairs than the dangling entities. To solve the issue, we propose a novel GNN-based dangling detection and entity alignment framework. While the two tasks share the same GNN and are trained together, the detected dangling entities are removed in the alignment. Our framework is featured by a designed entity and relation attention mechanism for selective neighborhood aggregation in representation learning, as well as a positive-unlabeled learning loss for an unbiased estimation of dangling entities. Experimental results have shown that each component of our design contributes to the overall alignment performance which is comparable or superior to baselines, even if the baselines additionally have 30% of the dangling entities labeled as training data.

Submitted 16 March, 2024; originally announced March 2024.

Comments: 14 pages

ACM Class: I.2.4; H.3.3

arXiv:2403.05842 [pdf, other]

Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Authors: Hengyuan Xu, Liyao Xiang, Xingjun Ma, Borui Yang, Baochun Li

Abstract: With the blossom of deep learning models and services, it has become an imperative concern to safeguard the valuable model parameters from being stolen. Watermarking is considered an important tool for ownership verification. However, current watermarking schemes are customized for different models and tasks, hard to be integrated as an integrated intellectual protection service. We propose Hufu, a modality-agnostic watermarking system for pre-trained Transformer-based models, relying on the permutation equivariance property of Transformers. Hufu embeds watermark by fine-tuning the pre-trained model on a set of data samples specifically permuted, and the embedded model essentially contains two sets of weights -- one for normal use and the other for watermark extraction which is triggered on permuted inputs. The permutation equivariance ensures minimal interference between these two sets of model weights and thus high fidelity on downstream tasks. Since our method only depends on the model itself, it is naturally modality-agnostic, task-independent, and trigger-sample-free. Extensive experiments on the state-of-the-art vision Transformers, BERT, and GPT2 have demonstrated Hufu's superiority in meeting watermarking requirements including effectiveness, efficiency, fidelity, and robustness, showing its great potential to be deployed as a uniform ownership verification service for various Transformers.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.02576 [pdf, other]

AceMap: Knowledge Discovery through Academic Graph

Authors: Xinbing Wang, Luoyi Fu, Xiaoying Gan, Ying Wen, Guanjie Zheng, Jiaxin Ding, Liyao Xiang, Nanyang Ye, Meng Jin, Shiyu Liang, Bin Lu, Haiwen Wang, Yi Xu, Cheng Deng, Shao Zhang, Huquan Kang, Xingli Wang, Qi Li, Zhixin Guo, Jiexing Qi, Pan Liu, Yuyang Ren, Lyuwen Wu, Jungang Yang, Jianping Zhou, et al. (1 additional author not shown)

Abstract: The exponential growth of scientific literature requires effective management and extraction of valuable insights. While existing scientific search engines excel at delivering search results based on relational databases, they often neglect the analysis of collaborations between scientific entities and the evolution of ideas, as well as the in-depth analysis of content within scientific publications. The representation of heterogeneous graphs and the effective measurement, analysis, and mining of such graphs pose significant challenges. To address these challenges, we present AceMap, an academic system designed for knowledge discovery through academic graph. We present advanced database construction techniques to build the comprehensive AceMap database with large-scale academic entities that contain rich visual, textual, and numerical information. AceMap also employs innovative visualization, quantification, and analysis methods to explore associations and logical relationships among academic entities. AceMap introduces large-scale academic network visualization techniques centered on nebular graphs, providing a comprehensive view of academic networks from multiple perspectives. In addition, AceMap proposes a unified metric based on structural entropy to quantitatively measure the knowledge content of different academic entities. Moreover, AceMap provides advanced analysis capabilities, including tracing the evolution of academic ideas through citation relationships and concept co-occurrence, and generating concise summaries informed by this evolutionary process. In addition, AceMap uses machine reading methods to generate potential new ideas at the intersection of different fields. Exploring the integration of large language models and knowledge graphs is a promising direction for future research in idea evolution. Please visit https://www.acemap.info for further exploration.

Submitted 14 April, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: Technical Report for AceMap (https://www.acemap.info)

arXiv:2403.00839 [pdf, other]

ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph

Authors: Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, Dongkuan Xu

Abstract: While achieving remarkable progress in a broad range of tasks, large language models (LLMs) remain significantly limited in properly using massive external tools. Existing in-context learning approaches simply format tools into a list of plain text descriptions and input them to LLMs, from which, LLMs generate a sequence of tool calls to solve problems step by step. Such a paradigm ignores the intrinsic dependency between tools and offloads all reasoning loads to LLMs, making them restricted to a limited number of specifically designed tools. It thus remains challenging for LLMs to operate on a library of massive tools, casting a great limitation when confronted with real-world scenarios. This paper proposes ToolNet, a plug-and-play framework that scales up the number of tools to thousands with a moderate increase in token consumption. ToolNet organizes tools into a directed graph. Each node represents a tool, and weighted edges denote tool transition. Starting from an initial tool node, an LLM navigates in the graph by iteratively choosing the next one from its successors until the task is resolved. Extensive experiments show that ToolNet can achieve impressive results in challenging multi-hop tool learning datasets and is resilient to tool failures.

Submitted 28 February, 2024; originally announced March 2024.
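
The navigation loop described in the ToolNet abstract above (tools as nodes, weighted edges as transitions, and a step-by-step choice among the current node's successors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ToolNet implementation: the tool names, the edge weights, the `navigate` function, and the greedy `pick_heaviest` chooser that stands in for the LLM's decision are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical tool graph: node -> {successor tool: transition weight}.
Graph = Dict[str, Dict[str, float]]

TOOL_GRAPH: Graph = {
    "search": {"open_page": 0.7, "calculator": 0.3},
    "open_page": {"summarize": 0.9, "search": 0.1},
    "calculator": {"finish": 1.0},
    "summarize": {"finish": 1.0},
    "finish": {},
}

def pick_heaviest(current: str, successors: Dict[str, float]) -> str:
    """Stand-in for the LLM: greedily follow the heaviest outgoing edge."""
    return max(successors, key=successors.get)

def navigate(
    graph: Graph,
    start: str,
    choose: Callable[[str, Dict[str, float]], str] = pick_heaviest,
    is_done: Callable[[str], bool] = lambda tool: tool == "finish",
    max_steps: int = 10,
) -> List[str]:
    """Walk the graph from `start`, choosing only among direct successors."""
    path = [start]
    current = start
    for _ in range(max_steps):
        # Stop when the task is resolved or there is nowhere left to go.
        if is_done(current) or not graph[current]:
            break
        current = choose(current, graph[current])
        path.append(current)
    return path

path = navigate(TOOL_GRAPH, "search")
```

The point of the structure is that each decision only sees the successors of the current node rather than the whole tool library, which is consistent with the abstract's claim of a moderate increase in token consumption as the number of tools grows. The `max_steps` cap is an added safeguard against cycles such as `open_page` back to `search`.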

arXiv:2402.18821 [pdf, other]

Debiased Novel Category Discovering and Localization

Authors: Juexiao Feng, Yuhong Yang, Yanchun Xie, Yaqian Li, Yandong Guo, Yuchen Guo, Yuwei He, Liuyu Xiang, Guiguang Ding

Abstract: In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data, while also actively discover, localize, and cluster new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, and this leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines class-agnostic Region Proposal Network (RPN) and class-aware RPN in a complementary manner. Additionally, we suggest to improve the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state-of-the-art.

Submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI 2024

arXiv:2402.15627 [pdf, other]

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao , et al. (7 additional authors not shown)

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.04154 [pdf, other]

Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction

Authors: Yonggang Jin, Ge Zhang, Hao Zhao, Tianyu Zheng, Jarvi Guo, Liuyu Xiang, Shawn Yue, Stephen W. Huang, Zhaofeng He, Jie Fu

Abstract: Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.

Submitted 5 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.13154 [pdf, other]

Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration

Authors: Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, Jia Rao, Yifan Yuan, Ren Wang

Abstract: With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses while using slow memory to accommodate data spilled from fast memory. While the existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is either present in fast memory or slow memory, but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop Nomad, a new page management mechanism for Linux that features transactional page migration and page shadowing. Nomad helps remove page migration off the critical path of program execution and makes migration completely asynchronous. Evaluations with carefully crafted micro-benchmarks and real-world applications show that Nomad is able to achieve up to 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach in Linux when under memory pressure. We also compare Nomad with a recently proposed hardware-assisted, access sampling-based page migration approach and demonstrate Nomad's strengths and potential weaknesses in various scenarios.

Submitted 17 June, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.07205 [pdf, other]

Crafter: Facial Feature Crafting against Inversion-based Identity Theft on Deep Models

Authors: Shiming Wang, Zhe Ji, Liyao Xiang, Hao Zhang, Xinbing Wang, Chenghu Zhou, Bo Li

Abstract: With the increased capabilities at the edge (e.g., mobile devices) and more stringent privacy requirements, it has become a recent trend for deep learning-enabled applications to pre-process sensitive raw data at the edge and transmit the features to the backend cloud for further processing. A typical application is to run machine learning (ML) services on facial images collected from different individuals. To prevent identity theft, conventional methods commonly rely on an adversarial game-based approach to shed the identity information from the feature. However, such methods cannot defend against adaptive attacks, in which an attacker takes a countermove against a known defence strategy. We propose Crafter, a feature crafting mechanism deployed at the edge, to protect the identity information from adaptive model inversion attacks while ensuring the ML tasks are properly carried out in the cloud. The key defence strategy is to mislead the attacker to a non-private prior from which the attacker gains little about the private identity. In this case, the crafted features act like poison training samples for attackers with adaptive model updates. Experimental results indicate that Crafter successfully defends against both basic and possible adaptive attacks, which cannot be achieved by state-of-the-art adversarial game-based methods.

Submitted 14 January, 2024; originally announced January 2024.

arXiv:2312.15742 [pdf, other]

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

Authors: Li Xiang, Junbo Yin, Wei Li, Cheng-Zhong Xu, Ruigang Yang, Jianbing Shen

Abstract: Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, which aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X

Submitted 25 December, 2023; originally announced December 2023.

Comments: AAAI 2024

arXiv:2312.08594 [pdf, other]

CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer

Authors: Sicheng Wang, Hao Jiang, Lei Xiang

Abstract: Recent deep multi-view stereo (MVS) methods have widely incorporated transformers into cascade networks for high-resolution depth estimation, achieving impressive results. However, existing transformer-based methods are constrained by their computational costs, preventing their extension to finer stages. In this paper, we propose a novel cross-scale transformer (CT) that processes feature representations at different stages without additional computation. Specifically, we introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales. This combined strategy enables our network to capture intra-image context information and enhance inter-image feature relationships. Besides, we present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction to further strengthen global and local feature awareness. Meanwhile, we design a feature metric loss (FM Loss) that evaluates the feature bias before and after transformation to reduce the impact of feature mismatch on depth estimation. Extensive experiments on the DTU dataset and the Tanks and Temples (T&T) benchmark demonstrate that our method achieves state-of-the-art results. Code is available at https://github.com/wscstrive/CT-MVSNet.

Submitted 1 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted at the 30th International Conference on Multimedia Modeling (MMM'24 Oral)

arXiv:2311.08024 [pdf, other]

MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT

Authors: Tao Song, Ruizhi Hou, Lisong Dai, Lei Xiang

Abstract: Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.07106 [pdf, other]

A Tutorial on Coding Methods for DNA-based Molecular Communications and Storage

Authors: Qiang Liu, Sirong Chen, Kang Yan, Wenfeng Wu, Kun Yang

Abstract: The exponential increase of data has motivated advances in data storage technologies. As a promising storage medium, DeoxyriboNucleic Acid (DNA) storage provides a much higher data density and superior durability compared with state-of-the-art media. In this paper, we provide a tutorial on DNA storage and its role in molecular communications. Firstly, we introduce the fundamentals of DNA-based molecular communications and storage (MCS), discussing the basic process of performing DNA storage in MCS. Furthermore, we provide tutorials on how conventional coding schemes that are used in wireless communications can be applied to DNA-based MCS, along with numerical results. Finally, promising research directions on DNA-based data storage in molecular communications are introduced and discussed.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.01122 [pdf, other]

Deep Joint Source-Channel Coding for DNA Image Storage: A Novel Approach with Enhanced Error Resilience and Biological Constraint Optimization

Authors: Wenfeng Wu, Qiang Liu, Kun Yang

Abstract: In the current era, DeoxyriboNucleic Acid (DNA) based data storage emerges as an intriguing approach, garnering substantial academic interest and investigation. This paper introduces a novel deep joint source-channel coding (DJSCC) scheme for DNA image storage, designated as DJSCC-DNA. This paradigm distinguishes itself from conventional DNA storage techniques through three key modifications: 1) it employs advanced deep learning methodologies, employing convolutional neural networks for DNA encoding and decoding processes; 2) it seamlessly integrates DNA polymerase chain reaction (PCR) amplification into the network architecture, thereby augmenting data recovery precision; and 3) it restructures the loss function by targeting biological constraints for optimization. The performance of the proposed model is demonstrated via numerical results from specific channel testing, suggesting that it surpasses conventional deep learning methodologies in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Additionally, the model effectively ensures positive constraints on both homopolymer run-length and GC content.

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.20200 [pdf, other]

Multi-Domain Polarization for Enhancing the Physical Layer Security of MIMO Systems

Authors: Yao Zeng, Jie Hu, Kun Yang, Lajos Hanzo

Abstract: A novel Physical Layer Security (PLS) framework is conceived for enhancing the security of wireless communication systems by exploiting multi-domain polarization in Multiple-Input Multiple-Output (MIMO) systems. We design a sophisticated key generation scheme based on multi-domain polarization, and the corresponding receivers. An in-depth analysis of the system's secrecy rate is provided, demonstrating the confidentiality of our approach in the presence of eavesdroppers having strong computational capabilities. More explicitly, our simulation results and theoretical analysis corroborate the advantages of the proposed scheme in terms of its bit error rate (BER), block error rate (BLER), and maximum achievable secrecy rate. Our findings indicate that the innovative PLS framework effectively enhances the security and reliability of wireless communication systems. For instance, in a $4\times4$ MIMO setup, the proposed PLS strategy exhibits an improvement of $2$ dB compared to conventional MIMO systems at a BLER of $2\cdot 10^{-5}$, while the eavesdropper's BLER reaches $1$.

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.13335 [pdf, other]

doi 10.1109/TCOMM.2023.3337257

Reconfigurable Intelligent Sensing Surface aided Wireless Powered Communication Networks: A Sensing-Then-Reflecting Approach

Authors: Cheng Luo, Jie Hu, Kun Yang

Abstract: This paper presents a reconfigurable intelligent sensing surface (RISS) that combines passive and active elements to achieve simultaneous reflection and direction of arrival (DOA) estimation tasks. By utilizing DOA information from the RISS instead of conventional channel estimation, the pilot overhead is reduced and the RISS becomes independent of the hybrid access point (HAP), enabling efficient operation. Specifically, the RISS autonomously estimates the DOA of uplink signals from single-antenna users and reflects them using the HAP's slowly varying DOA information. During downlink transmission, it updates the HAP's DOA information and designs the reflection phase of energy signals based on the latest user DOA information. The paper includes a comprehensive performance analysis, covering system design, protocol details, receiving performance, and RISS deployment suggestions. We derive a closed-form expression to analyze system performance under DOA errors, and calculate the statistical distribution of user received energy using the moment-matching technique. We provide a recommended transmit power to meet a specified outage probability and energy threshold. Numerical results demonstrate that the proposed system outperforms the conventional counterpart by 2.3 dB and 4.7 dB for Rician factors $\kappa_h=\kappa_G=1$ and $\kappa_h=\kappa_G=10$, respectively.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2309.13833 [pdf, other]

Dual Feature Augmentation Network for Generalized Zero-shot Learning

Authors: Lei Xiang, Yuan Zhou, Haoran Duan, Yang Long

Abstract: Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes. Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image. However, these methods often ignore the complex entanglement among different attributes' visual features in the embedding space. Additionally, these methods employ a direct attribute prediction scheme for classification, which does not account for the diversity of attributes in images of the same category. To address these issues, we propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules, one for visual features and the other for semantic features. The visual feature augmentation module explicitly learns attribute features and employs cosine distance to separate them, thus enhancing attribute representation. In the semantic feature augmentation module, we propose a bias learner to capture the offset that bridges the gap between actual and predicted attribute values from a dataset's perspective. Furthermore, we introduce two predictors to reconcile the conflicts between local and global features. Experimental results on three benchmarks demonstrate the marked advancement of our method compared to state-of-the-art approaches. Our code is available at https://github.com/Sion1/DFAN.

Submitted 24 September, 2023; originally announced September 2023.

Comments: Accepted to BMVC2023

arXiv:2309.00860 [pdf, other]

Towards Code Watermarking with Dual-Channel Transformations

Authors: Borui Yang, Wei Li, Liyao Xiang, Bo Li

Abstract: The expansion of the open source community and the rise of large language models have raised ethical and security concerns on the distribution of source code, such as misconduct on copyrighted code, distributions without proper licenses, or misuse of the code for malicious purposes. Hence it is important to track the ownership of source code, in which watermarking is a major technique. Yet, drastically different from natural languages, source code watermarking requires far stricter and more complicated rules to ensure the readability as well as the functionality of the source code. Hence we introduce SrcMarker, a watermarking system to unobtrusively encode ID bitstrings into source code, without affecting the usage and semantics of the code. To this end, SrcMarker performs transformations on an AST-based intermediate representation that enables unified transformations across different programming languages. The core of the system utilizes learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to tackle the inherent non-differentiability of rule selection, thus seamlessly integrating the rule-based transformations and learning-based networks into an interconnected system to enable end-to-end training. Extensive experiments demonstrate the superiority of SrcMarker over existing methods in various watermarking requirements.

Submitted 1 January, 2024; v1 submitted 2 September, 2023; originally announced September 2023.

Comments: 16 pages, accepted by IEEE S&P 2024

arXiv:2308.15143 [pdf, other]

Lifelike Agility and Play on Quadrupedal Robots using Reinforcement Learning and Generative Pre-trained Models

Authors: Lei Han, Qingxu Zhu, Jiapeng Sheng, Chong Zhang, Tingguang Li, Yizheng Zhang, He Zhang, Yuzhen Liu, Cheng Zhou, Rui Zhao, Jie Li, Yufeng Zhang, Rui Wang, Wanchao Chi, Xiong Li, Yonghui Zhu, Lingzhu Xiang, Xiao Teng, Zhengyou Zhang

Abstract: Summarizing knowledge from animals and human beings inspires robotic innovations. In this work, we propose a framework for driving legged robots to act like real animals with lifelike agility and strategy in complex environments. Inspired by large pre-trained models that have shown impressive performance in language and image understanding, we introduce the power of advanced deep generative models to produce motor control signals stimulating legged robots to act like real animals. Unlike conventional controllers and end-to-end RL methods that are task-specific, we propose to pre-train generative models over animal motion datasets to preserve expressive knowledge of animal behavior. The pre-trained model holds sufficient primitive-level knowledge yet is environment-agnostic. It is then reused for a successive stage of learning to align with the environments by traversing a number of challenging obstacles that are rarely considered in previous approaches, including creeping through narrow spaces, jumping over hurdles, freerunning over scattered blocks, etc. Finally, a task-specific controller is trained to solve complex downstream tasks by reusing the knowledge from previous stages. Enriching the knowledge regarding each stage does not affect the usage of other levels of knowledge. This flexible framework offers the possibility of continual knowledge accumulation at different levels. We successfully apply the trained multi-level controllers to the MAX robot, a quadrupedal robot developed in-house, to mimic animals, traverse complex obstacles, and play in a designed challenging multi-agent Chase Tag Game, where lifelike agility and strategy emerge on the robots. The present research pushes the frontier of robot control with new insights on reusing multi-level pre-trained knowledge and solving highly complex downstream tasks in the real world.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.06668 [pdf, other]

Large Language Models and Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

Authors: Jiajia Li, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, Zhaojian Li

Abstract: The past decade has witnessed the rapid development and adoption of ML & DL methodologies in agricultural systems, showcased by great successes in agricultural applications. However, these conventional ML/DL models have certain limitations: they heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, large pre-trained models, also known as FMs, have demonstrated remarkable successes in language, vision, and decision-making tasks across various domains. These models are trained on a large amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture AI. Thus, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, conceptual tools and technical background are presented to help the understanding of the problem space and uncover new research directions. To this end, recent FMs in the general CS domain are reviewed, and the models are categorized into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Then, the steps of developing agriculture FMs (AFMs) are outlined and potential applications in smart agriculture are discussed. Moreover, challenges and risks associated with developing AFMs are discussed, including model training, validation, and deployment. In summary, the advancement of AI in agriculture is explored by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.

Submitted 17 March, 2024; v1 submitted 12 August, 2023; originally announced August 2023.

Comments: 18 pages, 3 figures

arXiv:2307.14487 [pdf, other]

Technical note: ShinyAnimalCV: open-source cloud-based web application for object detection, segmentation, and three-dimensional visualization of animals using computer vision

Authors: ** Wang, Yu Hu, Lirong Xiang, Gota Morota, Samantha A. Brooks, Carissa L. Wickens, Emily K. Miller-Cushon, Haipeng Yu

Abstract: Computer vision (CV), a non-intrusive and cost-effective technology, has furthered the development of precision livestock farming by enabling optimized decision-making through timely and individualized animal care. The availability of affordable two- and three-dimensional camera sensors, combined with various machine learning and deep learning algorithms, has provided a valuable opportunity to improve livestock production systems. However, despite the availability of various CV tools in the public domain, applying these tools to animal data can be challenging, often requiring users to have programming and data analysis skills, as well as access to computing resources. Moreover, the rapid expansion of precision livestock farming is creating a growing need to educate and train animal science students in CV. This presents educators with the challenge of efficiently demonstrating the complex algorithms involved in CV. Thus, the objective of this study was to develop ShinyAnimalCV, an open-source cloud-based web application. This application provides a user-friendly interface for performing CV tasks, including object segmentation, detection, three-dimensional surface visualization, and extraction of two- and three-dimensional morphological features. Nine pre-trained CV models using top-view animal data are included in the application. ShinyAnimalCV has been deployed online using cloud computing platforms. The source code of ShinyAnimalCV is available on GitHub, along with detailed documentation on training CV models using custom data and deploying ShinyAnimalCV locally to allow users to fully leverage the capabilities of the application. ShinyAnimalCV can contribute to CV research and teaching in the animal science community.

Submitted 26 July, 2023; originally announced July 2023.

arXiv:2307.02988 [pdf, other]

UAV Swarms for Joint Data Ferrying and Dynamic Cell Coverage via Optimal Transport Descent and Quadratic Assignment

Authors: Kai Cui, Lars Baumgärtner, Burak Yilmaz, Mengguang Li, Christian Fabian, Benjamin Becker, Lin Xiang, Maximilian Bauer, Heinz Koeppl

Abstract: Both data ferrying with disruption-tolerant networking (DTN) and mobile cellular base stations constitute important techniques for UAV-aided communication in situations of crises where standard communication infrastructure is unavailable. For optimal use of a limited number of UAVs, we propose providing both DTN and a cellular base station on each UAV. Here, DTN is used for large amounts of low-priority data, while capacity-constrained cell coverage remains reserved for emergency calls or command and control. We optimize cell coverage via a novel optimal transport-based formulation using alternating minimization, while for data ferrying we periodically deliver data between dynamic clusters by solving quadratic assignment problems. In our evaluation, we consider different scenarios with varying mobility models and a wide range of flight patterns. Overall, we tractably achieve optimal cell coverage under quality-of-service costs with DTN-based data ferrying, enabling large-scale deployment of UAV swarms for crisis communication.

Submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted to IEEE LCN 2023 as full paper, pre-final version

arXiv:2306.16077 [pdf, other]

Secure and Fast Asynchronous Vertical Federated Learning via Cascaded Hybrid Optimization

Authors: Ganyu Wang, Qingsong Zhang, Li Xiang, Boyu Wang, Bin Gu, Charles Ling

Abstract: Vertical Federated Learning (VFL) attracts increasing attention because it empowers multiple parties to jointly train a privacy-preserving model over vertically partitioned data. Recent research has shown that applying zeroth-order optimization (ZOO) has many advantages in building a practical VFL algorithm. However, a vital problem with the ZOO-based VFL is its slow convergence rate, which limits its application in handling modern large models. To address this problem, we propose a cascaded hybrid optimization method in VFL. In this method, the downstream models (clients) are trained with ZOO to protect privacy and ensure that no internal information is shared. Meanwhile, the upstream model (server) is updated with first-order optimization (FOO) locally, which significantly improves the convergence rate, making it feasible to train the large models without compromising privacy and security. We theoretically prove that our VFL framework converges faster than the ZOO-based VFL, as the convergence of our framework is not limited by the size of the server model, making it effective for training large models with the major part on the server. Extensive experiments demonstrate that our method achieves faster convergence than the ZOO-based VFL framework, while maintaining an equivalent level of privacy protection. Moreover, we show that the convergence of our VFL is comparable to the unsafe FOO-based VFL baseline. Additionally, we demonstrate that our method makes the training of a large model feasible.

Submitted 29 June, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: Under Review

arXiv:2306.10698 [pdf, other]

Deep Reinforcement Learning with Task-Adaptive Retrieval via Hypernetwork

Authors: Yonggang **, Chenxu Wang, Tianyu Zheng, Liuyu Xiang, Yaodong Yang, Junge Zhang, Jie Fu, Zhaofeng He

Abstract: Deep reinforcement learning algorithms are usually impeded by sampling inefficiency, heavily depending on multiple interactions with the environment to acquire accurate decision-making capabilities. In contrast, humans rely on their hippocampus to retrieve relevant information from past experiences of relevant tasks, which guides their decision-making when learning a new task, rather than exclusively depending on environmental interactions. Nevertheless, designing a hippocampus-like module for an agent to incorporate past experiences into established reinforcement learning algorithms presents two challenges. The first challenge involves selecting the most relevant past experiences for the current task, and the second challenge is integrating such experiences into the decision network. To address these challenges, we propose a novel method that utilizes a retrieval network based on task-conditioned hypernetwork, which adapts the retrieval network's parameters depending on the task. At the same time, a dynamic modification mechanism enhances the collaborative efforts between the retrieval and decision networks. We evaluate the proposed method across various tasks within a multitask scenario in the Minigrid environment. The experimental results demonstrate that our proposed method significantly outperforms strong baselines.

Submitted 6 March, 2024; v1 submitted 19 June, 2023; originally announced June 2023.

arXiv:2305.13802 [pdf, other]

Online Open-set Semi-supervised Object Detection with Dual Competing Head

Authors: Zerun Wang, Ling Xiao, Liuyu Xiang, Zhaotian Weng, Toshihiko Yamasaki

Abstract: Open-set semi-supervised object detection (OSSOD) task leverages practical open-set unlabeled datasets that comprise both in-distribution (ID) and out-of-distribution (OOD) instances for conducting semi-supervised object detection (SSOD). The main challenge in OSSOD is distinguishing and filtering the OOD instances (i.e., outliers) during pseudo-labeling since OODs will affect the performance. The only OSSOD work employs an additional offline OOD detection network trained solely with labeled data to solve this problem. However, the limited labeled data restricts the potential for improvement. Meanwhile, the offline strategy results in low efficiency. To alleviate these issues, this paper proposes an end-to-end online OSSOD framework that improves performance and efficiency: 1) We propose a semi-supervised outlier filtering method that more effectively filters the OOD instances using both labeled and unlabeled data. 2) We propose a threshold-free Dual Competing OOD head that further improves the performance by suppressing the error accumulation during semi-supervised outlier filtering. 3) Our proposed method is an online end-to-end trainable OSSOD framework. Experimental results show that our method achieves state-of-the-art performance on several OSSOD benchmarks compared to existing methods. Moreover, additional experiments show that our method is more efficient and can be easily applied to different SSOD frameworks to boost their performance.

Submitted 21 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.12461 [pdf, other]

Towards Tracing Code Provenance with Code Watermarking

Authors: Wei Li, Borui Yang, Yujie Sun, Suyu Chen, Ziyun Song, Liyao Xiang, Xinbing Wang, Chenghu Zhou

Abstract: Recent advances in large language models have raised wide concern in generating abundant plausible source code without scrutiny, and thus tracing the provenance of code emerges as a critical issue. To solve the issue, we propose CodeMark, a watermarking system that hides bit strings into variables respecting the natural and operational semantics of the code. For naturalness, we novelly introduce a contextual watermarking scheme to generate watermarked variables more coherent in the context atop graph neural networks. Each variable is treated as a node on the graph and the node feature gathers neighborhood (context) information through learning. Watermarks embedded into the features are thus reflected not only by the variables but also by the local contexts. We further introduce a pre-trained model on source code as a teacher to guide more natural variable generation. Throughout the embedding, the operational semantics are preserved as only variable names are altered. Beyond guaranteeing code-specific properties, CodeMark is superior in watermarking accuracy, capacity, and efficiency due to a more diversified pattern generated. Experimental results show CodeMark outperforms the SOTA watermarking systems with a better balance of the watermarking requirements.

Submitted 21 May, 2023; originally announced May 2023.

Comments: 12 pages

MSC Class: 68T01 ACM Class: I.2.5

arXiv:2304.07735 [pdf, other]

Permutation Equivariance of Transformers and Its Applications

Authors: Hengyuan Xu, Liyao Xiang, Hangyu Ye, Dixi Yao, Pengzhi Chu, Baochun Li

Abstract: Revolutionizing the field of deep learning, Transformer-based models have achieved remarkable performance in many tasks. Recent research has recognized these models are robust to shuffling but are limited to inter-token permutation in the forward propagation. In this work, we propose our definition of permutation equivariance, a broader concept covering both inter- and intra- token permutation in the forward and backward propagation of neural networks. We rigorously proved that such permutation equivariance property can be satisfied on most vanilla Transformer-based models with almost no adaptation. We examine the property over a range of state-of-the-art models including ViT, Bert, GPT, and others, with experimental validations. Further, as a proof-of-concept, we explore how real-world applications including privacy-enhancing split learning, and model authorization, could exploit the permutation equivariance property, which implicates wider, intriguing application scenarios.

Submitted 31 March, 2024; v1 submitted 16 April, 2023; originally announced April 2023.

Comments: Accepted by CVPR 2024

arXiv:2304.01106 [pdf, ps, other]

Crossword: A Semantic Approach to Data Compression via Masking

Authors: Mingxiao Li, Rui **, Liyao Xiang, Kaiming Shen, Shuguang Cui

Abstract: The traditional methods for data compression are typically based on the symbol-level statistics, with the information source modeled as a long sequence of i.i.d. random variables or a stochastic process, thus establishing the fundamental limit as entropy for lossless compression and as mutual information for lossy compression. However, the source (including text, music, and speech) in the real world is often statistically ill-defined because of its close connection to human perception, and thus the model-driven approach can be quite suboptimal. This study places careful emphasis on English text and exploits its semantic aspect to enhance the compression efficiency further. The main idea stems from the puzzle crossword, observing that the hidden words can still be precisely reconstructed so long as some key letters are provided. The proposed masking-based strategy resembles the above game. In a nutshell, the encoder evaluates the semantic importance of each word according to the semantic loss and then masks the minor ones, while the decoder aims to recover the masked words from the semantic context by means of the Transformer. Our experiments show that the proposed semantic approach can achieve much higher compression efficiency than the traditional methods such as Huffman code and UTF-8 code, while preserving the meaning in the target text to a great extent.

Submitted 3 April, 2023; originally announced April 2023.

Comments: 6 pages, 8 figures

arXiv:2303.16038 [pdf, other]

Polar Coded Integrated Data and Energy Networking: A Deep Neural Network Assisted End-to-End Design

Authors: **gwen Cui, Jie Hu, Kun Yang, Lajos Hanzo

Abstract: Wireless sensors are everywhere. To address their energy supply, we proposed an end-to-end design for polar-coded integrated data and energy networking (IDEN), where the conventional signal processing modules, such as modulation/demodulation and channel decoding, are replaced by deep neural networks (DNNs). Moreover, the input-output relationship of an energy harvester (EH) is also modelled by a DNN. By jointly optimizing both the transmitter and the receiver as an autoencoder (AE), we minimize the bit-error-rate (BER) and maximize the harvested energy of the IDEN system, while satisfying the transmit power budget constraint determined by the normalization layer in the transmitter. Our simulation results demonstrate that the DNN aided end-to-end design conceived outperforms its conventional model-based counterpart both in terms of the harvested energy and the BER.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2303.13089 [pdf, other]

Box-Level Active Detection

Authors: Mengyao Lyu, Jundong Zhou, Hui Chen, Yijie Huang, Dongdong Yu, Yaqian Li, Yandong Guo, Yuchen Guo, Liuyu Xiang, Guiguang Ding

Abstract: Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meantime well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad.

Submitted 23 March, 2023; originally announced March 2023.

Comments: CVPR 2023 highlight

arXiv:2301.09422 [pdf, other]

doi 10.1609/aaai.v37i9.26244

HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks

Authors: **qi Xiao, Chengming Zhang, Yu Gong, Miao Yin, Yang Sui, Lizhi Xiang, Dingwen Tao, Bo Yuan

Abstract: Low-rank compression is an important model compression strategy for obtaining compact neural network models. In general, because the rank values directly determine the model complexity and model accuracy, proper selection of layer-wise rank is very critical and desired. To date, though many low-rank compression approaches, either selecting the ranks in a manual or automatic way, have been proposed, they suffer from costly manual trials or unsatisfied compression performance. In addition, all of the existing works are not designed in a hardware-aware way, limiting the practical performance of the compressed models on real-world hardware platforms. To address these challenges, in this paper we propose HALOC, a hardware-aware automatic low-rank compression framework. By interpreting automatic rank selection from an architecture search perspective, we develop an end-to-end solution to determine the suitable layer-wise ranks in a differentiable and hardware-aware way. We further propose design principles and mitigation strategy to efficiently explore the rank space and reduce the potential interference problem. Experimental results on different datasets and hardware platforms demonstrate the effectiveness of our proposed approach. On CIFAR-10 dataset, HALOC enables 0.07% and 0.38% accuracy increase over the uncompressed ResNet-20 and VGG-16 models with 72.20% and 86.44% fewer FLOPs, respectively. On ImageNet dataset, HALOC achieves 0.9% higher top-1 accuracy than the original ResNet-18 model with 66.16% fewer FLOPs. HALOC also shows 0.66% higher top-1 accuracy increase than the state-of-the-art automatic low-rank compression solution with fewer computational and memory costs. In addition, HALOC demonstrates the practical speedups on different hardware platforms, verified by the measurement results on desktop GPU, embedded GPU and ASIC accelerator.

Submitted 1 February, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: AAAI-23

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence. 37, 9 (Jun. 2023), 10464-10472

arXiv:2301.05565 [pdf, ps, other]

DINF: Dynamic Instance Noise Filter for Occluded Pedestrian Detection

Authors: Li Xiang, He Miao, Luo Haibo, Xiao Jiajie

Abstract: Occlusion issue is the biggest challenge in pedestrian detection. RCNN-based detectors extract instance features by cropping rectangle regions of interest in the feature maps. However, the visible pixels of the occluded objects are limited, making the rectangle instance feature mixed with a lot of instance-irrelevant noise information. Besides, by counting the number of instances with different degrees of overlap of CrowdHuman dataset, we find that the number of severely overlapping objects and the number of slightly overlapping objects are unbalanced, which may exacerbate the challenges posed by occlusion issues. Regarding the noise issue, from the perspective of denoising, an iterable dynamic instance noise filter (DINF) is proposed for the RCNN-based pedestrian detectors to improve the signal-noise ratio of the instance feature. Simulating the wavelet denoising process, we use the instance feature vector to generate dynamic convolutional kernels to transform the RoIs features to a domain in which the near-zero values represent the noise information. Then, soft thresholding with channel-wise adaptive thresholds is applied to convert the near-zero values to zero to filter out noise information. For the imbalance issue, we propose an IoU-Focal factor (IFF) to modulate the contributions of the well-regressed boxes and the bad-regressed boxes to the loss in the training process, paying more attention to the minority severely overlapping objects. Extensive experiments conducted on CrowdHuman and CityPersons demonstrate that our methods can help RCNN-based pedestrian detectors achieve state-of-the-art performance.

Submitted 13 January, 2023; originally announced January 2023.

Comments: 15 pages, 8 figures

arXiv:2212.02800 [pdf, other]

Life-long Learning for Multilingual Neural Machine Translation with Knowledge Distillation

Authors: Yang Zhao, Junnan Zhu, Lu Xiang, Jiajun Zhang, Yu Zhou, Feifei Zhai, Chengqing Zong

Abstract: A common scenario of Multilingual Neural Machine Translation (MNMT) is that each translation task arrives in a sequential manner, and the training data of previous tasks is unavailable. In this scenario, the current methods suffer heavily from catastrophic forgetting (CF). To alleviate the CF, we investigate knowledge distillation based life-long learning methods. Specifically, in one-to-many scenario, we propose a multilingual distillation method to make the new model (student) jointly learn multilingual output from old model (teacher) and new task. In many-to-one scenario, we find that direct distillation faces the extreme partial distillation problem, and we propose two different methods to address it: pseudo input distillation and reverse teacher distillation. The experimental results on twelve translation tasks show that the proposed methods can better consolidate the previous knowledge and sharply alleviate the CF.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.14734 [pdf, other]

doi 10.18653/v1/2022.semeval-1.152

X-PuDu at SemEval-2022 Task 7: A Replaced Token Detection Task Pre-trained Model with Pattern-aware Ensembling for Identifying Plausible Clarifications

Authors: Junyuan Shang, Shuohuan Wang, Yu Sun, Yanjun Yu, Yue Zhou, Li Xiang, Guixiu Yang

Abstract: This paper describes our winning system on SemEval 2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts. A replaced token detection pre-trained model is utilized with minorly different task-specific heads for SubTask-A: Multi-class Classification and SubTask-B: Ranking. Incorporating a pattern-aware ensemble method, our system achieves a 68.90% accuracy score and 0.8070 Spearman's rank correlation score surpassing the 2nd place with a large margin by 2.7 and 2.2 percent points for SubTask-A and SubTask-B, respectively. Our approach is simple and easy to implement, and we conducted ablation studies and qualitative and quantitative analyses for the working strategies used in our system.

Submitted 27 November, 2022; originally announced November 2022.

Comments: Accepted at the 16th International Workshop on Semantic Evaluation (SemEval-2022), NAACL

arXiv:2211.14429 [pdf, other]

Supervised Pretraining for Molecular Force Fields and Properties Prediction

Authors: Xiang Gao, Weihao Gao, Wenzhi Xiao, Zhirui Wang, Chong Wang, Liang Xiang

Abstract: Machine learning approaches have become popular for molecular modeling tasks, including molecular force fields and properties prediction. Traditional supervised learning methods suffer from scarcity of labeled data for particular tasks, motivating the use of large-scale dataset for other relevant tasks. We propose to pretrain neural networks on a dataset of 86 million molecules with atom charges and 3D geometries as inputs and molecular energies as labels. Experiments show that, compared to training from scratch, fine-tuning the pretrained model can significantly improve the performance for seven molecular property prediction tasks and two force field tasks. We also demonstrate that the learned representations from the pretrained model contain adequate information about molecular structures, by showing that linear probing of the representations can predict many molecular information including atom types, interatomic distances, class of molecular scaffolds, and existence of molecular fragments. Our results show that supervised pretraining is a promising research direction in molecular modeling.

Submitted 23 November, 2022; originally announced November 2022.

Comments: AI4Science Workshop at NeurIPS 2022

arXiv:2211.12773 [pdf, other]

Learning Regularized Positional Encoding for Molecular Prediction

Authors: Xiang Gao, Weihao Gao, Wenzhi Xiao, Zhirui Wang, Chong Wang, Liang Xiang

Abstract: Machine learning has become a promising approach for molecular modeling. Positional quantities, such as interatomic distances and bond angles, play a crucial role in molecule physics. The existing works rely on careful manual design of their representation. To model the complex nonlinearity in predicting molecular properties in a more end-to-end approach, we propose to encode the positional quantities with a learnable embedding that is continuous and differentiable. A regularization technique is employed to encourage embedding smoothness along the physical dimension. We experiment with a variety of molecular property and force field prediction tasks. Improved performance is observed for three different model architectures after plugging in the proposed positional encoding method. In addition, the learned positional encoding allows easier physics-based interpretation. We observe that tasks of similar physics have similar learned positional encodings.

Submitted 23 November, 2022; originally announced November 2022.

Comments: AI4Science Workshop at NeurIPS 2022

arXiv:2211.03715 [pdf, other]

doi 10.1145/3572848.3577478

TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Authors: Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao

Abstract: Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 2.21X speedup over cuDNN, 1.12X speedup over TVM, and 3.27X over the original models using cuDNN with at most 0.05% accuracy loss.

Submitted 4 January, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: 14 pages, 9 figures, 3 tables, accepted by PPoPP '23

arXiv:2211.00826 [pdf, ps, other]

TSAA: A Two-Stage Anchor Assignment Method towards Anchor Drift in Crowded Object Detection

Authors: Li Xiang, He Miao, Luo Haibo, Yang Huiyuan, Xiao Jiajie

Abstract: Among current anchor-based detectors, a positive anchor box is intuitively assigned to the object that overlaps it the most. The label assigned to each anchor directly determines the optimization direction of the corresponding prediction box, including the direction of box regression and category prediction. In our practice of crowded object detection, however, the results show that a positive anchor does not always regress toward the object that overlaps it the most when multiple objects overlap. We call this anchor drift. Anchor drift reflects that the anchor-object matching relation, which is determined by the degree of overlap between anchors and objects, is not always optimal. Conflicts between the fixed matching relation and the experience learned in the past training process may cause ambiguous predictions and thus raise the false-positive rate. In this paper, a simple but efficient adaptive two-stage anchor assignment (TSAA) method is proposed. It utilizes the final prediction boxes rather than the fixed anchors to calculate the overlap degree with objects to determine which object to regress for each anchor. The participation of the prediction box makes the anchor-object assignment mechanism adaptive. Extensive experiments are conducted on three classic detectors, RetinaNet, Faster-RCNN and YOLOv3, on CrowdHuman and COCO to evaluate the effectiveness of TSAA. The results show that TSAA can significantly improve the detectors' performance without additional computational costs or network structure changes.

Submitted 11 November, 2022; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: 11 pages, 8 figures
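
The matching idea described in the TSAA abstract above (choosing the regression target from the overlap of the prediction box, rather than the fixed anchor, with each object) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `iou` and `assign_by_overlap`, the `(x1, y1, x2, y2)` box format, and the toy coordinates are all assumptions made for illustration, and the paper's actual two-stage procedure may differ in its details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_by_overlap(boxes, gt_boxes):
    """For each box, return the index of the ground-truth box with the highest IoU."""
    return [
        max(range(len(gt_boxes)), key=lambda j: iou(box, gt_boxes[j]))
        for box in boxes
    ]


# Hypothetical toy data: two heavily overlapping objects.
gt_boxes = [(0, 0, 10, 10), (4, 0, 14, 10)]
anchor = (1, 0, 11, 10)      # the fixed anchor overlaps object 0 the most
prediction = (4, 0, 13, 10)  # its prediction has drifted toward object 1

fixed_match = assign_by_overlap([anchor], gt_boxes)         # anchor-based matching
adaptive_match = assign_by_overlap([prediction], gt_boxes)  # prediction-based matching
print(fixed_match, adaptive_match)
```

In this toy case the two matching rules disagree, which is the situation the abstract calls anchor drift: the fixed anchor is matched to the first object while its prediction already overlaps the second one more.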

arXiv:2210.08068 [pdf, other]

Whole-body tumor segmentation of 18F -FDG PET/CT using a cascaded and ensembled convolutional neural networks

Authors: Ludovic Sibille, Xinrui Zhan, Lei Xiang

Abstract: Background: A crucial initial processing step for quantitative PET/CT analysis is the segmentation of tumor lesions, enabling accurate feature extraction, tumor characterization, oncologic staging, and image-based therapy response assessment. Manual lesion segmentation is, however, associated with enormous effort and cost and is thus infeasible in clinical routine. Goal: The goal of this study was to report the performance of a deep neural network designed to automatically segment regions suspected of cancer in whole-body 18F-FDG PET/CT images in the context of the AutoPET challenge. Method: A cascaded approach was developed where a stacked ensemble of 3D UNET CNNs processed the PET/CT images at a fixed 6 mm resolution. A refiner network composed of residual layers enhanced the 6 mm segmentation mask to the original resolution. Results: 930 cases were used to train the model. 50% were histologically proven cancer patients and 50% were healthy controls. We obtained a Dice score of 0.68 on 84 stratified test cases. Manual and automatic Metabolic Tumor Volume (MTV) were highly correlated (R² = 0.969, slope = 0.947). Inference time was 89.7 seconds on average. Conclusion: The proposed algorithm accurately segmented regions suspicious for cancer in whole-body 18F-FDG PET/CT images.

Submitted 14 October, 2022; originally announced October 2022.
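
For readers unfamiliar with the Dice score reported in the abstract above, a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on binary masks may help. The function name `dice`, the flattened 0/1 list representation, and the toy masks are assumptions for illustration only; returning 1.0 when both masks are empty is one common convention, not necessarily the one used in the study or the AutoPET challenge.

```python
def dice(pred, truth):
    """Dice coefficient between two equally sized binary masks (flat lists of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total > 0 else 1.0


# Hypothetical toy masks: 3 predicted voxels, 3 true voxels, 2 in common.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice(pred_mask, true_mask)
print(round(score, 3))
```

The empty-mask convention matters for a cohort like the one described, where half of the cases are healthy controls with no lesion to segment.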

Showing 1–50 of 99 results for author: Xiang, L