Search | arXiv e-print repository

Referring Atomic Video Action Recognition

Authors: Kunyu Peng, Jia Fu, Kailun Yang, Di Wen, Yufan Chen, Rui** Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

Abstract: We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic acti… ▽ More We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion which amplify the most relevant information across these streams and consistently surpasses standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR

arXiv:2407.00955 [pdf, other]

Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy

Authors: Xiang Jiao, Dingzhu Wen, Guangxu Zhu, Wei Jiang, Wu Luo, Yuanming Shi

Abstract: Edge-device co-inference, which concerns the cooperation between edge devices and an edge server for completing inference tasks over wireless networks, has been a promising technique for enabling various kinds of intelligent services at the network edge, e.g., auto-driving. In this paradigm, the concerned design objective of the network shifts from the traditional communication throughput to the e… ▽ More Edge-device co-inference, which concerns the cooperation between edge devices and an edge server for completing inference tasks over wireless networks, has been a promising technique for enabling various kinds of intelligent services at the network edge, e.g., auto-driving. In this paradigm, the concerned design objective of the network shifts from the traditional communication throughput to the effective and efficient execution of the inference task underpinned by the network, measured by, e.g., the inference accuracy and latency. In this paper, a task-oriented over-the-air computation scheme is proposed for a multidevice artificial intelligence system. Particularly, a novel tractable inference accuracy metric is proposed for classification tasks, which is called minimum pair-wise discriminant gain. Unlike prior work measuring the average of all class pairs in feature space, it measures the minimum distance of all class pairs. By maximizing the minimum pair-wise discriminant gain instead of its average counterpart, any pair of classes can be better separated in the feature space, and thus leading to a balanced and improved inference accuracy for all classes. Besides, this paper jointly optimizes the minimum discriminant gain of all feature elements instead of separately maximizing that of each element in the existing designs. As a result, the transmit power can be adaptively allocated to the feature elements according to their different contributions to the inference accuracy, opening an extra degree of freedom to improve inference performance. Extensive experiments are conducted using a concrete use case of human motion recognition to verify the superiority of the proposed design over the benchmarking scheme. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: This paper was accepted by IEEE Transactions on Vehicular Technology on June 30, 2024

arXiv:2407.00592 [pdf]

Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP

Authors: Ayush Ranjan, Daniel Wen, Karthik Bhat

Abstract: Understanding the limitations and weaknesses of state-of-the-art models in artificial intelligence is crucial for their improvement and responsible application. In this research, we focus on CLIP, a model renowned for its integration of vision and language processing. Our objective is to uncover recurring problems and blind spots in CLIP's image comprehension. By delving into both the commonalitie… ▽ More Understanding the limitations and weaknesses of state-of-the-art models in artificial intelligence is crucial for their improvement and responsible application. In this research, we focus on CLIP, a model renowned for its integration of vision and language processing. Our objective is to uncover recurring problems and blind spots in CLIP's image comprehension. By delving into both the commonalities and disparities between CLIP and human image understanding, we augment our comprehension of these models' capabilities. Through our analysis, we reveal significant discrepancies in CLIP's interpretation of images compared to human perception, shedding light on areas requiring improvement. Our methodologies, the Discrepancy Analysis Framework (DAF) and the Transformative Caption Analysis for CLIP (TCAC), enable a comprehensive evaluation of CLIP's performance. We identify 14 systemic faults, including Action vs. Stillness confusion, Failure to identify the direction of movement or positioning of objects in the image, Hallucination of Water-like Features, Misattribution of Geographic Context, among others. By addressing these limitations, we lay the groundwork for the development of more accurate and nuanced image embedding models, contributing to advancements in artificial intelligence. △ Less

Submitted 30 June, 2024; originally announced July 2024.

ACM Class: F.2.2; I.2.7

arXiv:2406.16359 [pdf]

Improving Generative Adversarial Networks for Video Super-Resolution

Authors: Daniel Wen

Abstract: In this research, we explore different ways to improve generative adversarial networks for video super-resolution tasks from a base single image super-resolution GAN model. Our primary objective is to identify potential techniques that enhance these models and to analyze which of these techniques yield the most significant improvements. We evaluate our results using Peak Signal-to-Noise Ratio (PSN… ▽ More In this research, we explore different ways to improve generative adversarial networks for video super-resolution tasks from a base single image super-resolution GAN model. Our primary objective is to identify potential techniques that enhance these models and to analyze which of these techniques yield the most significant improvements. We evaluate our results using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). Our findings indicate that the most effective techniques include temporal smoothing, long short-term memory (LSTM) layers, and a temporal loss function. The integration of these methods results in an 11.97% improvement in PSNR and an 8% improvement in SSIM compared to the baseline video super-resolution generative adversarial network (GAN) model. This substantial improvement suggests potential further applications to enhance current state-of-the-art models. △ Less

Submitted 24 June, 2024; originally announced June 2024.

ACM Class: F.2.2; I.2.7

arXiv:2406.16346 [pdf]

Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

Authors: Daniel Wen, Nafisa Hussain

Abstract: Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, it is more applicable to train these models on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patt… ▽ More Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, it is more applicable to train these models on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LORA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's capabilities to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA leads to gains over the baseline Video-LLaVA by 2% on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% of Video-LLaVA's. △ Less

Submitted 24 June, 2024; originally announced June 2024.

ACM Class: F.2.2; I.2.7

arXiv:2406.16268 [pdf, other]

Efficient Antagonistic k-plex Enumeration in Signed Graphs

Authors: Lantian Xu, Rong-Hua Li, Dong Wen, Qiangqiang Dai, Guoren Wang, Lu Qin

Abstract: A signed graph is a graph where each edge receives a sign, positive or negative. The signed graph model has been used in many real applications, such as protein complex discovery and social network analysis. Finding cohesive subgraphs in signed graphs is a fundamental problem. A k-plex is a common model for cohesive subgraphs in which every vertex is adjacent to all but at most k vertices within t… ▽ More A signed graph is a graph where each edge receives a sign, positive or negative. The signed graph model has been used in many real applications, such as protein complex discovery and social network analysis. Finding cohesive subgraphs in signed graphs is a fundamental problem. A k-plex is a common model for cohesive subgraphs in which every vertex is adjacent to all but at most k vertices within the subgraph. In this paper, we propose the model of size-constrained antagonistic k-plex in a signed graph. The proposed model guarantees that the resulting subgraph is a k-plex and can be divided into two sub-k-plexes, both of which have positive inner edges and negative outer edges. This paper aims to identify all maximal antagonistic k-plexes in a signed graph. Through rigorous analysis, we show that the problem is NP-Hardness. We propose a novel framework for maximal antagonistic k-plexes utilizing set enumeration. Efficiency is improved through pivot pruning and early termination based on the color bound. Preprocessing techniques based on degree and dichromatic graphs effectively narrow the search space before enumeration. Extensive experiments on real-world datasets demonstrate our algorithm's efficiency, effectiveness, and scalability. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2405.07792 [pdf, other]

Optimal Matrix Sketching over Sliding Windows

Authors: Hanyan Yin, Dongxie Wen, Jiajun Li, Zhewei Wei, Xiao Zhang, Zengfeng Huang, Feifei Li

Abstract: Matrix sketching, aimed at approximating a matrix $\boldsymbol{A} \in \mathbb{R}^{N\times d}$ consisting of vector streams of length $N$ with a smaller sketching matrix $\boldsymbol{B} \in \mathbb{R}^{\ell\times d}, \ell \ll N$, has garnered increasing attention in fields such as large-scale data analytics and machine learning. A well-known deterministic matrix sketching method is the Frequent Dir… ▽ More Matrix sketching, aimed at approximating a matrix $\boldsymbol{A} \in \mathbb{R}^{N\times d}$ consisting of vector streams of length $N$ with a smaller sketching matrix $\boldsymbol{B} \in \mathbb{R}^{\ell\times d}, \ell \ll N$, has garnered increasing attention in fields such as large-scale data analytics and machine learning. A well-known deterministic matrix sketching method is the Frequent Directions algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound and provides a covariance error guarantee of $\varepsilon = \lVert \boldsymbol{A}^\top \boldsymbol{A} - \boldsymbol{B}^\top \boldsymbol{B} \rVert_2/\lVert \boldsymbol{A} \rVert_F^2$. The matrix sketching problem becomes particularly interesting in the context of sliding windows, where the goal is to approximate the matrix $\boldsymbol{A}_W$, formed by input vectors over the most recent $N$ time units. However, despite recent efforts, whether achieving the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound on sliding windows is possible has remained an open question. In this paper, we introduce the DS-FD algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound for matrix sketching over row-normalized, sequence-based sliding windows. We also present matching upper and lower space bounds for time-based and unnormalized sliding windows, demonstrating the generality and optimality of \dsfd across various sliding window models. This conclusively answers the open question regarding the optimal space bound for matrix sketching over sliding windows. Furthermore, we conduct extensive experiments with both synthetic and real-world datasets, validating our theoretical claims and thus confirming the correctness and effectiveness of our algorithm, both theoretically and empirically. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2404.06007 [pdf, other]

Collaborative Edge AI Inference over Cloud-RAN

Authors: Pengfei Zhang, Dingzhu Wen, Guangxu Zhu, Qimei Chen, Kaifeng Han, Yuanming Shi

Abstract: In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time noise-corrupted sensory data samples and extract the noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregatio… ▽ More In this paper, a cloud radio access network (Cloud-RAN) based collaborative edge AI inference architecture is proposed. Specifically, geographically distributed devices capture real-time noise-corrupted sensory data samples and extract the noisy local feature vectors, which are then aggregated at each remote radio head (RRH) to suppress sensing noise. To realize efficient uplink feature aggregation, we allow each RRH receives local feature vectors from all devices over the same resource blocks simultaneously by leveraging an over-the-air computation (AirComp) technique. Thereafter, these aggregated feature vectors are quantized and transmitted to a central processor (CP) for further aggregation and downstream inference tasks. Our aim in this work is to maximize the inference accuracy via a surrogate accuracy metric called discriminant gain, which measures the discernibility of different classes in the feature space. The key challenges lie on simultaneously suppressing the coupled sensing noise, AirComp distortion caused by hostile wireless channels, and the quantization error resulting from the limited capacity of fronthaul links. To address these challenges, this work proposes a joint transmit precoding, receive beamforming, and quantization error control scheme to enhance the inference accuracy. Extensive numerical experiments demonstrate the effectiveness and superiority of our proposed optimization algorithm compared to various baselines. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: This paper is accepted by IEEE Transactions on Communications on 08-Apr-2024

arXiv:2403.09975 [pdf, other]

Skeleton-Based Human Action Recognition with Noisy Labels

Authors: Yi Xu, Kunyu Peng, Di Wen, Rui** Liu, Junwei Zheng, Yufan Chen, Jiaming Zhang, Alina Roitberg, Kailun Yang, Rainer Stiefelhagen

Abstract: Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resul… ▽ More Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at https://github.com/xuyizdby/NoiseEraSAR. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: The source code will be made accessible at https://github.com/xuyizdby/NoiseEraSAR

arXiv:2403.09092 [pdf, other]

doi 10.1145/3589334.3645385

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Authors: Yupeng Li, Haorui He, ** Bai, Dacheng Wen

Abstract: The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be appli… ▽ More The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: Accepted by the ACM Web Conference 2024 (WWW 2024) oral, dataset available: https://github.com/TrustworthyComp

arXiv:2403.06529 [pdf, other]

Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis

Authors: Zijian Chen, Mei Wang, Weihong Deng, Hongzhi Shi, Dongchao Wen, Yingjie Zhang, Xingchen Cui, Jian Zhao

Abstract: 2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose. Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information. However, collecting sufficient paired RGB-D training data is expensive and time-consuming, hindering wide deployment. In this work, we first construct a diverse depth datase… ▽ More 2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose. Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information. However, collecting sufficient paired RGB-D training data is expensive and time-consuming, hindering wide deployment. In this work, we first construct a diverse depth dataset generated by 3D Morphable Models for depth model pre-training. Then, we propose a domain-independent pre-training framework that utilizes readily available pre-trained RGB and depth models to separately perform face recognition without needing additional paired data for retraining. To seamlessly integrate the two distinct networks and harness the complementary benefits of RGB and depth information for improved accuracy, we propose an innovative Adaptive Confidence Weighting (ACW). This mechanism is designed to learn confidence estimates for each modality to achieve modality fusion at the score level. Our method is simple and lightweight, only requiring ACW training beyond the backbone models. Experiments on multiple public RGB-D face recognition benchmarks demonstrate state-of-the-art performance surpassing previous methods based on depth estimation and feature fusion, validating the efficacy of our approach. △ Less

Submitted 16 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 9 pages, 5 figures

arXiv:2403.02545 [pdf, other]

Wukong: Towards a Scaling Law for Large-Scale Recommendation

Authors: Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, Wenlin Chen

Abstract: Scaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit such laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets.… ▽ More Scaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit such laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order of interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example, where prior arts fall short. △ Less

Submitted 4 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: 12 pages

arXiv:2403.00877 [pdf, other]

Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation

Authors: Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Dheevatsa Mudigere, Maxim Naumov

Abstract: We study a mismatch between the deep learning recommendation models' flat architecture, common distributed training paradigm and hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global… ▽ More We study a mismatch between the deep learning recommendation models' flat architecture, common distributed training paradigm and hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower to reduce model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactions and load balanced assignments to preserve model quality and training throughput via learned embeddings. We show that DMT can achieve up to 1.9x speedup compared to the state-of-the-art baselines without losing accuracy across multiple generations of hardware at large data center scales. △ Less

Submitted 2 May, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

arXiv:2401.07496 [pdf, other]

Low-Rank Gradient Compression with Error Feedback for MIMO Wireless Federated Learning

Authors: Mingzhao Guo, Dongzhu Liu, Osvaldo Simeone, Dingzhu Wen

Abstract: This paper presents a novel approach to enhance the communication efficiency of federated learning (FL) in multiple input and multiple output (MIMO) wireless systems. The proposed method centers on a low-rank matrix factorization strategy for local gradient compression based on alternating least squares, along with over-the-air computation and error feedback. The proposed protocol, termed over-the… ▽ More This paper presents a novel approach to enhance the communication efficiency of federated learning (FL) in multiple input and multiple output (MIMO) wireless systems. The proposed method centers on a low-rank matrix factorization strategy for local gradient compression based on alternating least squares, along with over-the-air computation and error feedback. The proposed protocol, termed over-the-air low-rank compression (Ota-LC), is demonstrated to have lower computation cost and lower communication overhead as compared to existing benchmarks while guaranteeing the same inference performance. As an example, when targeting a test accuracy of 80% on the Cifar-10 dataset, Ota-LC achieves a reduction in total communication costs of at least 30% when contrasted with benchmark schemes, while also reducing the computational complexity order by a factor equal to the sum of the dimension of the gradients. △ Less

Submitted 15 January, 2024; originally announced January 2024.

Comments: 5 pages, 3 figures, 27 references, submitted

arXiv:2401.01575 [pdf, other]

Enhancing Generalization of Invisible Facial Privacy Cloak via Gradient Accumulation

Authors: Xuannan Liu, Yaoyao Zhong, Weihong Deng, Hongzhi Shi, Xingchen Cui, Yunfeng Yin, Dongchao Wen

Abstract: The blooming of social media and face recognition (FR) systems has increased people's concern about privacy and security. A new type of adversarial privacy cloak (class-universal) can be applied to all the images of regular users, to prevent malicious FR systems from acquiring their identity information. In this work, we discover the optimization dilemma in the existing methods -- the local optima… ▽ More The blooming of social media and face recognition (FR) systems has increased people's concern about privacy and security. A new type of adversarial privacy cloak (class-universal) can be applied to all the images of regular users, to prevent malicious FR systems from acquiring their identity information. In this work, we discover the optimization dilemma in the existing methods -- the local optima problem in large-batch optimization and the gradient information elimination problem in small-batch optimization. To solve these problems, we propose Gradient Accumulation (GA) to aggregate multiple small-batch gradients into a one-step iterative gradient to enhance the gradient stability and reduce the usage of quantization operations. Experiments show that our proposed method achieves high performance on the Privacy-Commons dataset against black-box face recognition models. △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2310.12937 [pdf, other]

End-to-End Delay Minimization based on Joint Optimization of DNN Partitioning and Resource Allocation for Cooperative Edge Inference

Authors: Xinrui Ye, Yanzan Sun, Dingzhu Wen, Guan** Pan, Shunqing Zhang

Abstract: Cooperative inference in Mobile Edge Computing (MEC), achieved by deploying partitioned Deep Neural Network (DNN) models between resource-constrained user equipments (UEs) and edge servers (ESs), has emerged as a promising paradigm. Firstly, we consider scenarios of continuous Artificial Intelligence (AI) task arrivals, like the object detection for video streams, and utilize a serial queuing mode… ▽ More Cooperative inference in Mobile Edge Computing (MEC), achieved by deploying partitioned Deep Neural Network (DNN) models between resource-constrained user equipments (UEs) and edge servers (ESs), has emerged as a promising paradigm. Firstly, we consider scenarios of continuous Artificial Intelligence (AI) task arrivals, like the object detection for video streams, and utilize a serial queuing model for the accurate evaluation of End-to-End (E2E) delay in cooperative edge inference. Secondly, to enhance the long-term performance of inference systems, we formulate a multi-slot stochastic E2E delay optimization problem that jointly considers model partitioning and multi-dimensional resource allocation. Finally, to solve this problem, we introduce a Lyapunov-guided Multi-Dimensional Optimization algorithm (LyMDO) that decouples the original problem into per-slot deterministic problems, where Deep Reinforcement Learning (DRL) and convex optimization are used for joint optimization of partitioning decisions and complementary resource allocation. Simulation results show that our approach effectively improves E2E delay while balancing long-term resource constraints. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: 7 pages, 9 figures, 1 table, 1 algorithm, to be published in IEEE 98th Vehicular Technology Conference (VTC2023-Fall)

arXiv:2308.11312 [pdf, other]

Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for network

Authors: Dong Wen, Tao Li, Chenglong Li, Pengye Xia, Hui Yang, Zhigang Sun

Abstract: Deep learning (DL) for network models have achieved excellent performance in the field and are becoming a promising component in future intelligent network system. Programmable in-network computing device has great potential to deploy DL for network models, however, existing device cannot afford to run a DL model. The main challenges of data-plane supporting DL-based network models lie in computin… ▽ More Deep learning (DL) for network models have achieved excellent performance in the field and are becoming a promising component in future intelligent network system. Programmable in-network computing device has great potential to deploy DL for network models, however, existing device cannot afford to run a DL model. The main challenges of data-plane supporting DL-based network models lie in computing power, task granularity, model generality and feature extracting. To address above problems, we propose Octopus: a heterogeneous in-network computing accelerator enabling DL for network models. A feature extractor is designed for fast and efficient feature extracting. Vector accelerator and systolic array work in a heterogeneous collaborative way, offering low-latency-highthroughput general computing ability for packet-and-flow-based tasks. Octopus also contains on-chip memory fabric for storage and connecting, and Risc-V core for global controlling. The proposed Octopus accelerator design is implemented on FPGA. Functionality and performance of Octopus are validated in several use-cases, achieving performance of 31Mpkt/s feature extracting, 207ns packet-based computing latency, and 90kflow/s flow-based computing throughput. △ Less

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2308.06503 [pdf, other]

Integrated Sensing-Communication-Computation for Over-the-Air Edge AI Inference

Authors: Zeming Zhuang, Dingzhu Wen, Yuanming Shi, Guangxu Zhu, Sheng Wu, Dusit Niyato

Abstract: Edge-device co-inference refers to deploying well-trained artificial intelligent (AI) models at the network edge under the cooperation of devices and edge servers for providing ambient intelligent services. For enhancing the utilization of limited network resources in edge-device co-inference tasks from a systematic view, we propose a task-oriented scheme of integrated sensing, computation and com… ▽ More Edge-device co-inference refers to deploying well-trained artificial intelligent (AI) models at the network edge under the cooperation of devices and edge servers for providing ambient intelligent services. For enhancing the utilization of limited network resources in edge-device co-inference tasks from a systematic view, we propose a task-oriented scheme of integrated sensing, computation and communication (ISCC) in this work. In this system, all devices sense a target from the same wide view to obtain homogeneous noise-corrupted sensory data, from which the local feature vectors are extracted. All local feature vectors are aggregated at the server using over-the-air computation (AirComp) in a broadband channel with the orthogonal-frequency-division-multiplexing technique for suppressing the sensing and channel noise. The aggregated denoised global feature vector is further input to a server-side AI model for completing the downstream inference task. A novel task-oriented design criterion, called maximum minimum pair-wise discriminant gain, is adopted for classification tasks. It extends the distance of the closest class pair in the feature space, leading to a balanced and enhanced inference accuracy. Under this criterion, a problem of joint sensing power assignment, transmit precoding and receive beamforming is formulated. The challenge lies in three aspects: the coupling between sensing and AirComp, the joint optimization of all feature dimensions' AirComp aggregation over a broadband channel, and the complicated form of the maximum minimum pair-wise discriminant gain. To solve this problem, a task-oriented ISCC scheme with AirComp is proposed. Experiments based on a human motion recognition task are conducted to verify the advantages of the proposed scheme over the existing scheme and a baseline. △ Less

Submitted 12 August, 2023; originally announced August 2023.

Comments: This work was accepted by IEEE Transactions on Wireless Communications on Aug. 12, 2023

arXiv:2306.01162 [pdf, other]

Integrated Sensing-Communication-Computation for Edge Artificial Intelligence

Authors: Dingzhu Wen, Xiaoyang Li, Yong Zhou, Yuanming Shi, Sheng Wu, Chunxiao Jiang

Abstract: Edge artificial intelligence (AI) has been a promising solution towards 6G to empower a series of advanced techniques such as digital twins, holographic projection, semantic communications, and auto-driving, for achieving intelligence of everything. The performance of edge AI tasks, including edge learning and edge AI inference, depends on the quality of three highly coupled processes, i.e., sensi… ▽ More Edge artificial intelligence (AI) has been a promising solution towards 6G to empower a series of advanced techniques such as digital twins, holographic projection, semantic communications, and auto-driving, for achieving intelligence of everything. The performance of edge AI tasks, including edge learning and edge AI inference, depends on the quality of three highly coupled processes, i.e., sensing for data acquisition, computation for information extraction, and communication for information transmission. However, these three modules need to compete for network resources for enhancing their own quality-of-services. To this end, integrated sensing-communication-computation (ISCC) is of paramount significance for improving resource utilization as well as achieving the customized goals of edge AI tasks. By investigating the interplay among the three modules, this article presents various kinds of ISCC schemes for federated edge learning tasks and edge AI inference tasks in both application and physical layers. △ Less

Submitted 18 April, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: This paper was accepted by IEEE Internet of Things Magazine on April-18-2024

arXiv:2305.08420 [pdf, other]

Exploring Few-Shot Adaptation for Activity Recognition on Diverse Domains

Authors: Kunyu Peng, Di Wen, David Schneider, Jiaming Zhang, Kailun Yang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

Abstract: Domain adaptation is essential for activity recognition to ensure accurate and robust performance across diverse environments, sensor types, and data sources. Unsupervised domain adaptation methods have been extensively studied, yet, they require large-scale unlabeled data from the target domain. In this work, we focus on Few-Shot Domain Adaptation for Activity Recognition (FSDA-AR), which leverag… ▽ More Domain adaptation is essential for activity recognition to ensure accurate and robust performance across diverse environments, sensor types, and data sources. Unsupervised domain adaptation methods have been extensively studied, yet, they require large-scale unlabeled data from the target domain. In this work, we focus on Few-Shot Domain Adaptation for Activity Recognition (FSDA-AR), which leverages a very small amount of labeled target videos to achieve effective adaptation. This approach is appealing for applications because it only needs a few or even one labeled example per class in the target domain, ideal for recognizing rare but critical activities. However, the existing FSDA-AR works mostly focus on the domain adaptation on sports videos, where the domain diversity is limited. We propose a new FSDA-AR benchmark using five established datasets considering the adaptation on more diverse and challenging domains. Our results demonstrate that FSDA-AR performs comparably to unsupervised domain adaptation with significantly fewer labeled target domain samples. We further propose a novel approach, RelaMiX, to better leverage the few labeled target domain samples as knowledge guidance. RelaMiX encompasses a temporal relational attention network with relation dropout, alongside a cross-domain information alignment mechanism. Furthermore, it integrates a mechanism for mixing features within a latent space by using the few-shot target domain samples. The proposed RelaMiX solution achieves state-of-the-art performance on all datasets within the FSDA-AR benchmark. To encourage future research of few-shot domain adaptation for activity recognition, our code will be publicly available at https://github.com/KPeng9510/RelaMiX. △ Less

Submitted 27 April, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: The benchmark and source code will be publicly available at https://github.com/KPeng9510/RelaMiX

arXiv:2304.12212 [pdf, other]

AeonG: An Efficient Built-in Temporal Support in Graph Databases

Authors: Jiamin Hou, Zhanhao Zhao, Zhouyu Wang, Wei Lu, Guodong **, Dong Wen, Xiaoyong Du

Abstract: Real world graphs are often dynamic and evolve over time. It is crucial for storing and querying graph evolution in graph databases. However, existing works either suffer from high storage overhead or lack efficient temporal query support, or both. In this paper, we propose AeonG, a new graph database with built-in temporal support. AeonG is based on a novel temporal graph model. To fit this model… ▽ More Real world graphs are often dynamic and evolve over time. It is crucial for storing and querying graph evolution in graph databases. However, existing works either suffer from high storage overhead or lack efficient temporal query support, or both. In this paper, we propose AeonG, a new graph database with built-in temporal support. AeonG is based on a novel temporal graph model. To fit this model, we design a storage engine and a query engine. Our storage engine is hybrid, with one current storage to manage the most recent versions of graph objects, and another historical storage to manage the previous versions of graph objects. This separation makes the performance degradation of querying the most recent graph object versions as slight as possible. To reduce the historical storage overhead, we propose a novel anchor+delta strategy, in which we periodically create a complete version (namely anchor) of a graph object, and maintain every change (namely delta) between two adjacent anchors of the same object. To boost temporal query processing, we propose an anchor-based version retrieval technique in the query engine to skip unnecessary historical version traversals. Extensive experiments are conducted on both real and synthetic datasets. The results show that AeonG achieves up to 5.73X lower storage consumption and 2.57X lower temporal query latency against state-of-the-art approaches, while introducing only 9.74% performance degradation for supporting temporal features △ Less

Submitted 1 April, 2024; v1 submitted 24 April, 2023; originally announced April 2023.

Comments: VLDB 2024

arXiv:2304.02284 [pdf, other]

Gradient Attention Balance Network: Mitigating Face Recognition Racial Bias via Gradient Attention

Authors: Linzhi Huang, Mei Wang, Jiahao Liang, Weihong Deng, Hongzhi Shi, Dongchao Wen, Yingjie Zhang, Jian Zhao

Abstract: Although face recognition has made impressive progress in recent years, we ignore the racial bias of the recognition system when we pursue a high level of accuracy. Previous work found that for different races, face recognition networks focus on different facial regions, and the sensitive regions of darker-skinned people are much smaller. Based on this discovery, we propose a new de-bias method ba… ▽ More Although face recognition has made impressive progress in recent years, we ignore the racial bias of the recognition system when we pursue a high level of accuracy. Previous work found that for different races, face recognition networks focus on different facial regions, and the sensitive regions of darker-skinned people are much smaller. Based on this discovery, we propose a new de-bias method based on gradient attention, called Gradient Attention Balance Network (GABN). Specifically, we use the gradient attention map (GAM) of the face recognition network to track the sensitive facial regions and make the GAMs of different races tend to be consistent through adversarial learning. This method mitigates the bias by making the network focus on similar facial regions. In addition, we also use masks to erase the Top-N sensitive facial regions, forcing the network to allocate its attention to a larger facial region. This method expands the sensitive region of darker-skinned people and further reduces the gap between GAM of darker-skinned people and GAM of Caucasians. Extensive experiments show that GABN successfully mitigates racial bias in face recognition and learns more balanced performance for people of different races. △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: Accepted by CVPR 2023 workshop

arXiv:2303.14646 [pdf, other]

A Survey of Machine Learning-Based Ride-Hailing Planning

Authors: Dacheng Wen, Yupeng Li, Francis C. M. Lau

Abstract: Ride-hailing is a sustainable transportation paradigm where riders access door-to-door traveling services through a mobile phone application, which has attracted a colossal amount of usage. There are two major planning tasks in a ride-hailing system: (1) matching, i.e., assigning available vehicles to pick up the riders, and (2) repositioning, i.e., proactively relocating vehicles to certain locat… ▽ More Ride-hailing is a sustainable transportation paradigm where riders access door-to-door traveling services through a mobile phone application, which has attracted a colossal amount of usage. There are two major planning tasks in a ride-hailing system: (1) matching, i.e., assigning available vehicles to pick up the riders, and (2) repositioning, i.e., proactively relocating vehicles to certain locations to balance the supply and demand of ride-hailing services. Recently, many studies of ride-hailing planning that leverage machine learning techniques have emerged. In this article, we present a comprehensive overview on latest developments of machine learning-based ride-hailing planning. To offer a clear and structured review, we introduce a taxonomy into which we carefully fit the different categories of related works according to the types of their planning tasks and solution schemes, which include collective matching, distributed matching, collective repositioning, distributed repositioning, and joint matching and repositioning. We further shed light on many real-world datasets and simulators that are indispensable for empirical studies on machine learning-based ride-hailing planning strategies. At last, we propose several promising research directions for this rapidly growing research and practical field. △ Less

Submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.10920 [pdf, other]

Task-Oriented Communications for 6G: Vision, Principles, and Technologies

Authors: Yuanming Shi, Yong Zhou, Dingzhu Wen, Youlong Wu, Chunxiao Jiang, Khaled B. Letaief

Abstract: Driven by the interplay among artificial intelligence, digital twin, and wireless networks, 6G is envisaged to go beyond data-centric services to provide intelligent and immersive experiences. To efficiently support intelligent tasks with customized service requirements, it becomes critical to develop novel information compression and transmission technologies, which typically involve coupled sens… ▽ More Driven by the interplay among artificial intelligence, digital twin, and wireless networks, 6G is envisaged to go beyond data-centric services to provide intelligent and immersive experiences. To efficiently support intelligent tasks with customized service requirements, it becomes critical to develop novel information compression and transmission technologies, which typically involve coupled sensing, communication, and computation processes. To this end, task-oriented communication stands out as a disruptive technology for 6G system design by exploiting the task-specific information structures and folding the communication goals into the design of task-level transmission strategies. In this article, by develo** task-oriented information extraction and network resource orchestration strategies, we demonstrate the effectiveness of task-oriented communication principles for typical intelligent tasks, including federated learning, edge inference, and semantic communication. △ Less

Submitted 22 March, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: This paper has been accepted by IEEE Wireless Communications, 2023

arXiv:2302.05621 [pdf, other]

Dive into the Resolution Augmentations and Metrics in Low Resolution Face Recognition: A Plain yet Effective New Baseline

Authors: Xu Ling, Yichen Lu, Wenqi Xu, Weihong Deng, Yingjie Zhang, Xingchen Cui, Hongzhi Shi, Dongchao Wen

Abstract: Although deep learning has significantly improved Face Recognition (FR), dramatic performance deterioration may occur when processing Low Resolution (LR) faces. To alleviate this, approaches based on unified feature space are proposed with the sacrifice under High Resolution (HR) circumstances. To deal with the huge domain gap between HR and LR domains and achieve the best on both domains, we firs… ▽ More Although deep learning has significantly improved Face Recognition (FR), dramatic performance deterioration may occur when processing Low Resolution (LR) faces. To alleviate this, approaches based on unified feature space are proposed with the sacrifice under High Resolution (HR) circumstances. To deal with the huge domain gap between HR and LR domains and achieve the best on both domains, we first took a closer look at the impacts of several resolution augmentations and then analyzed the difficulty of LR samples from the perspective of the model gradient produced by different resolution samples. Besides, we also find that the introduction of some resolutions could help the learning of lower resolutions. Based on these, we divide the LR samples into three difficulties according to the resolution and propose a more effective Multi-Resolution Augmentation. Then, due to the rapidly increasing domain gap as the resolution decreases, we carefully design a novel and effective metric loss based on a LogExp distance function that provides decent gradients to prevent oscillation near the convergence point or tolerance to small distance errors; it could also dynamically adjust the penalty for errors in different dimensions, allowing for more optimization of dimensions with large errors. Combining these two insights, our model could learn more general knowledge in a wide resolution range of images and balanced results can be achieved by our extremely simple framework. Moreover, the augmentations and metrics are the cornerstones of LRFR, so our method could be considered a new baseline for the LRFR task. Experiments on the LRFR datasets: SCface, XQLFW, and large-scale LRFR dataset: TinyFace demonstrate the effectiveness of our methods, while the degradation on HRFR datasets is significantly reduced. △ Less

Submitted 11 February, 2023; originally announced February 2023.

Comments: AAAI 2023 R2HCAI Workshop

arXiv:2212.01054 [pdf, other]

Model and Data Agreement for Learning with Noisy Labels

Authors: Yuhang Zhang, Weihong Deng, Xingchen Cui, Yunfeng Yin, Hongzhi Shi, Dongchao Wen

Abstract: Learning with noisy labels is a vital topic for practical deep learning as models should be robust to noisy open-world datasets in the wild. The state-of-the-art noisy label learning approach JoCoR fails when faced with a large ratio of noisy labels. Moreover, selecting small-loss samples can also cause error accumulation as once the noisy samples are mistakenly selected as small-loss samples, the… ▽ More Learning with noisy labels is a vital topic for practical deep learning as models should be robust to noisy open-world datasets in the wild. The state-of-the-art noisy label learning approach JoCoR fails when faced with a large ratio of noisy labels. Moreover, selecting small-loss samples can also cause error accumulation as once the noisy samples are mistakenly selected as small-loss samples, they are more likely to be selected again. In this paper, we try to deal with error accumulation in noisy label learning from both model and data perspectives. We introduce mean point ensemble to utilize a more robust loss function and more information from unselected samples to reduce error accumulation from the model perspective. Furthermore, as the flip images have the same semantic meaning as the original images, we select small-loss samples according to the loss values of flip images instead of the original ones to reduce error accumulation from the data perspective. Extensive experiments on CIFAR-10, CIFAR-100, and large-scale Clothing1M show that our method outperforms state-of-the-art noisy label learning methods with different levels of label noise. Our method can also be seamlessly combined with other noisy label learning methods to further improve their performance and generalize well to other tasks. The code is available in https://github.com/zyh-uaiaaaa/MDA-noisy-label-learning. △ Less

Submitted 24 December, 2022; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: Accepted by AAAI2023 Workshop

arXiv:2211.01255 [pdf, other]

Task-Oriented Over-the-Air Computation for Multi-Device Edge AI

Authors: Dingzhu Wen, Xiang Jiao, Peixi Liu, Guangxu Zhu, Yuanming Shi, Kaibin Huang

Abstract: Departing from the classic paradigm of data-centric designs, the 6G networks for supporting edge AI features task-oriented techniques that focus on effective and efficient execution of AI task. Targeting end-to-end system performance, such techniques are sophisticated as they aim to seamlessly integrate sensing (data acquisition), communication (data transmission), and computation (data processing… ▽ More Departing from the classic paradigm of data-centric designs, the 6G networks for supporting edge AI features task-oriented techniques that focus on effective and efficient execution of AI task. Targeting end-to-end system performance, such techniques are sophisticated as they aim to seamlessly integrate sensing (data acquisition), communication (data transmission), and computation (data processing). Aligned with the paradigm shift, a task-oriented over-the-air computation (AirComp) scheme is proposed in this paper for multi-device split-inference system. In the considered system, local feature vectors, which are extracted from the real-time noisy sensory data on devices, are aggregated over-the-air by exploiting the waveform superposition in a multiuser channel. Then the aggregated features as received at a server are fed into an inference model with the result used for decision making or control of actuators. To design inference-oriented AirComp, the transmit precoders at edge devices and receive beamforming at edge server are jointly optimized to rein in the aggregation error and maximize the inference accuracy. The problem is made tractable by measuring the inference accuracy using a surrogate metric called discriminant gain, which measures the discernibility of two object classes in the application of object/event classification. It is discovered that the conventional AirComp beamforming design for minimizing the mean square error in generic AirComp with respect to the noiseless case may not lead to the optimal classification accuracy. The reason is due to the overlooking of the fact that feature dimensions have different sensitivity towards aggregation errors and are thus of different importance levels for classification. This issue is addressed in this work via a new task-oriented AirComp scheme designed by directly maximizing the derived discriminant gain. △ Less

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2207.00969 [pdf, other]

Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI

Authors: Dingzhu Wen, Peixi Liu, Guangxu Zhu, Yuanming Shi, Jie Xu, Yonina C. Eldar, Shuguang Cui

Abstract: This paper studies a new multi-device edge artificial-intelligent (AI) system, which jointly exploits the AI model split inference and integrated sensing and communication (ISAC) to enable low-latency intelligent services at the network edge. In this system, multiple ISAC devices perform radar sensing to obtain multi-view data, and then offload the quantized version of extracted features to a cent… ▽ More This paper studies a new multi-device edge artificial-intelligent (AI) system, which jointly exploits the AI model split inference and integrated sensing and communication (ISAC) to enable low-latency intelligent services at the network edge. In this system, multiple ISAC devices perform radar sensing to obtain multi-view data, and then offload the quantized version of extracted features to a centralized edge server, which conducts model inference based on the cascaded feature vectors. Under this setup and by considering classification tasks, we measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain, which is defined as the distance of two classes in the Euclidean feature space under normalized covariance. To maximize the discriminant gain, we first quantify the influence of the sensing, computation, and communication processes on it with a derived closed-form expression. Then, an end-to-end task-oriented resource management approach is developed by integrating the three processes into a joint design. This integrated sensing, computation, and communication (ISCC) design approach, however, leads to a challenging non-convex optimization problem, due to the complicated form of discriminant gain and the device heterogeneity in terms of channel gain, quantization level, and generated feature subsets. Remarkably, the considered non-convex problem can be optimally solved based on the sum-of-ratios method. This gives the optimal ISCC scheme, that jointly determines the transmit power and time allocation at multiple devices for sensing and communication, as well as their quantization bits allocation for computation distortion control. By using human motions recognition as a concrete AI inference task, extensive experiments are conducted to verify the performance of our derived optimal ISCC scheme. △ Less

Submitted 3 July, 2022; originally announced July 2022.

arXiv:2205.12010 [pdf, other]

doi 10.1109/TIP.2020.3048632

SFace: Sigmoid-Constrained Hypersphere Loss for Robust Face Recognition

Authors: Yaoyao Zhong, Weihong Deng, Jiani Hu, Dongyue Zhao, Xian Li, Dongchao Wen

Abstract: Deep face recognition has achieved great success due to large-scale training databases and rapidly develo** loss functions. The existing algorithms devote to realizing an ideal idea: minimizing the intra-class distance and maximizing the inter-class distance. However, they may neglect that there are also low quality training images which should not be optimized in this strict way. Considering th… ▽ More Deep face recognition has achieved great success due to large-scale training databases and rapidly develo** loss functions. The existing algorithms devote to realizing an ideal idea: minimizing the intra-class distance and maximizing the inter-class distance. However, they may neglect that there are also low quality training images which should not be optimized in this strict way. Considering the imperfection of training databases, we propose that intra-class and inter-class objectives can be optimized in a moderate way to mitigate overfitting problem, and further propose a novel loss function, named sigmoid-constrained hypersphere loss (SFace). Specifically, SFace imposes intra-class and inter-class constraints on a hypersphere manifold, which are controlled by two sigmoid gradient re-scale functions respectively. The sigmoid curves precisely re-scale the intra-class and inter-class gradients so that training samples can be optimized to some degree. Therefore, SFace can make a better balance between decreasing the intra-class distances for clean examples and preventing overfitting to the label noise, and contributes more robust deep face recognition models. Extensive experiments of models trained on CASIA-WebFace, VGGFace2, and MS-Celeb-1M databases, and evaluated on several face recognition benchmarks, such as LFW, MegaFace and IJB-C databases, have demonstrated the superiority of SFace. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: 12 pages, 9 figures

Journal ref: IEEE Transactions on Image Processing, 2021

arXiv:2110.07449 [pdf, other]

zk-Fabric, a Polylithic Syntax Zero Knowledge Joint Proof System

Authors: Sheng Sun, Dr. Tong Wen

Abstract: In this paper, we create a single-use and full syntax zero-knowledge proof system, a.k.a zk-Fabric. Comparing with zk-SNARKS and another variant zero-knowledge proofing system, zkBOO and it's variant zkBOO++. We present multiple new approaches on how to use partitioned garbled circuits to achieve a joint zero-knowledge proof system, with the benefits of less overhead and full syntax verification.… ▽ More In this paper, we create a single-use and full syntax zero-knowledge proof system, a.k.a zk-Fabric. Comparing with zk-SNARKS and another variant zero-knowledge proofing system, zkBOO and it's variant zkBOO++. We present multiple new approaches on how to use partitioned garbled circuits to achieve a joint zero-knowledge proof system, with the benefits of less overhead and full syntax verification. zk-Fabric based on partitioned garbled circuits has the advantage of being versatile and single-use, meaning it can be applied to arbitrary circuits with more comprehensive statements, and it can achieve the non-interactivity among all participants. One of the protocols proposed within is used for creating a new kind of partitioned garbled circuits to match the comprehensive Boolean logical expression with multiple variables, we use the term "polythitic syntax" to refer to the context-based multiple variables in a comprehensive statement. We also designed a joint zero knowledge proof protocol that uses partitioned garbled circuits △ Less

Submitted 14 October, 2021; originally announced October 2021.

Comments: 6 pages, 5 figures

arXiv:2110.00196 [pdf, other]

What is Semantic Communication? A View on Conveying Meaning in the Era of Machine Intelligence

Authors: Qiao Lan, Dingzhu Wen, Zezhong Zhang, Qunsong Zeng, Xu Chen, Petar Popovski, Kaibin Huang

Abstract: In 1940s, Claude Shannon developed the information theory focusing on quantifying the maximum data rate that can be supported by a communication channel. Guided by this, the main theme of wireless system design up until 5G was the data rate maximization. In his theory, the semantic aspect and meaning of messages were treated as largely irrelevant to communication. The classic theory started to rev… ▽ More In 1940s, Claude Shannon developed the information theory focusing on quantifying the maximum data rate that can be supported by a communication channel. Guided by this, the main theme of wireless system design up until 5G was the data rate maximization. In his theory, the semantic aspect and meaning of messages were treated as largely irrelevant to communication. The classic theory started to reveal its limitations in the modern era of machine intelligence, consisting of the synergy between IoT and AI. By broadening the scope of the classic framework, in this article we present a view of semantic communication (SemCom) and conveying meaning through the communication systems. We address three communication modalities, human-to-human (H2H), human-to-machine (H2M), and machine-to-machine (M2M) communications. The latter two, the main theme of the article, represent the paradigm shift in communication and computing. H2M SemCom refers to semantic techniques for conveying meanings understandable by both humans and machines so that they can interact. M2M SemCom refers to effectiveness techniques for efficiently connecting machines such that they can effectively execute a specific computation task in a wireless network. The first part of the article introduces SemCom principles including encoding, system architecture, and layer-coupling and end-to-end design approaches. The second part focuses on specific techniques for application areas of H2M (human and AI symbiosis, recommendation, etc.) and M2M SemCom (distributed learning, split inference, etc.) Finally, we discuss the knowledge graphs approach for designing SemCom systems. We believe that this comprehensive introduction will provide a useful guide into the emerging area of SemCom that is expected to play an important role in 6G featuring connected intelligence and integrated sensing, computing, communication, and control. △ Less

Submitted 30 September, 2021; originally announced October 2021.

Comments: This is an invited paper for Journal of Communications and Information Networks

arXiv:2109.15258 [pdf, other]

Federated Dropout -- A Simple Approach for Enabling Federated Learning on Resource Constrained Devices

Authors: Dingzhu Wen, Ki-Jun Jeon, Kaibin Huang

Abstract: Federated learning (FL) is a popular framework for training an AI model using distributed mobile data in a wireless network. It features data parallelism by distributing the learning task to multiple edge devices while attempting to preserve their local-data privacy. One main challenge confronting practical FL is that resource constrained devices struggle with the computation intensive task of upd… ▽ More Federated learning (FL) is a popular framework for training an AI model using distributed mobile data in a wireless network. It features data parallelism by distributing the learning task to multiple edge devices while attempting to preserve their local-data privacy. One main challenge confronting practical FL is that resource constrained devices struggle with the computation intensive task of updating of a deep-neural network model. To tackle the challenge, in this paper, a federated dropout (FedDrop) scheme is proposed building on the classic dropout scheme for random model pruning. Specifically, in each iteration of the FL algorithm, several subnets are independently generated from the global model at the server using dropout but with heterogeneous dropout rates (i.e., parameter-pruning probabilities),each of which is adapted to the state of an assigned channel. The subnets are downloaded to associated devices for updating. Thereby, FedDrop reduces both the communication overhead and devices' computation loads compared with the conventional FL while outperforming the latter in the case of overfitting and also the FL scheme with uniform dropout (i.e., identical subnets). △ Less

Submitted 5 February, 2022; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: This paper was accepted by IEEE Wireless Communications Letters

arXiv:2108.01020 [pdf, other]

RFC-HyPGCN: A Runtime Sparse Feature Compress Accelerator for Skeleton-Based GCNs Action Recognition Model with Hybrid Pruning

Authors: Dong Wen, **gfei Jiang, **wei Xu, Kang Wang, Tao Xiao, Yang Zhao, Yong Dou

Abstract: Skeleton-based Graph Convolutional Networks (GCNs) models for action recognition have achieved excellent prediction accuracy in the field. However, limited by large model and computation complexity, GCNs for action recognition like 2s-AGCN have insufficient power-efficiency and throughput on GPU. Thus, the demand of model reduction and hardware acceleration for low-power GCNs action recognition ap… ▽ More Skeleton-based Graph Convolutional Networks (GCNs) models for action recognition have achieved excellent prediction accuracy in the field. However, limited by large model and computation complexity, GCNs for action recognition like 2s-AGCN have insufficient power-efficiency and throughput on GPU. Thus, the demand of model reduction and hardware acceleration for low-power GCNs action recognition application becomes continuously higher. To address challenges above, this paper proposes a runtime sparse feature compress accelerator with hybrid pruning method: RFC-HyPGCN. First, this method skips both graph and spatial convolution workloads by reorganizing the multiplication order. Following spatial convolution workloads channel-pruning dataflow, a coarse-grained pruning method on temporal filters is designed, together with sampling-like fine-grained pruning on time dimension. Later, we come up with an architecture where all convolutional layers are mapped on chip to pursue high throughput. To further reduce storage resource utilization, online sparse feature compress format is put forward. Features are divided and encoded into several banks according to presented format, then bank storage is split into depth-variable mini-banks. Furthermore, this work applies quantization, input-skip** and intra-PE dynamic data scheduling to accelerate the model. In experiments, proposed pruning method is conducted on 2s-AGCN, acquiring 3.0x-8.4x model compression ratio and 73.20\% graph-skip** efficiency with balancing weight pruning. Implemented on Xilinx XCKU-115 FPGA, the proposed architecture has the peak performance of 1142 GOP/s and achieves up to 9.19x and 3.91x speedup over high-end GPU NVIDIA 2080Ti and NVIDIA V100, respectively. Compared with latest accelerator for action recognition GCNs models, our design reaches 22.9x speedup and 28.93\% improvement on DSP efficiency. △ Less

Submitted 2 August, 2021; originally announced August 2021.

Comments: 8 pages, 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP)

arXiv:2104.14126 [pdf, ps, other]

CASSOD-Net: Cascaded and Separable Structures of Dilated Convolution for Embedded Vision Systems and Applications

Authors: Tse-Wei Chen, Deyu Wang, Wei Tao, Dongchao Wen, Lingxiao Yin, Tadayuki Ito, Kinya Osa, Masami Kato

Abstract: The field of view (FOV) of convolutional neural networks is highly related to the accuracy of inference. Dilated convolutions are known as an effective solution to the problems which require large FOVs. However, for general-purpose hardware or dedicated hardware, it usually takes extra time to handle dilated convolutions compared with standard convolutions. In this paper, we propose a network modu… ▽ More The field of view (FOV) of convolutional neural networks is highly related to the accuracy of inference. Dilated convolutions are known as an effective solution to the problems which require large FOVs. However, for general-purpose hardware or dedicated hardware, it usually takes extra time to handle dilated convolutions compared with standard convolutions. In this paper, we propose a network module, Cascaded and Separable Structure of Dilated (CASSOD) Convolution, and a special hardware system to handle the CASSOD networks efficiently. A CASSOD-Net includes multiple cascaded $2 \times 2$ dilated filters, which can be used to replace the traditional $3 \times 3$ dilated filters without decreasing the accuracy of inference. Two example applications, face detection and image segmentation, are tested with dilated convolutions and the proposed CASSOD modules. The new network for face detection achieves higher accuracy than the previous work with only 47% of filter weights in the dilated convolution layers of the context module. Moreover, the proposed hardware system can accelerate the computations of dilated convolutions, and it is 2.78 times faster than traditional hardware systems when the filter size is $3 \times 3$. △ Less

Submitted 29 April, 2021; originally announced April 2021.

Comments: Camera-ready version for CVPR 2021 workshop (Embedded Vision Workshop)

arXiv:2104.14125 [pdf, ps, other]

doi 10.1007/978-3-030-68238-5_1

Hardware Architecture of Embedded Inference Accelerator and Analysis of Algorithms for Depthwise and Large-Kernel Convolutions

Authors: Tse-Wei Chen, Wei Tao, Deyu Wang, Dongchao Wen, Kinya Osa, Masami Kato

Abstract: In order to handle modern convolutional neural networks (CNNs) efficiently, a hardware architecture of CNN inference accelerator is proposed to handle depthwise convolutions and regular convolutions, which are both essential building blocks for embedded-computer-vision algorithms. Different from related works, the proposed architecture can support filter kernels with different sizes with high flex… ▽ More In order to handle modern convolutional neural networks (CNNs) efficiently, a hardware architecture of CNN inference accelerator is proposed to handle depthwise convolutions and regular convolutions, which are both essential building blocks for embedded-computer-vision algorithms. Different from related works, the proposed architecture can support filter kernels with different sizes with high flexibility since it does not require extra costs for intra-kernel parallelism, and it can generate convolution results faster than the architecture of the related works. The experimental results show the importance of supporting depthwise convolutions and dilated convolutions with the proposed hardware architecture. In addition to depthwise convolutions with large-kernels, a new structure called DDC layer, which includes the combination of depthwise convolutions and dilated convolutions, is also analyzed in this paper. For face detection, the computational costs decrease by 30%, and the model size decreases by 20% when the DDC layers are applied to the network. For image classification, the accuracy is increased by 1% by simply replacing $3 \times 3$ filters with $5 \times 5$ filters in depthwise convolutions. △ Less

Submitted 29 April, 2021; originally announced April 2021.

Comments: Camera-ready version for ECCV 2020 workshop (Embedded Vision Workshop)

Journal ref: ECCV 2020 Workshops, LNCS 12539, pp. 3-17, 2020

arXiv:2104.14124 [pdf, ps, other]

doi 10.1109/CVPRW.2019.00024

Condensation-Net: Memory-Efficient Network Architecture with Cross-Channel Pooling Layers and Virtual Feature Maps

Authors: Tse-Wei Chen, Motoki Yoshinaga, Hongxing Gao, Wei Tao, Dongchao Wen, Junjie Liu, Kinya Osa, Masami Kato

Abstract: "Lightweight convolutional neural networks" is an important research topic in the field of embedded vision. To implement image recognition tasks on a resource-limited hardware platform, it is necessary to reduce the memory size and the computational cost. The contribution of this paper is stated as follows. First, we propose an algorithm to process a specific network architecture (Condensation-Net… ▽ More "Lightweight convolutional neural networks" is an important research topic in the field of embedded vision. To implement image recognition tasks on a resource-limited hardware platform, it is necessary to reduce the memory size and the computational cost. The contribution of this paper is stated as follows. First, we propose an algorithm to process a specific network architecture (Condensation-Net) without increasing the maximum memory storage for feature maps. The architecture for virtual feature maps saves 26.5% of memory bandwidth by calculating the results of cross-channel pooling before storing the feature map into the memory. Second, we show that cross-channel pooling can improve the accuracy of object detection tasks, such as face detection, because it increases the number of filter weights. Compared with Tiny-YOLOv2, the improvement of accuracy is 2.0% for quantized networks and 1.5% for full-precision networks when the false-positive rate is 0.1. Last but not the least, the analysis results show that the overhead to support the cross-channel pooling with the proposed hardware architecture is negligible small. The extra memory cost to support Condensation-Net is 0.2% of the total size, and the extra gate count is only 1.0% of the total size. △ Less

Submitted 29 April, 2021; originally announced April 2021.

Comments: Camera-ready version for CVPR 2019 workshop (Embedded Vision Workshop)

Journal ref: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

arXiv:2012.11673 [pdf, ps, other]

Smoothed Gaussian Mixture Models for Video Classification and Recommendation

Authors: Sirjan Kafle, Aman Gupta, Xue Xia, Ananth Sankar, Xi Chen, Di Wen, Liang Zhang

Abstract: Cluster-and-aggregate techniques such as Vector of Locally Aggregated Descriptors (VLAD), and their end-to-end discriminatively trained equivalents like NetVLAD have recently been popular for video classification and action recognition tasks. These techniques operate by assigning video frames to clusters and then representing the video by aggregating residuals of frames with respect to the mean of… ▽ More Cluster-and-aggregate techniques such as Vector of Locally Aggregated Descriptors (VLAD), and their end-to-end discriminatively trained equivalents like NetVLAD have recently been popular for video classification and action recognition tasks. These techniques operate by assigning video frames to clusters and then representing the video by aggregating residuals of frames with respect to the mean of each cluster. Since some clusters may see very little video-specific data, these features can be noisy. In this paper, we propose a new cluster-and-aggregate method which we call smoothed Gaussian mixture model (SGMM), and its end-to-end discriminatively trained equivalent, which we call deep smoothed Gaussian mixture model (DSGMM). SGMM represents each video by the parameters of a Gaussian mixture model (GMM) trained for that video. Low-count clusters are addressed by smoothing the video-specific estimates with a universal background model (UBM) trained on a large number of videos. The primary benefit of SGMM over VLAD is smoothing which makes it less sensitive to small number of training samples. We show, through extensive experiments on the YouTube-8M classification task, that SGMM/DSGMM is consistently better than VLAD/NetVLAD by a small but statistically significant margin. We also show results using a dataset created at LinkedIn to predict if a member will watch an uploaded video. △ Less

Submitted 17 December, 2020; originally announced December 2020.

Comments: 11 pages, 3 figures, 7 tables

ACM Class: I.2.10

arXiv:2012.10711 [pdf, other]

Quantum reinforcement learning in continuous action space

Authors: Shaojun Wu, Shan **, Dingding Wen, Donghong Han, Xiaoting Wang

Abstract: Quantum reinforcement learning (QRL) is one promising algorithm proposed for near-term quantum devices. Early QRL proposals are effective at solving problems in discrete action space, but often suffer from the curse of dimensionality in the continuous domain due to discretization. To address this problem, we propose a quantum Deep Deterministic Policy Gradient algorithm that is efficient at solvin… ▽ More Quantum reinforcement learning (QRL) is one promising algorithm proposed for near-term quantum devices. Early QRL proposals are effective at solving problems in discrete action space, but often suffer from the curse of dimensionality in the continuous domain due to discretization. To address this problem, we propose a quantum Deep Deterministic Policy Gradient algorithm that is efficient at solving both classical and quantum sequential decision problems in the continuous domain. As an application, our method can solve the quantum state-generation problem in a single shot: it only requires a one-shot optimization to generate a model that outputs the desired control sequence for arbitrary target state. In comparison, the standard quantum control method requires optimizing for each target state. Moreover, our method can also be used to physically reconstruct an unknown quantum state. △ Less

Submitted 6 January, 2023; v1 submitted 19 December, 2020; originally announced December 2020.

Comments: 15 pages, 8 figures

arXiv:2010.04061 [pdf, other]

Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels

Authors: Dingzhu Wen, Ki-Jun Jeon, Mehdi Bennis, Kaibin Huang

Abstract: In this paper, we consider partitioned edge learning (PARTEL), which implements parameter-server training, a well known distributed learning method, in a wireless network. Thereby, PARTEL leverages distributed computation resources at edge devices to train a large-scale artificial intelligence (AI) model by dynamically partitioning the model into parametric blocks for separated updating at devices… ▽ More In this paper, we consider partitioned edge learning (PARTEL), which implements parameter-server training, a well known distributed learning method, in a wireless network. Thereby, PARTEL leverages distributed computation resources at edge devices to train a large-scale artificial intelligence (AI) model by dynamically partitioning the model into parametric blocks for separated updating at devices. Targeting broadband channels, we consider the joint control of parameter allocation, sub-channel allocation, and transmission power to improve the performance of PARTEL. Specifically, the policies for joint SUbcarrier, Parameter, and POweR allocaTion (SUPPORT) are optimized under the criterion of minimum learning latency. Two cases are considered. First, for the case of decomposable models (e.g., logistic regression), the latency-minimization problem is a mixed-integer program and non-convex. Due to its intractability, we develop a practical solution by integer relaxation and transforming it into an equivalent convex problem of model size maximization under a latency constraint. Thereby, a low-complexity algorithm is designed to compute the SUPPORT policy. Second, consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables. This, however, introduces constraints on model partitioning reducing the granularity of parameter allocation. The preceding policy is extended to DNN models by applying the proposed techniques of load rounding and proportional adjustment to rein in latency expansion caused by the load granularity constraints. △ Less

Submitted 18 March, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2009.13799 [pdf, other]

BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model

Authors: Junjie Liu, Dongchao Wen, Deyu Wang, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato

Abstract: Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing the effective and efficient training of BNNs is an unsolved problem. The main reason is that the estimated gradients produced by the Straight-Through-Estimator (STE) mismatches with the gradients of the real derivatives. In this paper, we provide an explicit convex optimization… ▽ More Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing the effective and efficient training of BNNs is an unsolved problem. The main reason is that the estimated gradients produced by the Straight-Through-Estimator (STE) mismatches with the gradients of the real derivatives. In this paper, we provide an explicit convex optimization example where training the BNNs with the traditionally adaptive optimization methods still faces the risk of non-convergence, and identify that constraining the range of gradients is critical for optimizing the deep binary model to avoid highly suboptimal solutions. For solving above issues, we propose a BAMSProd algorithm with a key observation that the convergence property of optimizing deep binary model is strongly related to the quantization errors. In brief, it employs an adaptive range constraint via an errors measurement for smoothing the gradients transition while follows the exponential moving strategy from AMSGrad to avoid errors accumulation during the optimization. The experiments verify the corollary of theoretical convergence analysis, and further demonstrate that our optimization method can speed up the convergence about 1:2x and boost the performance of BNNs to a significant level than the specific binary optimizer about 3:7%, even in a highly non-convex optimization problem. △ Less

Submitted 29 September, 2020; originally announced September 2020.

Comments: 10 pages, 4 figures, 2 tables

Journal ref: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

arXiv:2009.04626 [pdf, other]

QuantNet: Learning to Quantize by Learning within Fully Differentiable Framework

Authors: Junjie Liu, Dongchao Wen, Deyu Wang, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato

Abstract: Despite the achievements of recent binarization methods on reducing the performance degradation of Binary Neural Networks (BNNs), gradient mismatching caused by the Straight-Through-Estimator (STE) still dominates quantized networks. This paper proposes a meta-based quantizer named QuantNet, which utilizes a differentiable sub-network to directly binarize the full-precision weights without resorti… ▽ More Despite the achievements of recent binarization methods on reducing the performance degradation of Binary Neural Networks (BNNs), gradient mismatching caused by the Straight-Through-Estimator (STE) still dominates quantized networks. This paper proposes a meta-based quantizer named QuantNet, which utilizes a differentiable sub-network to directly binarize the full-precision weights without resorting to STE and any learnable gradient estimators. Our method not only solves the problem of gradient mismatching, but also reduces the impact of discretization errors, caused by the binarizing operation in the deployment, on performance. Generally, the proposed algorithm is implemented within a fully differentiable framework, and is easily extended to the general network quantization with any bits. The quantitative experiments on CIFAR-100 and ImageNet demonstrate that QuantNet achieves the signifficant improvements comparing with previous binarization methods, and even bridges gaps of accuracies between binarized models and full-precision models. △ Less

Submitted 9 September, 2020; originally announced September 2020.

Comments: Accepted for publication in ECCV Workshop 2020

arXiv:2006.15980 [pdf, ps, other]

Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems

Authors: Yuanhang Yu, Dong Wen, Ying Zhang, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin

Abstract: Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method. Heterogeneous systems with multi-core CPUs and GPUs have become more and more promising recently due to the prevalence of GPUs in general-purpose data-parallel applicatio… ▽ More Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method. Heterogeneous systems with multi-core CPUs and GPUs have become more and more promising recently due to the prevalence of GPUs in general-purpose data-parallel applications. Due to the large computational cost of MF, we aim to improve the efficiency of SGD-based MF computation by utilizing the massive parallel processing power of heterogeneous multiprocessors. The main challenge in parallel SGD algorithms on heterogeneous CPU-GPU systems lies in the granularity of the matrix division and the strategy to assign tasks. We design a novel strategy to divide the matrix into a set of blocks by considering two aspects. First, we observe that the matrix should be divided nonuniformly, and relatively large blocks should be assigned to GPUs to saturate the computing power of GPUs. In addition to exploiting the characteristics of hardware, the workloads assigned to two types of hardware should be balanced. Aiming at the final division strategy, we design a cost model tailored for our problem to accurately estimate the performance of hardware on different data sizes. A dynamic scheduling policy is also used to further balance workloads in practice. Extensive experiments show that our proposed algorithm achieves high efficiency with a high quality of training quality. △ Less

Submitted 24 June, 2020; originally announced June 2020.

arXiv:2004.00490 [pdf, other]

Scheduling for Cellular Federated Edge Learning with Importance and Channel Awareness

Authors: **ke Ren, Yinghui He, Dingzhu Wen, Guanding Yu, Kaibin Huang, Dongning Guo

Abstract: In cellular federated edge learning (FEEL), multiple edge devices holding local data jointly train a neural network by communicating learning updates with an access point without exchanging their data samples. With very limited communication resources, it is beneficial to schedule the most informative local learning updates. In this paper, a novel scheduling policy is proposed to exploit both dive… ▽ More In cellular federated edge learning (FEEL), multiple edge devices holding local data jointly train a neural network by communicating learning updates with an access point without exchanging their data samples. With very limited communication resources, it is beneficial to schedule the most informative local learning updates. In this paper, a novel scheduling policy is proposed to exploit both diversity in multiuser channels and diversity in the "importance" of the edge devices' learning updates. First, a new probabilistic scheduling framework is developed to yield unbiased update aggregation in FEEL. The importance of a local learning update is measured by its gradient divergence. If one edge device is scheduled in each communication round, the scheduling policy is derived in closed form to achieve the optimal trade-off between channel quality and update importance. The probabilistic scheduling framework is then extended to allow scheduling multiple edge devices in each communication round. Numerical results obtained using popular models and learning datasets demonstrate that the proposed scheduling policy can achieve faster model convergence and higher learning accuracy than conventional scheduling policies that only exploit a single type of diversity. △ Less

Submitted 23 June, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

Comments: This is an extended version of a submission to IEEE journal

arXiv:2003.04544 [pdf, other]

Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning

Authors: Dingzhu Wen, Mehdi Bennis, Kaibin Huang

Abstract: To leverage data and computation capabilities of mobile devices, machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models, resulting in the new paradigm of edge learning. In this paper, we consider the framework of partitioned edge learning for iteratively training a large-scale model using many resource-constrained devices (called workers). To… ▽ More To leverage data and computation capabilities of mobile devices, machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models, resulting in the new paradigm of edge learning. In this paper, we consider the framework of partitioned edge learning for iteratively training a large-scale model using many resource-constrained devices (called workers). To this end, in each iteration, the model is dynamically partitioned into parametric blocks, which are downloaded to worker groups for updating using data subsets. Then, the local updates are uploaded to and cascaded by the server for updating a global model. To reduce resource usage by minimizing the total learning-and-communication latency, this work focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation (for downloading and uploading). Two design approaches are adopted. First, a practical sequential approach, called partially integrated parameter-and-bandwidth allocation (PABA), yields two schemes, namely bandwidth aware parameter allocation and parameter aware bandwidth allocation. The former minimizes the load for the slowest (in computing) of worker groups, each training a same parametric block. The latter allocates the largest bandwidth to the worker being the latency bottleneck. Second, PABA are jointly optimized. Despite its being a nonconvex problem, an efficient and optimal solution algorithm is derived by intelligently nesting a bisection search and solving a convex problem. Experimental results using real data demonstrate that integrating PABA can substantially improve the performance of partitioned edge learning in terms of latency (by e.g., 46%) and accuracy (by e.g., 4%). △ Less

Submitted 29 June, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

arXiv:2003.00680 [pdf, other]

Graph3S: A Simple, Speedy and Scalable Distributed Graph Processing System

Authors: Xubo Wang, Lu Qin, Lijun Chang, Ying Zhang, Dong Wen, Xuemin Lin

Abstract: Graph is a ubiquitous structure in many domains. The rapidly increasing data volume calls for efficient and scalable graph data processing. In recent years, designing distributed graph processing systems has been an increasingly important area to fulfil the demands of processing big graphs in a distributed environment. Though a variety of distributed graph processing systems have been developed, v… ▽ More Graph is a ubiquitous structure in many domains. The rapidly increasing data volume calls for efficient and scalable graph data processing. In recent years, designing distributed graph processing systems has been an increasingly important area to fulfil the demands of processing big graphs in a distributed environment. Though a variety of distributed graph processing systems have been developed, very little attention has been paid to achieving a good combinational system performance in terms of usage simplicity, efficiency and scalability. To contribute to the study of distributed graph processing system, this work tries to fill this gap by designing a simple, speedy and scalable system. Our observation is that enforcing the communication flexibility of a system leads to the gains of both system efficiency and scalability as well as simple usage. We realize our idea in a system Graph3S and conduct extensive experiments with diverse algorithms over big graphs from different domains to test its performance. The results show that, besides simple usage, our system has outstanding performance over various graph algorithms and can even reach up to two orders of magnitude speedup over existing in-memory systems when applying to some algorithms. Also, its scalability is competitive to disk-based systems and even better when less machines are used. △ Less

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1911.08076 [pdf, other]

IFQ-Net: Integrated Fixed-point Quantization Networks for Embedded Vision

Authors: Hongxing Gao, Wei Tao, Dongchao Wen, Tse-Wei Chen, Kinya Osa, Masami Kato

Abstract: Deploying deep models on embedded devices has been a challenging problem since the great success of deep learning based networks. Fixed-point networks, which represent their data with low bits fixed-point and thus give remarkable savings on memory usage, are generally preferred. Even though current fixed-point networks employ relative low bits (e.g. 8-bits), the memory saving is far from enough fo… ▽ More Deploying deep models on embedded devices has been a challenging problem since the great success of deep learning based networks. Fixed-point networks, which represent their data with low bits fixed-point and thus give remarkable savings on memory usage, are generally preferred. Even though current fixed-point networks employ relative low bits (e.g. 8-bits), the memory saving is far from enough for the embedded devices. On the other hand, quantization deep networks, for example XNOR-Net and HWGQNet, quantize the data into 1 or 2 bits resulting in more significant memory savings but still contain lots of floatingpoint data. In this paper, we propose a fixed-point network for embedded vision tasks through converting the floatingpoint data in a quantization network into fixed-point. Furthermore, to overcome the data loss caused by the conversion, we propose to compose floating-point data operations across multiple layers (e.g. convolution, batch normalization and quantization layers) and convert them into fixedpoint. We name the fixed-point network obtained through such integrated conversion as Integrated Fixed-point Quantization Networks (IFQ-Net). We demonstrate that our IFQNet gives 2.16x and 18x more savings on model size and runtime feature map memory respectively with similar accuracy on ImageNet. Furthermore, based on YOLOv2, we design IFQ-Tinier-YOLO face detector which is a fixed-point network with 256x reduction in model size (246k Bytes) than Tiny-YOLO. We illustrate the promising performance of our face detector in terms of detection rate on Face Detection Data Set and Bencmark (FDDB) and qualitative results of detecting small faces of Wider Face dataset. △ Less

Submitted 18 November, 2019; originally announced November 2019.

Comments: 9 pages, 6 figures

Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018) Workshops

arXiv:1911.05341 [pdf, other]

DupNet: Towards Very Tiny Quantized CNN with Improved Accuracy for Face Detection

Authors: Hongxing Gao, Wei Tao, Dongchao Wen, Junjie Liu, Tse-Wei Chen, Kinya Osa, Masami Kato

Abstract: Deploying deep learning based face detectors on edge devices is a challenging task due to the limited computation resources. Even though binarizing the weights of a very tiny network gives impressive compactness on model size (e.g. 240.9 KB for IFQ-Tinier-YOLO), it is not tiny enough to fit in the embedded devices with strict memory constraints. In this paper, we propose DupNet which consists of t… ▽ More Deploying deep learning based face detectors on edge devices is a challenging task due to the limited computation resources. Even though binarizing the weights of a very tiny network gives impressive compactness on model size (e.g. 240.9 KB for IFQ-Tinier-YOLO), it is not tiny enough to fit in the embedded devices with strict memory constraints. In this paper, we propose DupNet which consists of two parts. Firstly, we employ weights with duplicated channels for the weight-intensive layers to reduce the model size. Secondly, for the quantization-sensitive layers whose quantization causes notable accuracy drop, we duplicate its input feature maps. It allows us to use more weights channels for convolving more representative outputs. Based on that, we propose a very tiny face detector, DupNet-Tinier-YOLO, which is 6.5X times smaller on model size and 42.0% less complex on computation and meanwhile achieves 2.4% higher detection than IFQ-Tinier-YOLO. Comparing with the full precision Tiny-YOLO, our DupNet-Tinier-YOLO gives 1,694.2X and 389.9X times savings on model size and computation complexity respectively with only 4.0% drop on detection rate (0.880 vs. 0.920). Moreover, our DupNet-Tinier-YOLO is only 36.9 KB, which is the tiniest deep face detector to our best knowledge. △ Less

Submitted 13 November, 2019; originally announced November 2019.

Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019) Workshops

arXiv:1911.05329 [pdf, other]

Knowledge Representing: Efficient, Sparse Representation of Prior Knowledge for Knowledge Distillation

Authors: Junjie Liu, Dongchao Wen, Hongxing Gao, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato

Abstract: Despite the recent works on knowledge distillation (KD) have achieved a further improvement through elaborately modeling the decision boundary as the posterior knowledge, their performance is still dependent on the hypothesis that the target network has a powerful capacity (representation ability). In this paper, we propose a knowledge representing (KR) framework mainly focusing on modeling the pa… ▽ More Despite the recent works on knowledge distillation (KD) have achieved a further improvement through elaborately modeling the decision boundary as the posterior knowledge, their performance is still dependent on the hypothesis that the target network has a powerful capacity (representation ability). In this paper, we propose a knowledge representing (KR) framework mainly focusing on modeling the parameters distribution as prior knowledge. Firstly, we suggest a knowledge aggregation scheme in order to answer how to represent the prior knowledge from teacher network. Through aggregating the parameters distribution from teacher network into more abstract level, the scheme is able to alleviate the phenomenon of residual accumulation in the deeper layers. Secondly, as the critical issue of what the most important prior knowledge is for better distilling, we design a sparse recoding penalty for constraining the student network to learn with the penalized gradients. With the proposed penalty, the student network can effectively avoid the over-regularization during knowledge distilling and converge faster. The quantitative experiments exhibit that the proposed framework achieves the state-ofthe-arts performance, even though the target network does not have the expected capacity. Moreover, the framework is flexible enough for combining with other KD methods based on the posterior knowledge. △ Less

Submitted 13 November, 2019; originally announced November 2019.

Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019)

arXiv:1911.03878 [pdf, other]

An Overview of Data-Importance Aware Radio Resource Management for Edge Machine Learning

Authors: Dingzhu Wen, Xiaoyang Li, Qunsong Zeng, **ke Ren, Kaibin Huang

Abstract: The 5G network connecting billions of Internet-of-Things (IoT) devices will make it possible to harvest an enormous amount of real-time mobile data. Furthermore, the 5G virtualization architecture will enable cloud computing at the (network) edge. The availability of both rich data and computation power at the edge has motivated Internet companies to deploy artificial intelligence (AI) there, crea… ▽ More The 5G network connecting billions of Internet-of-Things (IoT) devices will make it possible to harvest an enormous amount of real-time mobile data. Furthermore, the 5G virtualization architecture will enable cloud computing at the (network) edge. The availability of both rich data and computation power at the edge has motivated Internet companies to deploy artificial intelligence (AI) there, creating the hot area of edge-AI. Edge learning, the theme of this project, concerns training edge-AI models, which endow on IoT devices intelligence for responding to real-time events. However, the transmission of high-dimensional data from many edge devices to servers can result in excessive communication latency, creating a bottleneck for edge learning. Traditional wireless techniques deigned for only radio access are ineffective in tackling the challenge. Attempts to overcome the communication bottleneck has led to the development of a new class of techniques for intelligent radio resource management (RRM), called data-importance aware RRM. Their designs feature the interplay of active machine learning and wireless communication. Specifically, the metrics that measure data importance in active learning (e.g., classification uncertainty and data diversity) are applied to RRM for efficient acquisition of distributed data in wireless networks to train AI models at servers. This article aims at providing an introduction to the emerging area of importance-aware RRM. To this end, we will introduce the design principles, survey recent advancements in the area, discuss some design examples, and suggest some promising research opportunities. △ Less

Submitted 8 December, 2019; v1 submitted 10 November, 2019; originally announced November 2019.

Comments: This work is an invited paper for Journal of Communications and Information Networks

arXiv:1903.03762 [pdf, ps, other]

Mutual Clustering on Comparative Texts via Heterogeneous Information Networks

Authors: Jian** Cao, Senzhang Wang, Danyan Wen, Zhaohui Peng, Philip S. Yu, Fei-yue Wang

Abstract: Currently, many intelligence systems contain the texts from multi-sources, e.g., bulletin board system (BBS) posts, tweets and news. These texts can be ``comparative'' since they may be semantically correlated and thus provide us with different perspectives toward the same topics or events. To better organize the multi-sourced texts and obtain more comprehensive knowledge, we propose to study the… ▽ More Currently, many intelligence systems contain the texts from multi-sources, e.g., bulletin board system (BBS) posts, tweets and news. These texts can be ``comparative'' since they may be semantically correlated and thus provide us with different perspectives toward the same topics or events. To better organize the multi-sourced texts and obtain more comprehensive knowledge, we propose to study the novel problem of Mutual Clustering on Comparative Texts (MCCT), which aims to cluster the comparative texts simultaneously and collaboratively. The MCCT problem is difficult to address because 1) comparative texts usually present different data formats and structures and thus they are hard to organize, and 2) there lacks an effective method to connect the semantically correlated comparative texts to facilitate clustering them in an unified way. To this aim, in this paper we propose a Heterogeneous Information Network-based Text clustering framework HINT. HINT first models multi-sourced texts (e.g. news and tweets) as heterogeneous information networks by introducing the shared ``anchor texts'' to connect the comparative texts. Next, two similarity matrices based on HINT as well as a transition matrix for cross-text-source knowledge transfer are constructed. Comparative texts clustering are then conducted by utilizing the constructed matrices. Finally, a mutual clustering algorithm is also proposed to further unify the separate clustering results of the comparative texts by introducing a clustering consistency constraint. We conduct extensive experimental on three tweets-news datasets, and the results demonstrate the effectiveness and robustness of the proposed method in addressing the MCCT problem. △ Less

Submitted 9 March, 2019; originally announced March 2019.

Journal ref: Knowledge and Information System, 2019

Showing 1–50 of 56 results for author: Wen, D