Search | arXiv e-print repository

Unveiling the Unknown: Conditional Evidence Decoupling for Unknown Rejection

Authors: Zhaowei Wu, Binyi Su, Hua Zhang, Zhong Zhou

Abstract: In this paper, we focus on training an open-set object detector under the condition of scarce training samples, which should distinguish the known and unknown categories. Under this challenging scenario, the decision boundaries of unknowns are difficult to learn and often ambiguous. To mitigate this issue, we develop a novel open-set object detection framework, which delves into conditional eviden… ▽ More In this paper, we focus on training an open-set object detector under the condition of scarce training samples, which should distinguish the known and unknown categories. Under this challenging scenario, the decision boundaries of unknowns are difficult to learn and often ambiguous. To mitigate this issue, we develop a novel open-set object detection framework, which delves into conditional evidence decoupling for the unknown rejection. Specifically, we select pseudo-unknown samples by leveraging the discrepancy in attribution gradients between known and unknown classes, alleviating the inadequate unknown distribution coverage of training data. Subsequently, we propose a Conditional Evidence Decoupling Loss (CEDL) based on Evidential Deep Learning (EDL) theory, which decouples known and unknown properties in pseudo-unknown samples to learn distinct knowledge, enhancing separability between knowns and unknowns. Additionally, we propose an Abnormality Calibration Loss (ACL), which serves as a regularization term to adjust the output probability distribution, establishing robust decision boundaries for the unknown rejection. Our method has achieved the superiority performance over previous state-of-the-art approaches, improving the mean recall of unknown class by 7.24% across all shots in VOC10-5-5 dataset settings and 1.38% in VOC-COCO dataset settings. The code is available via https://github.com/zjzwzw/CED-FOOD. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.11092 [pdf, other]

Guaranteed Sampling Flexibility for Low-tubal-rank Tensor Completion

Authors: Bowen Su, Juntao You, HanQin Cai, Longxiu Huang

Abstract: While Bernoulli sampling is extensively studied in tensor completion, t-CUR sampling approximates low-tubal-rank tensors via lateral and horizontal subtensors. However, both methods lack sufficient flexibility for diverse practical applications. To address this, we introduce Tensor Cross-Concentrated Sampling (t-CCS), a novel and straightforward sampling model that advances the matrix cross-concen… ▽ More While Bernoulli sampling is extensively studied in tensor completion, t-CUR sampling approximates low-tubal-rank tensors via lateral and horizontal subtensors. However, both methods lack sufficient flexibility for diverse practical applications. To address this, we introduce Tensor Cross-Concentrated Sampling (t-CCS), a novel and straightforward sampling model that advances the matrix cross-concentrated sampling concept within a tensor framework. t-CCS effectively bridges the gap between Bernoulli and t-CUR sampling, offering additional flexibility that can lead to computational savings in various contexts. A key aspect of our work is the comprehensive theoretical analysis provided. We establish a sufficient condition for the successful recovery of a low-rank tensor from its t-CCS samples. In support of this, we also develop a theoretical framework validating the feasibility of t-CUR via uniform random sampling and conduct a detailed theoretical sampling complexity analysis for tensor completion problems utilizing the general Bernoulli sampling model. Moreover, we introduce an efficient non-convex algorithm, the Iterative t-CUR Tensor Completion (ITCURTC) algorithm, specifically designed to tackle the t-CCS-based tensor completion. We have intensively tested and validated the effectiveness of the t-CCS model and the ITCURTC algorithm across both synthetic and real-world datasets. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.04519 [pdf, other]

Multifidelity digital twin for real-time monitoring of structural dynamics in aquaculture net cages

Authors: Eirini Katsidoniotaki, Biao Su, Eleni Kelasidi, Themistoklis P. Sapsis

Abstract: As the global population grows and climate change intensifies, sustainable food production is critical. Marine aquaculture offers a viable solution, providing a sustainable protein source. However, the industry's expansion requires novel technologies for remote management and autonomous operations. Digital twin technology can advance the aquaculture industry, but its adoption has been limited. Fis… ▽ More As the global population grows and climate change intensifies, sustainable food production is critical. Marine aquaculture offers a viable solution, providing a sustainable protein source. However, the industry's expansion requires novel technologies for remote management and autonomous operations. Digital twin technology can advance the aquaculture industry, but its adoption has been limited. Fish net cages, which are flexible floating structures, are critical yet vulnerable components of aquaculture farms. Exposed to harsh and dynamic marine environments, the cages experience significant loads and risk damage, leading to fish escapes, environmental impacts, and financial losses. We propose a multifidelity surrogate modeling framework for integration into a digital twin for real-time monitoring of aquaculture net cage structural dynamics under stochastic marine conditions. Central to this framework is the nonlinear autoregressive Gaussian process method, which learns complex, nonlinear cross-correlations between models of varying fidelity. It combines low-fidelity simulation data with a small set of high-fidelity field sensor measurements, which offer the real dynamics but are costly and spatially sparse. Validated at the SINTEF ACE fish farm in Norway, our digital twin receives online metocean data and accurately predicts net cage displacements and mooring line loads, aligning closely with field measurements. The proposed framework is beneficial where application-specific data are scarce, offering rapid predictions and real-time system representation. The developed digital twin prevents potential damages by assessing structural integrity and facilitates remote operations with unmanned underwater vehicles. Our work also compares GP and GCNs for predicting net cage deformation, highlighting the latter's effectiveness in complex structural applications. △ Less

Submitted 10 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.19654 [pdf, other]

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Authors: **xia Yang, Bing Su, Wayne Xin Zhao, Ji-Rong Wen

Abstract: Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the M… ▽ More Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward map** classification (FMC) and reverse map** regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted at ICML 2024

arXiv:2405.01053 [pdf, other]

Explicitly Modeling Universality into Self-Supervised Learning

Authors: **gyao Wang, Wenwen Qiang, Zeen Song, Lingyu Si, Jiangmeng Li, Changwen Zheng, Bing Su

Abstract: The goal of universality in self-supervised learning (SSL) is to learn universal representations from unlabeled data and achieve excellent performance on all samples and tasks. However, these methods lack explicit modeling of the universality in the learning objective, and the related theoretical understanding remains limited. This may cause models to overfit in data-scarce situations and generali… ▽ More The goal of universality in self-supervised learning (SSL) is to learn universal representations from unlabeled data and achieve excellent performance on all samples and tasks. However, these methods lack explicit modeling of the universality in the learning objective, and the related theoretical understanding remains limited. This may cause models to overfit in data-scarce situations and generalize poorly in real life. To address these issues, we provide a theoretical definition of universality in SSL, which constrains both the learning and evaluation universality of the SSL models from the perspective of discriminability, transferability, and generalization. Then, we propose a $σ$-measurement to help quantify the score of one SSL model's universality. Based on the definition and measurement, we propose a general SSL framework, called GeSSL, to explicitly model universality into SSL. It introduces a self-motivated target based on $σ$-measurement, which enables the model to find the optimal update direction towards universality. Extensive theoretical and empirical evaluations demonstrate the superior performance of GeSSL. △ Less

Submitted 23 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: 28 pages, submitted to ICML24 with 7766

arXiv:2404.15131 [pdf, other]

Optimizing Multi-Touch Textile and Tactile Skin Sensing Through Circuit Parameter Estimation

Authors: Bo Ying Su, Yuchen Wu, Chengtao Wen, Changliu Liu

Abstract: Tactile and textile skin technologies have become increasingly important for enhancing human-robot interaction and allowing robots to adapt to different environments. Despite notable advancements, there are ongoing challenges in skin signal processing, particularly in achieving both accuracy and speed in dynamic touch sensing. This paper introduces a new framework that poses the touch sensing prob… ▽ More Tactile and textile skin technologies have become increasingly important for enhancing human-robot interaction and allowing robots to adapt to different environments. Despite notable advancements, there are ongoing challenges in skin signal processing, particularly in achieving both accuracy and speed in dynamic touch sensing. This paper introduces a new framework that poses the touch sensing problem as an estimation problem of resistive sensory arrays. Utilizing a Regularized Least Squares objective function which estimates the resistance distribution of the skin. We enhance the touch sensing accuracy and mitigate the ghosting effects, where false or misleading touches may be registered. Furthermore, our study presents a streamlined skin design that simplifies manufacturing processes without sacrificing performance. Experimental outcomes substantiate the effectiveness of our method, showing 26.9% improvement in multi-touch force-sensing accuracy for the tactile skin. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.04095 [pdf, other]

Dynamic Prompt Optimizing for Text-to-Image Generation

Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, Qing Yang

Abstract: Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fin… ▽ More Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the \textbf{P}rompt \textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024

arXiv:2404.00340 [pdf, other]

Deep Reinforcement Learning in Autonomous Car Path Planning and Control: A Survey

Authors: Yiyang Chen, Chao Ji, Yunrui Cai, Tong Yan, Bo Su

Abstract: Combining data-driven applications with control systems plays a key role in recent Autonomous Car research. This thesis offers a structured review of the latest literature on Deep Reinforcement Learning (DRL) within the realm of autonomous vehicle Path Planning and Control. It collects a series of DRL methodologies and algorithms and their applications in the field, focusing notably on their roles… ▽ More Combining data-driven applications with control systems plays a key role in recent Autonomous Car research. This thesis offers a structured review of the latest literature on Deep Reinforcement Learning (DRL) within the realm of autonomous vehicle Path Planning and Control. It collects a series of DRL methodologies and algorithms and their applications in the field, focusing notably on their roles in trajectory planning and dynamic control. In this review, we delve into the application outcomes of DRL technologies in this domain. By summarizing these literatures, we highlight potential challenges, aiming to offer insights that might aid researchers engaged in related fields. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2402.09204 [pdf, other]

Domain-adaptive and Subgroup-specific Cascaded Temperature Regression for Out-of-distribution Calibration

Authors: Jiexin Wang, Jiahao Chen, Bing Su

Abstract: Although deep neural networks yield high classification accuracy given sufficient training data, their predictions are typically overconfident or under-confident, i.e., the prediction confidences cannot truly reflect the accuracy. Post-hoc calibration tackles this problem by calibrating the prediction confidences without re-training the classification model. However, current approaches assume cong… ▽ More Although deep neural networks yield high classification accuracy given sufficient training data, their predictions are typically overconfident or under-confident, i.e., the prediction confidences cannot truly reflect the accuracy. Post-hoc calibration tackles this problem by calibrating the prediction confidences without re-training the classification model. However, current approaches assume congruence between test and validation data distributions, limiting their applicability to out-of-distribution scenarios. To this end, we propose a novel meta-set-based cascaded temperature regression method for post-hoc calibration. Our method tailors fine-grained scaling functions to distinct test sets by simulating various domain shifts through data augmentation on the validation set. We partition each meta-set into subgroups based on predicted category and confidence level, capturing diverse uncertainties. A regression network is then trained to derive category-specific and confidence-level-specific scaling, achieving calibration across meta-sets. Extensive experimental results on MNIST, CIFAR-10, and TinyImageNet demonstrate the effectiveness of the proposed method. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Journal ref: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Seoul, Korea

arXiv:2401.15566 [pdf, other]

On the Robustness of Cross-Concentrated Sampling for Matrix Completion

Authors: HanQin Cai, Longxiu Huang, Chandra Kundu, Bowen Su

Abstract: Matrix completion is one of the crucial tools in modern data science research. Recently, a novel sampling model for matrix completion coined cross-concentrated sampling (CCS) has caught much attention. However, the robustness of the CCS model against sparse outliers remains unclear in the existing studies. In this paper, we aim to answer this question by exploring a novel Robust CCS Completion pro… ▽ More Matrix completion is one of the crucial tools in modern data science research. Recently, a novel sampling model for matrix completion coined cross-concentrated sampling (CCS) has caught much attention. However, the robustness of the CCS model against sparse outliers remains unclear in the existing studies. In this paper, we aim to answer this question by exploring a novel Robust CCS Completion problem. A highly efficient non-convex iterative algorithm, dubbed Robust CUR Completion (RCURC), is proposed. The empirical performance of the proposed algorithm, in terms of both efficiency and robustness, is verified in synthetic and real datasets. △ Less

Submitted 27 January, 2024; originally announced January 2024.

Comments: 58th Annual Conference of Information Sciences and Systems

arXiv:2310.19973 [pdf, other]

Unified Enhancement of Privacy Bounds for Mixture Mechanisms via $f$-Differential Privacy

Authors: Chendi Wang, Buxin Su, Jiayuan Ye, Reza Shokri, Weijie J. Su

Abstract: Differentially private (DP) machine learning algorithms incur many sources of randomness, such as random initialization, random batch subsampling, and shuffling. However, such randomness is difficult to take into account when proving differential privacy bounds because it induces mixture distributions for the algorithm's output that are difficult to analyze. This paper focuses on improving privacy… ▽ More Differentially private (DP) machine learning algorithms incur many sources of randomness, such as random initialization, random batch subsampling, and shuffling. However, such randomness is difficult to take into account when proving differential privacy bounds because it induces mixture distributions for the algorithm's output that are difficult to analyze. This paper focuses on improving privacy bounds for shuffling models and one-iteration differentially private gradient descent (DP-GD) with random initializations using $f$-DP. We derive a closed-form expression of the trade-off function for shuffling models that outperforms the most up-to-date results based on $(ε,δ)$-DP. Moreover, we investigate the effects of random initialization on the privacy of one-iteration DP-GD. Our numerical computations of the trade-off function indicate that random initialization can enhance the privacy of DP-GD. Our analysis of $f$-DP guarantees for these mixture mechanisms relies on an inequality for trade-off functions introduced in this paper. This inequality implies the joint convexity of $F$-divergences. Finally, we study an $f$-DP analog of the advanced joint convexity of the hockey-stick divergence related to $(ε,δ)$-DP and apply it to analyze the privacy of mixture mechanisms. △ Less

Submitted 1 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2308.06358 [pdf, other]

CA2: Cyber Attacks Analytics

Authors: Luyu Cheng, Bairui Su, Yumeng Xue, Xiaoyu Liu, Yunhai Wang

Abstract: The VAST Challenge 2020 Mini-Challenge 1 requires participants to identify the responsible white hat groups behind a fictional Internet outage. To address this task, we have created a visual analytics system named CA2: Cyber Attacks Analytics. This system is designed to efficiently compare and match subgraphs within an extensive graph containing anonymized profiles. Additionally, we showcase an it… ▽ More The VAST Challenge 2020 Mini-Challenge 1 requires participants to identify the responsible white hat groups behind a fictional Internet outage. To address this task, we have created a visual analytics system named CA2: Cyber Attacks Analytics. This system is designed to efficiently compare and match subgraphs within an extensive graph containing anonymized profiles. Additionally, we showcase an iterative workflow that utilizes our system's capabilities to pinpoint the responsible group. △ Less

Submitted 11 August, 2023; originally announced August 2023.

Comments: IEEE Conference on Visual Analytics Science and Technology (VAST) Challenge Workshop 2020

arXiv:2308.05648 [pdf, other]

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

Authors: Zezhong Lv, Bing Su, Ji-Rong Wen

Abstract: Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query. Weakly supervised methods gains attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges encountered by the weakly supervised method is implied in the mismatch between the video and language i… ▽ More Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query. Weakly supervised methods gains attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges encountered by the weakly supervised method is implied in the mismatch between the video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities driven by reconstructing masked queries between positive and negative video proposals. However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning since the masked words are not completely reconstructed from the cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel proposed counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality and query knowledge. Then by introducing counterfactual cross-modality knowledge into this aggregation, the spurious impact of the unmasked part contributing to the reconstruction is explicitly modeled. Finally, by suppressing the unimodal effect of masked query, we can rectify the reconstructions of video proposals to perform reasonable contrastive learning. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at \href{https://github.com/sLdZ0306/CCR}{https://github.com/sLdZ0306/CCR}. △ Less

Submitted 14 October, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted by ACM MM 2023

arXiv:2308.03950 [pdf, other]

doi 10.1145/3581783.3611888

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

Authors: Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, Jiaqi Wang

Abstract: Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector, with subsequent map** the features to an identical anchor point within t… ▽ More Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a singular feature vector, with subsequent map** the features to an identical anchor point within the embedded space. Their performance is hindered by 1) the ignorance of the global visual/semantic distribution alignment, which results in a limitation to capture the true interdependence between the two spaces. 2) the negligence of temporal information since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between visual and semantic space for distribution alignment; 2) we leverage the temporal information for estimating the MI by encouraging MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE. △ Less

Submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted by ACM MM 2023

arXiv:2308.03072 [pdf, other]

Customizing Textile and Tactile Skins for Interactive Industrial Robots

Authors: Bo Ying Su, Zhongqi Wei, James McCann, Wenzhen Yuan, Changliu Liu

Abstract: Tactile skins made from textiles enhance robot-human interaction by localizing contact points and measuring contact forces. This paper presents a solution for rapidly fabricating, calibrating, and deploying these skins on industrial robot arms. The novel automated skin calibration procedure maps skin locations to robot geometry and calibrates contact force. Through experiments on a FANUC LR Mate 2… ▽ More Tactile skins made from textiles enhance robot-human interaction by localizing contact points and measuring contact forces. This paper presents a solution for rapidly fabricating, calibrating, and deploying these skins on industrial robot arms. The novel automated skin calibration procedure maps skin locations to robot geometry and calibrates contact force. Through experiments on a FANUC LR Mate 200id/7L industrial robot, we demonstrate that tactile skins made from textiles can be effectively used for human-robot interaction in industrial environments, and can provide unique opportunities in robot control and learning, making them a promising technology for enhancing robot perception and interaction. △ Less

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2308.01850 [pdf, other]

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Authors: Zhao Yang, Bing Su, Ji-Rong Wen

Abstract: Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face… ▽ More Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at \href{https://github.com/yangzhao1230/PCMDM}{https://github.com/yangzhao1230/PCMDM}. △ Less

Submitted 3 August, 2023; originally announced August 2023.

Comments: Accepted at ACM MM 2023

arXiv:2308.01097 [pdf, other]

doi 10.1145/3581783.3612330

Spatio-Temporal Branching for Motion Prediction using Motion Increments

Authors: Jiexin Wang, Yujie Zhou, Wenwen Qiang, Ying Ba, Bing Su, Ji-Rong Wen

Abstract: Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications, but it remains a challenging task due to the stochastic and aperiodic nature of future poses. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved suc… ▽ More Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications, but it remains a challenging task due to the stochastic and aperiodic nature of future poses. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved success by learning spatio-temporal representations of motion, but these models often overlook the reliability of motion data. Additionally, the temporal and spatial dependencies of skeleton nodes are distinct. The temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between different nodes. In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our approach effectively reduces noise interference and provides more expressive information for characterizing motion by separately extracting temporal and spatial features. We evaluate our approach on standard HMP benchmarks and outperform state-of-the-art methods in terms of prediction accuracy. △ Less

Submitted 11 August, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Journal ref: ACM MM 2023

arXiv:2305.12618 [pdf, other]

Atomic and Subgraph-aware Bilateral Aggregation for Molecular Representation Learning

Authors: Jiahao Chen, Yurou Liu, Jiangmeng Li, Bing Su, Jirong Wen

Abstract: Molecular representation learning is a crucial task in predicting molecular properties. Molecules are often modeled as graphs where atoms and chemical bonds are represented as nodes and edges, respectively, and Graph Neural Networks (GNNs) have been commonly utilized to predict atom-related properties, such as reactivity and solubility. However, functional groups (subgraphs) are closely related to… ▽ More Molecular representation learning is a crucial task in predicting molecular properties. Molecules are often modeled as graphs where atoms and chemical bonds are represented as nodes and edges, respectively, and Graph Neural Networks (GNNs) have been commonly utilized to predict atom-related properties, such as reactivity and solubility. However, functional groups (subgraphs) are closely related to some chemical properties of molecules, such as efficacy, and metabolic properties, which cannot be solely determined by individual atoms. In this paper, we introduce a new model for molecular representation learning called the Atomic and Subgraph-aware Bilateral Aggregation (ASBA), which addresses the limitations of previous atom-wise and subgraph-wise models by incorporating both types of information. ASBA consists of two branches, one for atom-wise information and the other for subgraph-wise information. Considering existing atom-wise GNNs cannot properly extract invariant subgraph features, we propose a decomposition-polymerization GNN architecture for the subgraph-wise branch. Furthermore, we propose cooperative node-level and graph-level self-supervised learning strategies for ASBA to improve its generalization. Our method offers a more comprehensive way to learn representations for molecular property prediction and has broad potential in drug and material discovery applications. Extensive experiments have demonstrated the effectiveness of our method. △ Less

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2304.08288 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095211

Toward Auto-evaluation with Confidence-based Category Relation-aware Regression

Authors: Jiexin Wang, Jiahao Chen, Bing Su

Abstract: Auto-evaluation aims to automatically evaluate a trained model on any test dataset without human annotations. Most existing methods utilize global statistics of features extracted by the model as the representation of a dataset. This ignores the influence of the classification head and loses category-wise confusion information of the model. However, ratios of instances assigned to different catego… ▽ More Auto-evaluation aims to automatically evaluate a trained model on any test dataset without human annotations. Most existing methods utilize global statistics of features extracted by the model as the representation of a dataset. This ignores the influence of the classification head and loses category-wise confusion information of the model. However, ratios of instances assigned to different categories together with their confidence scores reflect how many instances in which categories are difficult for the model to classify, which contain significant indicators for both overall and category-wise performances. In this paper, we propose a Confidence-based Category Relation-aware Regression ($C^2R^2$) method. $C^2R^2$ divides all instances in a meta-set into different categories according to their confidence scores and extracts the global representation from them. For each category, $C^2R^2$ encodes its local confusion relations to other categories into a local representation. The overall and category-wise performances are regressed from global and local representations, respectively. Extensive experiments show the effectiveness of our method. △ Less

Submitted 9 May, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5

arXiv:2304.06537 [pdf, other]

Transfer Knowledge from Head to Tail: Uncertainty Calibration under Long-tailed Distribution

Authors: Jiahao Chen, Bing Su

Abstract: How to estimate the uncertainty of a given model is a crucial problem. Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, but ignore the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating the model trained from a long-tailed distribution.… ▽ More How to estimate the uncertainty of a given model is a crucial problem. Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, but ignore the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating the model trained from a long-tailed distribution. Due to the difference between the imbalanced training distribution and balanced test distribution, existing calibration methods such as temperature scaling can not generalize well to this problem. Specific calibration methods for domain adaptation are also not applicable because they rely on unlabeled target domain instances which are not available. Models trained from a long-tailed distribution tend to be more overconfident to head classes. To this end, we propose a novel knowledge-transferring-based calibration method by estimating the importance weights for samples of tail classes to realize long-tailed calibration. Our method models the distribution of each class as a Gaussian distribution and views the source statistics of head classes as a prior to calibrate the target distributions of tail classes. We adaptively transfer knowledge from head classes to get the target probability density of tail classes. The importance weight is estimated by the ratio of the target probability density over the source probability density. Extensive experiments on CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of our method. △ Less

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2302.09018 [pdf, other]

doi 10.1609/aaai.v37i3.25495

Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Authors: Yujie Zhou, Haodong Duan, Anyi Rao, Bing Su, Jiaqi Wang

Abstract: Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly focus on applying global data augmentation to generate different views of the skeleton sequence for contrastive learning. However, due to the rich action clues in the skeleton sequences, existing methods may only take a global perspective to lear… ▽ More Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly focus on applying global data augmentation to generate different views of the skeleton sequence for contrastive learning. However, due to the rich action clues in the skeleton sequences, existing methods may only take a global perspective to learn to discriminate different skeletons without thoroughly leveraging the local relationship between different skeleton joints and video frames, which is essential for real-world applications. In this work, we propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship from a partial skeleton sequences built by a unique spatio-temporal masking strategy. Specifically, we construct a negative-sample-free triplet steam structure that is composed of an anchor stream without any masking, a spatial masking stream with Central Spatial Masking (CSM), and a temporal masking stream with Motion Attention Temporal Masking (MATM). The feature cross-correlation matrix is measured between the anchor stream and the other two masking streams, respectively. (1) Central Spatial Masking discards selected joints from the feature calculation process, where the joints with a higher degree of centrality have a higher possibility of being selected. (2) Motion Attention Temporal Masking leverages the motion of action and remove frames that move faster with a higher possibility. Our method achieves SOTA performance on NTU-60, NTU-120 and PKU-MMD under various downstream tasks. A practical evaluation is performed where some skeleton joints are lost in downstream tasks. In contrast to previous methods that suffer from large performance drops, our PSTL can still achieve remarkable results, validating the robustness of our method. Code: https://github.com/YujieOuO/PSTL.git. △ Less

Submitted 22 February, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

Comments: Accepted by AAAI 2023(Oral)

arXiv:2302.05787 [pdf, other]

Differentially Private Normalizing Flows for Density Estimation, Data Synthesis, and Variational Inference with Application to Electronic Health Records

Authors: Bingyue Su, Yu Wang, Daniele E. Schiavazzi, Fang Liu

Abstract: Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preservi… ▽ More Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preserving synthetic data are generated. We apply the technique to an EHR dataset containing patients with pulmonary hypertension. We assess the learning and inferential utility of the synthetic data by comparing the accuracy in the prediction of the hypertension status and variational posterior distribution of the parameters of a physics-based model. In addition, we use a simulated dataset from a nonlinear model to compare the results from variational inference (VI) based on privacy-preserving synthetic data, and privacy-preserving VI obtained from directly privatizing NFs for VI with DP guarantees given the original non-private dataset. The results suggest that synthetic data generated through differentially private density estimation with NF can yield good utility at a reasonable privacy cost. We also show that VI obtained from differentially private NF based on the free energy bound loss may produce variational approximations with significantly altered correlation structure, and loss formulations based on alternative dissimilarity metrics between two distributions might provide improved results. △ Less

Submitted 11 February, 2023; originally announced February 2023.

arXiv:2210.15996 [pdf, other]

Towards Generalized Few-Shot Open-Set Object Detection

Authors: Binyi Su, Hua Zhang, **gzhi Li, Zhong Zhou

Abstract: Open-set object detection (OSOD) aims to detect the known categories and reject unknown objects in a dynamic world, which has achieved significant attention. However, previous approaches only consider this problem in data-abundant conditions, while neglecting the few-shot scenes. In this paper, we seek a solution for the generalized few-shot open-set object detection (G-FOOD), which aims to avoid… ▽ More Open-set object detection (OSOD) aims to detect the known categories and reject unknown objects in a dynamic world, which has achieved significant attention. However, previous approaches only consider this problem in data-abundant conditions, while neglecting the few-shot scenes. In this paper, we seek a solution for the generalized few-shot open-set object detection (G-FOOD), which aims to avoid detecting unknown classes as known classes with a high confidence score while maintaining the performance of few-shot detection. The main challenge for this task is that few training samples induce the model to overfit on the known classes, resulting in a poor open-set performance. We propose a new G-FOOD algorithm to tackle this issue, named \underline{F}ew-sh\underline{O}t \underline{O}pen-set \underline{D}etector (FOOD), which contains a novel class weight sparsification classifier (CWSC) and a novel unknown decoupling learner (UDL). To prevent over-fitting, CWSC randomly sparses parts of the normalized weights for the logit prediction of all classes, and then decreases the co-adaptability between the class and its neighbors. Alongside, UDL decouples training the unknown class and enables the model to form a compact unknown decision boundary. Thus, the unknown objects can be identified with a confidence probability without any threshold, prototype, or generation. We compare our method with several state-of-the-art OSOD methods in few-shot scenes and observe that our method improves the F-score of unknown classes by 4.80\%-9.08\% across all shots in VOC-COCO dataset settings \footnote[1]{The source code is available at \url{https://github.com/binyisu/food}}. △ Less

Submitted 21 February, 2024; v1 submitted 28 October, 2022; originally announced October 2022.

arXiv:2210.13446 [pdf, other]

Flying Trot Control Method for Quadruped Robot Based on Trajectory Planning

Authors: Hongge Wang, Hui Chai, Bin Chen, Aizhen Xie, Rui Song, Bo Su

Abstract: An intuitive control method for the flying trot, which combines offline trajectory planning with real-time balance control, is presented. The motion features of running animals in the vertical direction were analysed using the spring-load-inverted-pendulum (SLIP) model, and the foot trajectory of the robot was planned, so the robot could run similar to an animal capable of vertical flight, accordi… ▽ More An intuitive control method for the flying trot, which combines offline trajectory planning with real-time balance control, is presented. The motion features of running animals in the vertical direction were analysed using the spring-load-inverted-pendulum (SLIP) model, and the foot trajectory of the robot was planned, so the robot could run similar to an animal capable of vertical flight, according to the given height and speed of the trunk. To improve the robustness of running, a posture control method based on a foot acceleration adjustment is proposed. A novel kinematic based CoM observation method and CoM regulation method is present to enhance the stability of locomotion. To reduce the impact force when the robot interacts with the environment, the virtual model control method is used in the control of the foot trajectory to achieve active compliance. By selecting the proper parameters for the virtual model, the oscillation motion of the virtual model and the planning motion of the support foot are synchronized to avoid the large disturbance caused by the oscillation motion of the virtual model in relation to the robot motion. The simulation and experiment using the quadruped robot Billy are reported. In the experiment, the maximum speed of the robot could reach 4.73 times the body length per second, which verified the feasibility of the control method. △ Less

Submitted 24 October, 2022; originally announced October 2022.

Comments: 30 pages, 20 figures, journal

arXiv:2209.07902 [pdf, other]

MetaMask: Revisiting Dimensional Confounder for Self-Supervised Learning

Authors: Jiangmeng Li, Wenwen Qiang, Yanan Zhang, Wenyi Mo, Changwen Zheng, Bing Su, Hui Xiong

Abstract: As a successful approach to self-supervised learning, contrastive learning aims to learn invariant information shared among distortions of the input sample. While contrastive learning has yielded continuous advancements in sampling strategy and architecture design, it still remains two persistent defects: the interference of task-irrelevant information and sample inefficiency, which are related to… ▽ More As a successful approach to self-supervised learning, contrastive learning aims to learn invariant information shared among distortions of the input sample. While contrastive learning has yielded continuous advancements in sampling strategy and architecture design, it still remains two persistent defects: the interference of task-irrelevant information and sample inefficiency, which are related to the recurring existence of trivial constant solutions. From the perspective of dimensional analysis, we find out that the dimensional redundancy and dimensional confounder are the intrinsic issues behind the phenomena, and provide experimental evidence to support our viewpoint. We further propose a simple yet effective approach MetaMask, short for the dimensional Mask learned by Meta-learning, to learn representations against dimensional redundancy and confounder. MetaMask adopts the redundancy-reduction technique to tackle the dimensional redundancy issue and innovatively introduces a dimensional mask to reduce the gradient effects of specific dimensions containing the confounder, which is trained by employing a meta-learning paradigm with the objective of improving the performance of masked representations on a typical self-supervised task. We provide solid theoretical analyses to prove MetaMask can obtain tighter risk bounds for downstream classification compared to typical contrastive methods. Empirically, our method achieves state-of-the-art performance on various benchmarks. △ Less

Submitted 9 August, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: Accepted by NeurIPS 2022 as Spotlight

arXiv:2209.07811 [pdf, other]

doi 10.1109/TKDE.2022.3198746

Modeling Multiple Views via Implicitly Preserving Global Consistency and Local Complementarity

Authors: Jiangmeng Li, Wenwen Qiang, Changwen Zheng, Bing Su, Farid Razzak, Ji-Rong Wen, Hui Xiong

Abstract: While self-supervised learning techniques are often used to mining implicit knowledge from unlabeled data via modeling multiple views, it is unclear how to perform effective representation learning in a complex and inconsistent context. To this end, we propose a methodology, specifically consistency and complementarity network (CoCoNet), which avails of strict global inter-view consistency and loc… ▽ More While self-supervised learning techniques are often used to mining implicit knowledge from unlabeled data via modeling multiple views, it is unclear how to perform effective representation learning in a complex and inconsistent context. To this end, we propose a methodology, specifically consistency and complementarity network (CoCoNet), which avails of strict global inter-view consistency and local cross-view complementarity preserving regularization to comprehensively learn representations from multiple views. On the global stage, we reckon that the crucial knowledge is implicitly shared among views, and enhancing the encoder to capture such knowledge from data can improve the discriminability of the learned representations. Hence, preserving the global consistency of multiple views ensures the acquisition of common knowledge. CoCoNet aligns the probabilistic distribution of views by utilizing an efficient discrepancy metric measurement based on the generalized sliced Wasserstein distance. Lastly on the local stage, we propose a heuristic complementarity-factor, which joints cross-view discriminative knowledge, and it guides the encoders to learn not only view-wise discriminability but also cross-view complementary information. Theoretically, we provide the information-theoretical-based analyses of our proposed CoCoNet. Empirically, to investigate the improvement gains of our approach, we conduct adequate experimental validations, which demonstrate that CoCoNet outperforms the state-of-the-art self-supervised methods by a significant margin proves that such implicit consistency and complementarity preserving regularization can enhance the discriminability of latent representations. △ Less

Submitted 9 August, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE) 2022; Refer to https://ieeexplore.ieee.org/document/9857632

arXiv:2209.06419 [pdf, other]

Frequency Reversal Alamouti Code-Based FBMC with Resilience to Inter-Antenna Frequency Offsets

Authors: Cheng-Yu Lin, Borching Su, Kwonhue Choi

Abstract: Transmit diversity schemes for filter bank multicarrier (FBMC) are known to be challenging. No existing schemes have considered the presence of inter-antenna frequency offset (IAFO), which will result in performance degradation. In this letter, a new transmit scheme based on the frequency reversal Alamouti code (FRAC)-based structure to address the issue of IAFO is proposed and is proven to inhere… ▽ More Transmit diversity schemes for filter bank multicarrier (FBMC) are known to be challenging. No existing schemes have considered the presence of inter-antenna frequency offset (IAFO), which will result in performance degradation. In this letter, a new transmit scheme based on the frequency reversal Alamouti code (FRAC)-based structure to address the issue of IAFO is proposed and is proven to inherently cancel the inter-antenna inter-carrier interference (ICI) while preserving spatial diversity. Moreover, the proposed FRAC structure is applicable in frequency-selective channels. Numerical results show that the proposed scheme undergoes negligible bit error rate (BER) degradation even with considerable IAFOs. △ Less

Submitted 14 September, 2022; originally announced September 2022.

arXiv:2209.05481 [pdf, other]

A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

Authors: Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, Ji-Rong Wen

Abstract: Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire the single cognitive ability from the single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities including both intuitive diagrams and professional texts to assist their unders… ▽ More Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire the single cognitive ability from the single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities including both intuitive diagrams and professional texts to assist their understanding. Inspired by this, we propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data (crawled from published Scientific Citation Index papers) via contrastive learning. This AI model represents a critical attempt that directly bridges molecular graphs and natural language. Importantly, through capturing the specific and complementary information of the two modalities, our proposed model can better grasp molecular expertise. Experimental results show that our model not only exhibits promising performance in cross-modal tasks such as cross-modal retrieval and molecule caption, but also enhances molecular property prediction and possesses capability to generate meaningful molecular graphs from natural language descriptions. We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine, among others. △ Less

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2207.02454 [pdf, other]

Ordinal Regression via Binary Preference vs Simple Regression: Statistical and Experimental Perspectives

Authors: Bin Su, Shaoguang Mao, Frank Soong, Zhiyong Wu

Abstract: Ordinal regression with anchored reference samples (ORARS) has been proposed for predicting the subjective Mean Opinion Score (MOS) of input stimuli automatically. The ORARS addresses the MOS prediction problem by pairing a test sample with each of the pre-scored anchored reference samples. A trained binary classifier is then used to predict which sample, test or anchor, is better statistically. P… ▽ More Ordinal regression with anchored reference samples (ORARS) has been proposed for predicting the subjective Mean Opinion Score (MOS) of input stimuli automatically. The ORARS addresses the MOS prediction problem by pairing a test sample with each of the pre-scored anchored reference samples. A trained binary classifier is then used to predict which sample, test or anchor, is better statistically. Posteriors of the binary preference decision are then used to predict the MOS of the test sample. In this paper, rigorous framework, analysis, and experiments to demonstrate that ORARS are advantageous over simple regressions are presented. The contributions of this work are: 1) Show that traditional regression can be reformulated into multiple preference tests to yield a better performance, which is confirmed with simulations experimentally; 2) Generalize ORARS to other regression problems and verify its effectiveness; 3) Provide some prerequisite conditions which can insure proper application of ORARS. △ Less

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2206.14702 [pdf, other]

Interventional Contrastive Learning with Meta Semantic Regularizer

Authors: Wenwen Qiang, Jiangmeng Li, Changwen Zheng, Bing Su, Hui Xiong

Abstract: Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner. Although the prevailing CL model has achieved great progress, in this paper, we uncover an ever-overlooked phenomenon: When the CL model is trained with full images, the performance tested in full images is better than that in foreground areas; when the CL model is trained with foregr… ▽ More Contrastive learning (CL)-based self-supervised learning models learn visual representations in a pairwise manner. Although the prevailing CL model has achieved great progress, in this paper, we uncover an ever-overlooked phenomenon: When the CL model is trained with full images, the performance tested in full images is better than that in foreground areas; when the CL model is trained with foreground areas, the performance tested in full images is worse than that in foreground areas. This observation reveals that backgrounds in images may interfere with the model learning semantic information and their influence has not been fully eliminated. To tackle this issue, we build a Structural Causal Model (SCM) to model the background as a confounder. We propose a backdoor adjustment-based regularization method, namely Interventional Contrastive Learning with Meta Semantic Regularizer (ICL-MSR), to perform causal intervention towards the proposed SCM. ICL-MSR can be incorporated into any existing CL methods to alleviate background distractions from representation learning. Theoretically, we prove that ICL-MSR achieves a tighter error bound. Empirically, our experiments on multiple benchmark datasets demonstrate that ICL-MSR is able to improve the performances of different state-of-the-art CL methods. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted by ICML 2022

arXiv:2206.10207 [pdf, other]

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

Authors: Gang Li, Heliang Zheng, Daqing Liu, Chaoyue Wang, Bing Su, Changwen Zheng

Abstract: Recently, significant progress has been made in masked image modeling to catch up to masked language modeling. However, unlike words in NLP, the lack of semantic decomposition of images still makes masked autoencoding (MAE) different between vision and language. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the trai… ▽ More Recently, significant progress has been made in masked image modeling to catch up to masked language modeling. However, unlike words in NLP, the lack of semantic decomposition of images still makes masked autoencoding (MAE) different between vision and language. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy. Compared to widely adopted random masking, our masking strategy can gradually guide the network to learn various information, i.e., from intra-part patterns to inter-part relations. In particular, we achieve this in two steps. 1) Semantic part learning: we design a self-supervised part learning method to obtain semantic parts by leveraging and refining the multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE) training: we design a masking strategy that varies from masking a portion of patches in each part to masking a portion of (whole) parts in an image. Extensive experiments on various vision tasks show that SemMAE can learn better image representation by integrating semantic information. In particular, SemMAE achieves 84.5% fine-tuning accuracy on ImageNet-1k, which outperforms the vanilla MAE by 1.4%. In the semantic segmentation and fine-grained recognition tasks, SemMAE also brings significant improvements and yields the state-of-the-art performance. △ Less

Submitted 5 October, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Accepted by NeurIPS 2022

arXiv:2205.14407 [pdf, ps, other]

An efficient polynomial-time approximation scheme for parallel multi-stage open shops

Authors: Jianming Dong, Ruyan **, Guohui Lin, Bing Su, Weitian Tong, Yao Xu

Abstract: Various new scheduling problems have been arising from practical production processes and spawning new research areas in the scheduling field. We study the parallel multi-stage open shops problem, which generalizes the classic open shop scheduling and parallel machine scheduling problems. Given m identical k-stage open shops and a set of n jobs, we aim to process all jobs on these open shops with… ▽ More Various new scheduling problems have been arising from practical production processes and spawning new research areas in the scheduling field. We study the parallel multi-stage open shops problem, which generalizes the classic open shop scheduling and parallel machine scheduling problems. Given m identical k-stage open shops and a set of n jobs, we aim to process all jobs on these open shops with the minimum makespan, i.e., the completion time of the last job, under the constraint that job preemption is not allowed. We present an efficient polynomial-time approximation scheme (EPTAS) for the case when both m and k are constant. The main idea for our EPTAS is the combination of several categorization, scaling, and linear programming rounding techniques. Jobs and/or operations are first scaled and then categorized carefully into multiple types so that different types of jobs and/or operations are scheduled appropriately without increasing the makespan too much. △ Less

Submitted 28 May, 2022; originally announced May 2022.

arXiv:2205.13425 [pdf, other]

Do we really need temporal convolutions in action segmentation?

Authors: Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

Abstract: Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models. Transformer-based mo… ▽ More Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models with adaptable and sequence modeling capabilities have recently been used in various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformer in action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating temporal sampling, called Temporal U-Transformer (TUT). The U-Transformer architecture reduces complexity while introducing an inductive bias that adjacent frames are more likely to belong to the same class, but the introduction of coarse resolutions results in the misclassification of boundaries. We observe that the similarity distribution between a boundary frame and its neighboring frames depends on whether the boundary frame is the start or end of an action segment. Therefore, we further propose a boundary-aware loss based on the distribution of similarity scores between frames from attention modules to enhance the ability to recognize boundaries. Extensive experiments show the effectiveness of our model. △ Less

Submitted 22 November, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2205.11100 [pdf, other]

Supporting Vision-Language Model Inference with Confounder-pruning Knowledge Prompt

Authors: Jiangmeng Li, Wenyi Mo, Wenwen Qiang, Bing Su, Changwen Zheng, Hui Xiong, Ji-Rong Wen

Abstract: Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. To boost the transferability of the pre-trained models, recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language describing task-relevant categories, to reduce the gap between tasks in the training and test phases. How… ▽ More Vision-language models are pre-trained by aligning image-text pairs in a common space to deal with open-set visual concepts. To boost the transferability of the pre-trained models, recent works adopt fixed or learnable prompts, i.e., classification weights are synthesized from natural language describing task-relevant categories, to reduce the gap between tasks in the training and test phases. However, how and what prompts can improve inference performance remains unclear. In this paper, we explicitly clarify the importance of including semantic information in prompts, while existing prompting methods generate prompts without exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To cope with this issue, we propose a semantic-aware prompt learning method, namely CPKP, which retrieves an ontological knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. The graph-tier confounders are gradually identified and phased out, inspired by the principle of Granger causality. The feature-tier confounders are demolished by following the maximum entropy principle in information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP, e.g., with two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average, and the superiority of CPKP in domain generalization compared to benchmark approaches. Our implementation is available at https://github.com/Mowenyii/CPKP. △ Less

Submitted 23 March, 2024; v1 submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.09669 [pdf, other]

Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency

Authors: Zihan Li, Wentao Chen, Zhiqing Wei, Xingqi Luo, Bing Su

Abstract: Supervised learning has been widely used for attack categorization, requiring high-quality data and labels. However, the data is often imbalanced and it is difficult to obtain sufficient annotations. Moreover, supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. To tackle the challenges, we propose a semi-supervised fine-grained attack… ▽ More Supervised learning has been widely used for attack categorization, requiring high-quality data and labels. However, the data is often imbalanced and it is difficult to obtain sufficient annotations. Moreover, supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. To tackle the challenges, we propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure and this framework can be generalized to different supervised models. The multilayer perceptron with residual connection is used as the encoder to extract features and reduce the complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the data imbalance problem, we introduce the Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights to classes with fewer samples in the loss function. In addition, to cope with new attacks in real-world deployment, we propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of unseen sample data and adapt the parameters of encoder. Experimental results show that our model outperforms the state-of-the-art semi-supervised attack detection methods with a 3% improvement in classification accuracy and a 90% reduction in training time. △ Less

Submitted 2 September, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: Tech report

arXiv:2203.05119 [pdf, other]

MetAug: Contrastive Learning via Meta Feature Augmentation

Authors: Jiangmeng Li, Wenwen Qiang, Changwen Zheng, Bing Su, Hui Xiong

Abstract: What matters for contrastive learning? We argue that contrastive learning heavily relies on informative features, or "hard" (positive or negative) features. Early works include more informative features by applying complex data augmentations and large batch size or memory bank, and recent works design elaborate sampling approaches to explore informative features. The key challenge toward exploring… ▽ More What matters for contrastive learning? We argue that contrastive learning heavily relies on informative features, or "hard" (positive or negative) features. Early works include more informative features by applying complex data augmentations and large batch size or memory bank, and recent works design elaborate sampling approaches to explore informative features. The key challenge toward exploring such features is that the source multi-view data is generated by applying random data augmentations, making it infeasible to always add useful information in the augmented data. Consequently, the informativeness of features learned from such augmented data is limited. In response, we propose to directly augment the features in latent space, thereby learning discriminative representations without a large amount of input data. We perform a meta learning technique to build the augmentation generator that updates its network parameters by considering the performance of the encoder. However, insufficient input data may lead the encoder to learn collapsed features and therefore malfunction the augmentation generator. A new margin-injected regularization is further added in the objective function to avoid the encoder learning a degenerate map**. To contrast all features in one gradient back-propagation step, we adopt the proposed optimization-driven unified contrastive loss instead of the conventional contrastive loss. Empirically, our method achieves state-of-the-art results on several benchmark datasets. △ Less

Submitted 9 August, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Accepted by ICML 2022

arXiv:2203.04951 [pdf, other]

Learning from Physical Human Feedback: An Object-Centric One-Shot Adaptation Method

Authors: Alvin Shek, Bo Ying Su, Rui Chen, Changliu Liu

Abstract: For robots to be effectively deployed in novel environments and tasks, they must be able to understand the feedback expressed by humans during intervention. This can either correct undesirable behavior or indicate additional preferences. Existing methods either require repeated episodes of interactions or assume prior known reward features, which is data-inefficient and can hardly transfer to new… ▽ More For robots to be effectively deployed in novel environments and tasks, they must be able to understand the feedback expressed by humans during intervention. This can either correct undesirable behavior or indicate additional preferences. Existing methods either require repeated episodes of interactions or assume prior known reward features, which is data-inefficient and can hardly transfer to new tasks. We relax these assumptions by describing human tasks in terms of object-centric sub-tasks and interpreting physical interventions in relation to specific objects. Our method, Object Preference Adaptation (OPA), is composed of two key stages: 1) pre-training a base policy to produce a wide variety of behaviors, and 2) online-updating according to human feedback. The key to our fast, yet simple adaptation is that general interaction dynamics between agents and objects are fixed, and only object-specific preferences are updated. Our adaptation occurs online, requires only one human intervention (one-shot), and produces new behaviors never seen during training. Trained on cheap synthetic data instead of expensive human demonstrations, our policy correctly adapts to human perturbations on realistic tasks on a physical 7DOF robot. Videos, code, and supplementary material are provided. △ Less

Submitted 2 June, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Accepted to ICRA 2023

arXiv:2203.04156 [pdf, other]

doi 10.1109/TKDE.2021.3112815

Robust Local Preserving and Global Aligning Network for Adversarial Domain Adaptation

Authors: Wenwen Qiang, Jiangmeng Li, Changwen Zheng, Bing Su, Hui Xiong

Abstract: Unsupervised domain adaptation (UDA) requires source domain samples with clean ground truth labels during training. Accurately labeling a large number of source domain samples is time-consuming and laborious. An alternative is to utilize samples with noisy labels for training. However, training with noisy labels can greatly reduce the performance of UDA. In this paper, we address the problem that… ▽ More Unsupervised domain adaptation (UDA) requires source domain samples with clean ground truth labels during training. Accurately labeling a large number of source domain samples is time-consuming and laborious. An alternative is to utilize samples with noisy labels for training. However, training with noisy labels can greatly reduce the performance of UDA. In this paper, we address the problem that learning UDA models only with access to noisy labels and propose a novel method called robust local preserving and global aligning network (RLPGA). RLPGA improves the robustness of the label noise from two aspects. One is learning a classifier by a robust informative-theoretic-based loss function. The other is constructing two adjacency weight matrices and two negative weight matrices by the proposed local preserving module to preserve the local topology structures of input data. We conduct theoretical analysis on the robustness of the proposed RLPGA and prove that the robust informative-theoretic-based loss and the local preserving module are beneficial to reduce the empirical risk of the target domain. A series of empirical studies show the effectiveness of our proposed RLPGA. △ Less

Submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE) 2022; Refer to https://ieeexplore.ieee.org/document/9540279

arXiv:2202.11356 [pdf, other]

Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting

Authors: Dazhao Du, Bing Su, Zhewei Wei

Abstract: Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since t… ▽ More Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods. △ Less

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2201.10471 [pdf, other]

GIU-GANs: Global Information Utilization for Generative Adversarial Networks

Authors: Yongqi Tian, Xueyuan Gong, Jialin Tang, Binghua Su, Xiaoxiang Liu, Xinyuan Zhang

Abstract: In recent years, with the rapid development of artificial intelligence, image generation based on deep learning has dramatically advanced. Image generation based on Generative Adversarial Networks (GANs) is a promising study. However, since convolutions are limited by spatial-agnostic and channel-specific, features extracted by traditional GANs based on convolution are constrained. Therefore, GANs… ▽ More In recent years, with the rapid development of artificial intelligence, image generation based on deep learning has dramatically advanced. Image generation based on Generative Adversarial Networks (GANs) is a promising study. However, since convolutions are limited by spatial-agnostic and channel-specific, features extracted by traditional GANs based on convolution are constrained. Therefore, GANs are unable to capture any more details per image. On the other hand, straightforwardly stacking of convolutions causes too many parameters and layers in GANs, which will lead to a high risk of overfitting. To overcome the aforementioned limitations, in this paper, we propose a new GANs called Involution Generative Adversarial Networks (GIU-GANs). GIU-GANs leverages a brand new module called the Global Information Utilization (GIU) module, which integrates Squeeze-and-Excitation Networks (SENet) and involution to focus on global information by channel attention mechanism, leading to a higher quality of generated images. Meanwhile, Batch Normalization(BN) inevitably ignores the representation differences among noise sampled by the generator, and thus degrade the generated image quality. Thus we introduce Representative Batch Normalization(RBN) to the GANs architecture for this issue. The CIFAR-10 and CelebA datasets are employed to demonstrate the effectiveness of our proposed model. A large number of experiments prove that our model achieves state-of-the-art competitive performance. △ Less

Submitted 15 March, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

arXiv:2201.06125 [pdf, other]

Temporal Relation Extraction with a Graph-Based Deep Biaffine Attention Model

Authors: Bo-Ying Su, Shang-Ling Hsu, Kuan-Yin Lai, Amarnath Gupta

Abstract: Temporal information extraction plays a critical role in natural language understanding. Previous systems have incorporated advanced neural language models and have successfully enhanced the accuracy of temporal information extraction tasks. However, these systems have two major shortcomings. First, they fail to make use of the two-sided nature of temporal relations in prediction. Second, they inv… ▽ More Temporal information extraction plays a critical role in natural language understanding. Previous systems have incorporated advanced neural language models and have successfully enhanced the accuracy of temporal information extraction tasks. However, these systems have two major shortcomings. First, they fail to make use of the two-sided nature of temporal relations in prediction. Second, they involve non-parallelizable pipelines in inference process that bring little performance gain. To this end, we propose a novel temporal information extraction model based on deep biaffine attention to extract temporal relationships between events in unstructured text efficiently and accurately. Our model is performant because we perform relation extraction tasks directly instead of considering event annotation as a prerequisite of relation extraction. Moreover, our architecture uses Multilayer Perceptrons (MLP) with biaffine attention to predict arcs and relation labels separately, improving relation detecting accuracy by exploiting the two-sided nature of temporal relationships. We experimentally demonstrate that our model achieves state-of-the-art performance in temporal relation extraction. △ Less

Submitted 16 January, 2022; originally announced January 2022.

arXiv:2112.00894 [pdf, other]

Context-Dependent Semantic Parsing for Temporal Relation Extraction

Authors: Bo-Ying Su, Shang-Ling Hsu, Kuan-Yin Lai, Jane Yung-jen Hsu

Abstract: Extracting temporal relations among events from unstructured text has extensive applications, such as temporal reasoning and question answering. While it is difficult, recent development of Neural-symbolic methods has shown promising results on solving similar tasks. Current temporal relation extraction methods usually suffer from limited expressivity and inconsistent relation inference. For examp… ▽ More Extracting temporal relations among events from unstructured text has extensive applications, such as temporal reasoning and question answering. While it is difficult, recent development of Neural-symbolic methods has shown promising results on solving similar tasks. Current temporal relation extraction methods usually suffer from limited expressivity and inconsistent relation inference. For example, in TimeML annotations, the concept of intersection is absent. Additionally, current methods do not guarantee the consistency among the predicted annotations. In this work, we propose SMARTER, a neural semantic parser, to extract temporal information in text effectively. SMARTER parses natural language to an executable logical form representation, based on a custom typed lambda calculus. In the training phase, dynamic programming on denotations (DPD) technique is used to provide weak supervision on logical forms. In the inference phase, SMARTER generates a temporal relation graph by executing the logical form. As a result, our neural semantic parser produces logical forms capturing the temporal information of text precisely. The accurate logical form representations of an event given the context ensure the correctness of the extracted relations. △ Less

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2110.09924 [pdf, ps, other]

Speech Enhancement Based on Cyclegan with Noise-informed Training

Authors: Wen-Yuan Ting, Syu-Siang Wang, Hsin-Li Chang, Borching Su, Yu Tsao

Abstract: Cycle-consistent generative adversarial networks (CycleGAN) were successfully applied to speech enhancement (SE) tasks with unpaired noisy-clean training data. The CycleGAN SE system adopted two generators and two discriminators trained with losses from noisy-to-clean and clean-to-noisy conversions. CycleGAN showed promising results for numerous SE tasks. Herein, we investigate a potential limitat… ▽ More Cycle-consistent generative adversarial networks (CycleGAN) were successfully applied to speech enhancement (SE) tasks with unpaired noisy-clean training data. The CycleGAN SE system adopted two generators and two discriminators trained with losses from noisy-to-clean and clean-to-noisy conversions. CycleGAN showed promising results for numerous SE tasks. Herein, we investigate a potential limitation of the clean-to-noisy conversion part and propose a novel noise-informed training (NIT) approach to improve the performance of the original CycleGAN SE system. The main idea of the NIT approach is to incorporate target domain information for clean-to-noisy conversion to facilitate a better training procedure. The experimental results confirmed that the proposed NIT approach improved the generalization capability of the original CycleGAN SE system with a notable margin. △ Less

Submitted 6 December, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Journal ref: ISCSLP 2022

arXiv:2109.02344

Information Theory-Guided Heuristic Progressive Multi-View Coding

Authors: Jiangmeng Li, Wenwen Qiang, Hang Gao, Bing Su, Farid Razzak, Jie Hu, Changwen Zheng, Hui Xiong

Abstract: Multi-view representation learning captures comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning (CL) to learn representations, regarded as a pairwise manner, which is still scalable: view-specific noise is not filtered in learning view-shared representations; the fake negative pairs, where the negative terms are actually within the… ▽ More Multi-view representation learning captures comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning (CL) to learn representations, regarded as a pairwise manner, which is still scalable: view-specific noise is not filtered in learning view-shared representations; the fake negative pairs, where the negative terms are actually within the same class as the positive, and the real negative pairs are coequally treated; and evenly measuring the similarities between terms might interfere with optimization. Importantly, few works research the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from the information theoretical perspective and then propose a novel information theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided heuristic Progressive Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC builds self-adjusted pools for contrasting, which utilizes a view filter to adaptively modify the pools. Lastly, in the instance-tier, we adopt a designed unified loss to learn discriminative representations and reduce the gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods. △ Less

Submitted 21 August, 2023; v1 submitted 6 September, 2021; originally announced September 2021.

Comments: We have uploaded a new version of this paper in arXiv:2308.10522, so that we have to withdrawal this paper

arXiv:2107.11943 [pdf, other]

Log-Polar Space Convolution for Convolutional Neural Networks

Authors: Bing Su, Ji-Rong Wen

Abstract: Convolutional neural networks use regular quadrilateral convolution kernels to extract features. Since the number of parameters increases quadratically with the size of the convolution kernel, many popular models use small convolution kernels, resulting in small local receptive fields in lower layers. This paper proposes a novel log-polar space convolution (LPSC) method, where the convolution kern… ▽ More Convolutional neural networks use regular quadrilateral convolution kernels to extract features. Since the number of parameters increases quadratically with the size of the convolution kernel, many popular models use small convolution kernels, resulting in small local receptive fields in lower layers. This paper proposes a novel log-polar space convolution (LPSC) method, where the convolution kernel is elliptical and adaptively divides its local receptive field into different regions according to the relative directions and logarithmic distances. The local receptive field grows exponentially with the number of distance levels. Therefore, the proposed LPSC not only naturally encodes local spatial structures, but also greatly increases the single-layer receptive field while maintaining the number of parameters. We show that LPSC can be implemented with conventional convolution via log-polar space pooling and can be applied in any network architecture to replace conventional convolutions. Experiments on different tasks and datasets demonstrate the effectiveness of the proposed LPSC. Code is available at https://github.com/BingSu12/Log-Polar-Space-Convolution. △ Less

Submitted 25 July, 2021; originally announced July 2021.

arXiv:2107.08892 [pdf, other]

Unsupervised Embedding Learning from Uncertainty Momentum Modeling

Authors: Jiahuan Zhou, Yansong Tang, Bing Su, Ying Wu

Abstract: Existing popular unsupervised embedding learning methods focus on enhancing the instance-level local discrimination of the given unlabeled images by exploring various negative data. However, the existed sample outliers which exhibit large intra-class divergences or small inter-class variations severely limit their learning performance. We justify that the performance limitation is caused by the gr… ▽ More Existing popular unsupervised embedding learning methods focus on enhancing the instance-level local discrimination of the given unlabeled images by exploring various negative data. However, the existed sample outliers which exhibit large intra-class divergences or small inter-class variations severely limit their learning performance. We justify that the performance limitation is caused by the gradient vanishing on these sample outliers. Moreover, the shortage of positive data and disregard for global discrimination consideration also pose critical issues for unsupervised learning but are always ignored by existing methods. To handle these issues, we propose a novel solution to explicitly model and directly explore the uncertainty of the given unlabeled learning samples. Instead of learning a deterministic feature point for each sample in the embedding space, we propose to represent a sample by a stochastic Gaussian with the mean vector depicting its space localization and covariance vector representing the sample uncertainty. We leverage such uncertainty modeling as momentum to the learning which is helpful to tackle the outliers. Furthermore, abundant positive candidates can be readily drawn from the learned instance-specific distributions which are further adopted to mitigate the aforementioned issues. Thorough rationale analyses and extensive experiments are presented to verify our superiority. △ Less

Submitted 19 July, 2021; originally announced July 2021.

Comments: 14 pages, in submission

arXiv:2106.02810 [pdf, other]

doi 10.21437/Interspeech.2021-1341

An Attribute-Aligned Strategy for Learning Speech Representation

Authors: Yu-Lin Huang, Bo-Hao Su, Y. -W. Peter Hong, Chi-Chun Lee

Abstract: Advancement in speech technology has brought convenience to our life. However, the concern is on the rise as speech signal contains multiple personal attributes, which would lead to either sensitive information leakage or bias toward decision. In this work, we propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selectio… ▽ More Advancement in speech technology has brought convenience to our life. However, the concern is on the rise as speech signal contains multiple personal attributes, which would lead to either sensitive information leakage or bias toward decision. In this work, we propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER), and an emotionless representation for speaker verification (SV). Our proposed method achieves competitive performances on identity-free SER and a better performance on emotionless SV, comparing to the current state-of-the-art method of using adversarial learning applied on a large emotion corpora, the MSP-Podcast. Also, our proposed learning strategy reduces the model and training process needed to achieve multiple privacy-preserving tasks. △ Less

Submitted 8 September, 2021; v1 submitted 5 June, 2021; originally announced June 2021.

Comments: 5 pages, 2 figures; Accepted in Interspeech 2021

Journal ref: Proceedings of INTERSPEECH 2021

arXiv:2104.04953 [pdf, other]

SIGAN: A Novel Image Generation Method for Solar Cell Defect Segmentation and Augmentation

Authors: Binyi Su, Zhong Zhou, Haiyong Chen, Xiaochun Cao

Abstract: Solar cell electroluminescence (EL) defect segmentation is an interesting and challenging topic. Many methods have been proposed for EL defect detection, but these methods are still unsatisfactory due to the diversity of the defect and background. In this paper, we provide a new idea of using generative adversarial network (GAN) for defect segmentation. Firstly, the GAN-based method removes the de… ▽ More Solar cell electroluminescence (EL) defect segmentation is an interesting and challenging topic. Many methods have been proposed for EL defect detection, but these methods are still unsatisfactory due to the diversity of the defect and background. In this paper, we provide a new idea of using generative adversarial network (GAN) for defect segmentation. Firstly, the GAN-based method removes the defect region in the input defective image to get a defect-free image, while kee** the background almost unchanged. Then, the subtracted image is obtained by making difference between the defective input image with the generated defect-free image. Finally, the defect region can be segmented through thresholding the subtracted image. To keep the background unchanged before and after image generation, we propose a novel strong identity GAN (SIGAN), which adopts a novel strong identity loss to constraint the background consistency. The SIGAN can be used not only for defect segmentation, but also small-samples defective dataset augmentation. Moreover, we release a new solar cell EL image dataset named as EL-2019, which includes three types of images: crack, finger interruption and defect-free. Experiments on EL-2019 dataset show that the proposed method achieves 90.34% F-score, which outperforms many state-of-the-art methods in terms of solar cell defects segmentation results. △ Less

Submitted 11 April, 2021; originally announced April 2021.

Comments: 11 pages, 11 figures

arXiv:2101.03355 [pdf, other]

doi 10.1109/TVT.2022.3155627

Downlink SCMA Codebook Design with Low Error Rate by Maximizing Minimum Euclidean Distance of Superimposed Codewords

Authors: Chinwei Huang, Borching Su, Tingyi Lin, Yenming Huang

Abstract: Sparse code multiple access (SCMA), as a codebook-based non-orthogonal multiple access (NOMA) technique, has received research attention in recent years. The codebook design problem for SCMA has also been studied to some extent since codebook choices are highly related to the system's error rate performance. In this paper, we approach the SCMA codebook design problem by formulating an optimization… ▽ More Sparse code multiple access (SCMA), as a codebook-based non-orthogonal multiple access (NOMA) technique, has received research attention in recent years. The codebook design problem for SCMA has also been studied to some extent since codebook choices are highly related to the system's error rate performance. In this paper, we approach the SCMA codebook design problem by formulating an optimization problem to maximize the minimum Euclidean distance (MED) of superimposed codewords under power constraints. While SCMA codebooks with a larger MED are expected to obtain a better BER performance, no optimal SCMA codebook in terms of MED maximization, to the authors' best knowledge, has been reported in the SCMA literature yet. In this paper, a new iterative algorithm based on alternating maximization with exact penalty is proposed for the MED maximization problem. The proposed algorithm, when supplied with appropriate initial points and parameters, achieves a set of codebooks of all users whose MED is larger than any previously reported results. A Lagrange dual problem is derived which provides an upper bound of MED of any set of codebooks. Even though there is still a nonzero gap between the achieved MED and the upper bound given by the dual problem, simulation results demonstrate clear advantages in error rate performances of the proposed set of codebooks over all existing ones not only in AWGN channels but also in some downlink scenarios that fit in 5G/NR applications, making it a good codebook candidate thereof. The proposed set of SCMA codebooks, however, are not shown to outperform existing ones in uplink channels or in the case where non-consecutive OFDMA subcarriers are used. The correctness and accuracy of error curves in the simulation results are further confirmed by the coincidences with the theoretical upper bounds of error rates derived for any given set of codebooks. △ Less

Submitted 1 May, 2022; v1 submitted 9 January, 2021; originally announced January 2021.

Comments: 15 pages, 12 figures. This version is accepted to IEEE Transactions on Vehicular Technology, and the copyright is transferred to IEEE

arXiv:2012.11174 [pdf, other]

Unsupervised Cross-Lingual Speech Emotion Recognition Using DomainAdversarial Neural Network

Authors: Xiong Cai, Zhiyong Wu, Kuo Zhong, Bin Su, Dongyang Dai, Helen Meng

Abstract: By using deep learning approaches, Speech Emotion Recog-nition (SER) on a single domain has achieved many excellentresults. However, cross-domain SER is still a challenging taskdue to the distribution shift between source and target domains.In this work, we propose a Domain Adversarial Neural Net-work (DANN) based approach to mitigate this distribution shiftproblem for cross-lingual SER. Specifica… ▽ More By using deep learning approaches, Speech Emotion Recog-nition (SER) on a single domain has achieved many excellentresults. However, cross-domain SER is still a challenging taskdue to the distribution shift between source and target domains.In this work, we propose a Domain Adversarial Neural Net-work (DANN) based approach to mitigate this distribution shiftproblem for cross-lingual SER. Specifically, we add a languageclassifier and gradient reversal layer after the feature extractor toforce the learned representation both language-independent andemotion-meaningful. Our method is unsupervised, i. e., labelson target language are not required, which makes it easier to ap-ply our method to other languages. Experimental results showthe proposed method provides an average absolute improve-ment of 3.91% over the baseline system for arousal and valenceclassification task. Furthermore, we find that batch normaliza-tion is beneficial to the performance gain of DANN. Thereforewe also explore the effect of different ways of data combinationfor batch normalization. △ Less

Submitted 21 December, 2020; originally announced December 2020.

Comments: This paper has been accepted by ISCSLP2021

ACM Class: I.2

Showing 1–50 of 67 results for author: Su, B