Efficient Spiking Neural Networks with Radix Encoding

Authors: Zhehui Wang, Xiaozhe Gu, Rick Goh, Joey Tianyi Zhou, Tao Luo

Abstract: Spiking neural networks (SNNs) have advantages in latency and energy efficiency over traditional artificial neural networks (ANNs) due to their event-driven computation mechanism and the replacement of energy-consuming weight multiplications with additions. However, to reach the accuracy of their ANN counterparts, SNNs usually require long spike trains. Traditionally, a spike train needs around one thousand time steps to approach the accuracy of its ANN counterpart. This offsets the computation efficiency brought by SNNs because longer spike trains mean a larger number of operations and longer latency. In this paper, we propose a radix encoded SNN with ultra-short spike trains. In the new model, the spike train takes less than ten time steps. Experiments show that our method demonstrates a 25X speedup and a 1.1% increase in accuracy, compared with the state-of-the-art work on the VGG-16 network architecture and the CIFAR-10 dataset.
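
The gap between roughly one thousand and fewer than ten time steps can be illustrated with a small sketch. It is not taken from the paper: it assumes radix 2 and an integer-quantised activation, and the names radix_encode and radix_decode are hypothetical. Under these assumptions a train of T steps distinguishes radix**T levels, whereas a rate-coded train that only counts spikes distinguishes T + 1 levels.

```python
def radix_encode(value, num_steps, radix=2):
    """Encode a non-negative integer as num_steps digits, most significant first.

    Assumption for illustration: each digit is the signal sent at one time step,
    and step t carries the positional weight radix ** (num_steps - 1 - t).
    """
    if not 0 <= value < radix ** num_steps:
        raise ValueError("value does not fit in the requested number of steps")
    digits = []
    for _ in range(num_steps):
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits[::-1]


def radix_decode(digits, radix=2):
    """Recover the integer by accumulating digits with positional weights."""
    total = 0
    for digit in digits:
        total = total * radix + digit
    return total


# Ten binary time steps are enough to represent any of 1024 activation levels.
train = radix_encode(1000, num_steps=10)
assert radix_decode(train) == 1000
```

With this positional weighting, ten steps cover 1024 levels, which is the same order of resolution a spike-counting train would need about a thousand steps to reach.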

Submitted 2 November, 2023; v1 submitted 14 May, 2021; originally announced May 2021.

arXiv:2105.06247 [pdf, other]

doi 10.1145/3404835.3462874

Video Corpus Moment Retrieval with Contrastive Learning

Authors: Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh

Abstract: Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. As video and text are from two distinct feature spaces, there are two general approaches to address VCMR: (i) to separately encode each modality's representation, then align the two modality representations for query processing, and (ii) to adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first approach is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives to refine the video encoder and text encoder so that they learn video and text representations separately but with better alignment for VCMR. The video contrastive learning (VideoCL) objective maximizes mutual information between the query and the candidate video at the video level. The frame contrastive learning (FrameCL) objective aims to highlight the moment region that corresponds to the query at the frame level, within a video. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.

Submitted 13 May, 2021; originally announced May 2021.

Comments: 11 pages, 7 figures and 6 tables. Accepted by SIGIR 2021

arXiv:2103.16074 [pdf, other]

PointBA: Towards Backdoor Attacks in 3D Point Cloud

Authors: Xinke Li, Zhirui Chen, Yue Zhao, Zekun Tong, Yabang Zhao, Andrew Lim, Joey Tianyi Zhou

Abstract: 3D deep learning has become increasingly popular for a variety of tasks, including many safety-critical applications. However, several recent works have raised security issues of 3D deep models. Although most of them consider adversarial attacks, we identify that the backdoor attack is indeed a more serious threat to 3D deep learning systems but remains unexplored. We present backdoor attacks in 3D point cloud with a unified framework that exploits the unique properties of 3D data and networks. In particular, we design two attack approaches on point cloud: the poison-label backdoor attack (PointPBA) and the clean-label backdoor attack (PointCBA). The first one is straightforward and effective in practice, while the latter is more sophisticated, assuming there are certain data inspections. The attack algorithms are mainly motivated and developed by 1) the recent discovery of 3D adversarial samples suggesting the vulnerability of deep models under spatial transformation; 2) the proposed feature disentanglement technique that manipulates the feature of the data through optimization methods and its potential to embed a new task. Extensive experiments show the efficacy of PointPBA, with over 95% success rate across various 3D datasets and models, and of the more stealthy PointCBA, with around 50% success rate. Our proposed backdoor attack in 3D point cloud is expected to serve as a baseline for improving the robustness of 3D deep models.

Submitted 22 August, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: Accepted by ICCV 2021

arXiv:2103.14493 [pdf, other]

RCT: Resource Constrained Training for Edge AI

Authors: Tian Huang, Tao Luo, Ming Yan, Joey Tianyi Zhou, Rick Goh

Abstract: Neural network training on edge terminals is essential for edge AI computing, which needs to be adaptive to an evolving environment. Quantised models can efficiently run on edge devices, but existing training methods for these compact models are designed to run on powerful servers with abundant memory and energy budget. For example, the quantisation-aware training (QAT) method involves two copies of model parameters, which is usually beyond the capacity of on-chip memory in edge devices. Data movement between off-chip and on-chip memory is energy demanding as well. The resource requirements are trivial for powerful servers, but critical for edge devices. To mitigate these issues, we propose Resource Constrained Training (RCT). RCT only keeps a quantised model throughout the training, so that the memory requirements for model parameters in training are reduced. It adjusts per-layer bitwidth dynamically in order to save energy when a model can learn effectively with lower precision. We carry out experiments with representative models and tasks in image applications and natural language processing. Experiments show that RCT saves more than 86% energy for General Matrix Multiply (GEMM) and saves more than 46% memory for model parameters, with limited accuracy loss. Compared with the QAT-based method, RCT saves about half of the energy spent on moving model parameters.

Submitted 26 March, 2021; originally announced March 2021.

Comments: 14 pages

MSC Class: 68T07 (Primary) 68T05 (Secondary) ACM Class: I.5.1; I.2.6

arXiv:2102.13558 [pdf, other]

doi 10.1109/TPAMI.2021.3060449

Natural Language Video Localization: A Revisit in Span-based Question Answering Framework

Authors: Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh

Abstract: Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address the NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; then, it predicts which clip segment contains the target moment and suppresses the importance of other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem.

Submitted 2 March, 2021; v1 submitted 26 February, 2021; originally announced February 2021.

Comments: 15 pages, 18 figures, and 10 tables. Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). arXiv admin note: substantial text overlap with arXiv:2004.13931

Report number: TPAMI-2020-09-1337.R1

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

arXiv:2102.02051 [pdf, other]

Trusted Multi-View Classification

Authors: Zongbo Han, Changqing Zhang, Huazhu Fu, Joey Tianyi Zhou

Abstract: Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The algorithm jointly utilizes multiple views to promote both classification reliability and robustness by integrating evidence from each view. To achieve this, the Dirichlet distribution is used to model the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness for out-of-distribution samples. Extensive experimental results validate the effectiveness of the proposed model in accuracy, reliability and robustness.
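
As an illustration of evidence-level fusion, the sketch below turns per-view evidence into a Dirichlet-based opinion (belief masses plus an uncertainty mass) and merges two opinions with a reduced Dempster-Shafer combination rule. The abstract does not give these formulas; the mapping from evidence to belief and the combination rule are assumptions drawn from the usual subjective-logic formulation, and the function names are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Assumed parameterisation: Dirichlet alpha_k = e_k + 1, strength S = sum(alpha),
    belief b_k = e_k / S and uncertainty u = K / S, so sum(b) + u == 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def combine(opinion_a, opinion_b):
    """Merge two opinions with a reduced Dempster-Shafer rule (assumed form)."""
    beliefs_a, u_a = opinion_a
    beliefs_b, u_b = opinion_b
    n = len(beliefs_a)
    # Conflict: mass the two views assign to different classes.
    conflict = sum(
        beliefs_a[i] * beliefs_b[j] for i in range(n) for j in range(n) if i != j
    )
    scale = 1.0 / (1.0 - conflict)
    beliefs = [
        scale * (beliefs_a[k] * beliefs_b[k] + beliefs_a[k] * u_b + beliefs_b[k] * u_a)
        for k in range(n)
    ]
    uncertainty = scale * u_a * u_b
    return beliefs, uncertainty


view_1 = opinion_from_evidence([9.0, 1.0, 0.0])
view_2 = opinion_from_evidence([8.0, 0.0, 2.0])
fused_beliefs, fused_uncertainty = combine(view_1, view_2)
```

In this sketch, two views that agree on a class yield a fused opinion with lower uncertainty than either view alone, while a view with little evidence (high uncertainty) contributes little to the result.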

Submitted 3 February, 2021; originally announced February 2021.

Comments: Accepted by ICLR 2021

arXiv:2101.01149 [pdf, ps, other]

Deep Learning for Latent Events Forecasting in Twitter Aided Caching Networks

Authors: Zhong Yang, Yuanwei Liu, Yue Chen, Joey Tianyi Zhou

Abstract: A novel Twitter context aided content caching (TAC) framework is proposed for enhancing the caching efficiency by taking advantage of the legibility and massive volume of Twitter data. For the purpose of promoting the caching efficiency, three machine learning models are proposed to predict latent events and event popularity, utilizing collected Twitter data with geo-tags and geographic information of the adjacent base stations (BSs). Firstly, we propose a latent Dirichlet allocation (LDA) model for latent events forecasting, taking advantage of the superiority of the LDA model in natural language processing (NLP). Then, we conceive a long short-term memory (LSTM) with skip-gram embedding approach and an LSTM with continuous skip-gram-Geo-aware embedding approach for event popularity forecasting. Lastly, we associate the predicted latent events and the popularity of the events with the caching strategy. Extensive practical experiments demonstrate that: (1) The proposed TAC framework outperforms the conventional caching framework and is capable of being employed in practical applications thanks to its ability to associate with public interests. (2) The proposed LDA approach conserves superiority for natural language processing (NLP) in Twitter data. (3) The perplexity of the proposed skip-gram-based LSTM is lower compared with the conventional LDA approach. (4) Evaluation of the model demonstrates that the hit rates of tweets of the model vary from 50% to 65% and the hit rate of the caching contents is up to approximately 75% with smaller caching space compared to conventional algorithms.

Submitted 4 January, 2021; originally announced January 2021.

Comments: 30 pages, 15 figures

arXiv:2012.12775 [pdf, other]

Adaptive Precision Training for Resource Constrained Devices

Authors: Tian Huang, Tao Luo, Joey Tianyi Zhou

Abstract: In-situ learning is a growing trend for Edge AI. Training a deep neural network (DNN) on edge devices is challenging because both energy and memory are constrained. Low precision training helps to reduce the energy cost of a single training iteration, but that does not necessarily translate to energy savings for the whole training process, because low precision could slow down the convergence rate. One piece of evidence is that most works on low precision training keep an fp32 copy of the model during training, which in turn imposes memory requirements on edge devices. In this work we propose Adaptive Precision Training (APT). It is able to save both total training energy cost and memory usage at the same time. We use a model of the same precision for both the forward and backward pass in order to reduce memory usage for training. Through evaluating the progress of training, APT allocates layer-wise precision dynamically so that the model learns quicker for a longer time. APT provides an application-specific hyper-parameter for users to trade off between training energy cost, memory usage and accuracy. Experiments show that APT achieves more than 50% saving on training energy and memory usage with limited accuracy loss. 20% more savings of training energy and memory usage can be achieved in return for a 1% sacrifice in accuracy.

Submitted 23 December, 2020; originally announced December 2020.

Comments: 6 pages

arXiv:2011.06170 [pdf, other]

Deep Partial Multi-View Learning

Authors: Changqing Zhang, Yajie Cui, Zongbo Han, Joey Tianyi Zhou, Huazhu Fu, Qinghua Hu

Abstract: Although multi-view learning has made significant progress over the past few decades, it is still challenging due to the difficulty in modeling complex correlations among different views, especially under the context of view missing. To address the challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets), which aims to fully and flexibly take advantage of multiple partial views. We first provide a formal definition of completeness and versatility for multi-view representation and then theoretically prove the versatility of the learned latent representations. For completeness, the task of learning latent multi-view representation is specifically translated to a degradation process by mimicking data transmission, such that the optimal tradeoff between consistency and complementarity across different views can be implicitly achieved. Equipped with adversarial strategy, our model stably imputes missing views, encoding information from all views for each sample to be encoded into latent representation to further enhance the completeness. Furthermore, a nonparametric classification loss is introduced to produce structured representations and prevent overfitting, which endows the algorithm with promising generalization under view-missing cases. Extensive experimental results validate the effectiveness of our algorithm over existing state of the arts for classification, representation learning and data imputation.

Submitted 11 November, 2020; originally announced November 2020.

arXiv:2010.11655 [pdf, other]

Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games

Authors: Yunqiu Xu, Meng Fang, Ling Chen, Yali Du, Joey Tianyi Zhou, Chengqi Zhang

Abstract: We study reinforcement learning (RL) for text-based games, which are interactive simulations in the context of natural language. While different methods have been developed to represent the environment information and language actions, existing RL agents are not empowered with any reasoning capabilities to deal with textual games. In this work, we aim to conduct explicit reasoning with knowledge graphs for decision making, so that the actions of an agent are generated and supported by an interpretable inference procedure. We propose a stacked hierarchical attention mechanism to construct an explicit representation of the reasoning process by exploiting the structure of the knowledge graph. We extensively evaluate our method on a number of man-made benchmark games, and the experimental results demonstrate that our method performs better than existing text-based agents.

Submitted 25 December, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: Accepted by NeurIPS 2020

arXiv:2009.11719 [pdf, other]

Deep Neural Networks with Short Circuits for Improved Gradient Learning

Authors: Ming Yan, Xueli Xiao, Joey Tianyi Zhou, Yi Pan

Abstract: Deep neural networks have achieved great success in both computer vision and natural language processing tasks. However, most state-of-the-art methods rely heavily on external training or computing to improve performance. To alleviate this external reliance, we propose a gradient enhancement approach, conducted by short circuit neural connections, to improve the gradient learning of deep neural networks. The proposed short circuit is a unidirectional connection that only back-propagates the sensitivity from the deep layers to the shallow ones. Moreover, the short circuit is formulated as a gradient truncation of the layers it crosses, which can be plugged into backbone deep neural networks without introducing external training parameters. Extensive experiments demonstrate that deep neural networks with our short circuit gain a large margin over the baselines on both computer vision and natural language processing tasks.

Submitted 23 September, 2020; originally announced September 2020.

arXiv:2009.10465 [pdf, other]

Deep N-ary Error Correcting Output Codes

Authors: Hao Zhang, Joey Tianyi Zhou, Tianying Wang, Ivor W. Tsang, Rick Siow Mong Goh

Abstract: Ensemble learning consistently improves the performance of multi-class classification through aggregating a series of base classifiers. To this end, data-independent ensemble methods like Error Correcting Output Codes (ECOC) attract increasing attention due to their ease of implementation and parallelization. Specifically, traditional ECOCs and their general extension, N-ary ECOC, decompose the original multi-class classification problem into a series of independent simpler classification subproblems. Unfortunately, integrating ECOCs, especially N-ary ECOC, with deep neural networks, termed deep N-ary ECOC, is not straightforward and not yet fully exploited in the literature, due to the high expense of training base learners. To facilitate the training of N-ary ECOC with deep learning base learners, we further propose three different variants of parameter sharing architectures for deep N-ary ECOC. To verify the generalization ability of deep N-ary ECOC, we conduct experiments by varying the backbone with different deep neural network architectures for both image and text classification tasks. Furthermore, extensive ablation studies on deep N-ary ECOC show its superior performance over other deep data-independent ensemble methods.

Submitted 14 December, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: EAI MOBIMEDIA 2020

arXiv:2009.09687 [pdf, other]

Contrastive Clustering

Authors: Yunfan Li, Peng Hu, Zitao Liu, Dezhong Peng, Joey Tianyi Zhou, Xi Peng

Abstract: In this paper, we propose a one-stage online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning. To be specific, for a given dataset, the positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, the instance- and cluster-level contrastive learning are respectively conducted in the row and column space by maximizing the similarities of positive pairs while minimizing those of negative ones. Our key observation is that the rows of the feature matrix could be regarded as soft labels of instances, and accordingly the columns could be further regarded as cluster representations. By simultaneously optimizing the instance- and cluster-level contrastive loss, the model jointly learns representations and cluster assignments in an end-to-end manner. Extensive experimental results show that CC remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks. In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.

Submitted 21 September, 2020; originally announced September 2020.

arXiv:2008.11401 [pdf, other]

Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition

Authors: Ping Liu, Yuewei Lin, Zibo Meng, Lu Lu, Weihong Deng, Joey Tianyi Zhou, Yi Yang

Abstract: In this paper, we propose a simple yet effective approach, named Point Adversarial Self Mining (PASM), to improve the recognition accuracy in facial expression recognition. Unlike previous works focusing on designing specific architectures or loss functions to solve this problem, PASM boosts the network capability by simulating human learning processes: providing updated learning materials and guidance from more capable teachers. Specifically, to generate new learning materials, PASM leverages a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task, generating harder learning samples to refine the network. The searched position is highly adaptive since it considers both the statistical information of each sample and the teacher network capability. Besides being provided with new learning materials, the student network also receives guidance from the teacher network. After the student network finishes training, it changes its role and acts as a teacher, generating new learning materials and providing stronger guidance to train a better student network. The adaptive learning materials generation and teacher/student update can be conducted more than once, improving the network capability iteratively. Extensive experimental results validate the efficacy of our method over the existing state of the art for facial expression recognition.

Submitted 8 May, 2021; v1 submitted 26 August, 2020; originally announced August 2020.

arXiv:2007.06878 [pdf, other]

Attentive Graph Neural Networks for Few-Shot Learning

Authors: Hao Cheng, Joey Tianyi Zhou, Wee Peng Tay, Bihan Wen

Abstract: Graph Neural Networks (GNNs) have demonstrated superior performance in many challenging applications, including few-shot learning tasks. Despite their powerful capacity to learn and generalize from few samples, GNNs usually suffer from severe over-fitting and over-smoothing as the model becomes deep, which limits scalability. In this work, we propose a novel Attentive GNN to tackle these challenges, by incorporating a triple-attention mechanism, i.e., node self-attention, neighborhood attention, and layer memory attention. We explain why the proposed attentive modules can improve GNN for few-shot learning with theoretical analysis and illustrations. Extensive experiments show that the proposed Attentive GNN model achieves promising results, compared to the state-of-the-art GNN- and CNN-based methods for few-shot learning tasks, over the mini-ImageNet and tiered-ImageNet benchmarks, under ConvNet-4 and ResNet-based backbones with both inductive and transductive settings. The codes will be made publicly available.

Submitted 2 October, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

arXiv:2007.05720 [pdf, other]

ECML: An Ensemble Cascade Metric Learning Mechanism towards Face Verification

Authors: Fu Xiong, Yang Xiao, Zhiguo Cao, Yancheng Wang, Joey Tianyi Zhou, Jianxi Wu

Abstract: Face verification can be regarded as a 2-class fine-grained visual recognition problem. Enhancing the feature's discriminative power is one of the key problems in improving its performance. Metric learning technology is often applied to address this need, while achieving a good tradeoff between underfitting and overfitting plays a vital role in metric learning. Hence, we propose a novel ensemble cascade metric learning (ECML) mechanism. In particular, hierarchical metric learning is executed in a cascade manner to alleviate underfitting. Meanwhile, at each learning level, the features are split into non-overlapping groups. Then, metric learning is executed among the feature groups in an ensemble manner to resist overfitting. Considering the feature distribution characteristics of faces, a robust Mahalanobis metric learning method (RMML) with a closed-form solution is additionally proposed. It can avoid the computation failure issue on the inverse matrix faced by some well-known metric learning approaches (e.g., KISSME). Embedding RMML into the proposed ECML mechanism, our metric learning paradigm (EC-RMML) can run in a one-pass learning manner. Experimental results demonstrate that EC-RMML is superior to state-of-the-art metric learning methods for face verification. The proposed ensemble cascade metric learning mechanism is also applicable to other metric learning approaches.

Submitted 11 July, 2020; originally announced July 2020.

Comments: Accepted to IEEE Transactions on Cybernetics

arXiv:2006.16829 [pdf, other]

You Only Look Yourself: Unsupervised and Untrained Single Image Dehazing Neural Network

Authors: Boyun Li, Yuanbiao Gou, Shuhang Gu, Jerry Zitao Liu, Joey Tianyi Zhou, Xi Peng

Abstract: In this paper, we study two challenging and less-touched problems in single image dehazing, namely, how to make deep learning achieve image dehazing without training on the ground-truth clean image (unsupervised) and an image collection (untrained). An unsupervised neural network avoids the labor-intensive collection of hazy-clean image pairs, and an untrained model is a "real" single image dehazing approach that can remove haze based only on the observed hazy image itself, with no extra images used. Motivated by the layer disentanglement idea, we propose a novel method, called you only look yourself (YOLY), which could be one of the first unsupervised and untrained neural networks for image dehazing. In brief, YOLY employs three joint subnetworks to separate the observed hazy image into several latent layers, i.e., a scene radiance layer, a transmission map layer, and an atmospheric light layer. After that, these three layers are further composed into the hazy image in a self-supervised manner. Thanks to the unsupervised and untrained characteristics of YOLY, our method bypasses the conventional training paradigm of deep models on hazy-clean pairs or a large-scale dataset, thus avoiding the labor-intensive data collection and the domain shift issue. Besides, our method also provides an effective learning-based haze transfer solution thanks to its layer disentanglement mechanism. Extensive experiments show the promising performance of our method in image dehazing compared with 14 methods on four databases.

Submitted 30 June, 2020; originally announced June 2020.

arXiv:2006.04588 [pdf, ps, other]

EDCompress: Energy-Aware Model Compression for Dataflows

Authors: Zhehui Wang, Tao Luo, Joey Tianyi Zhou, Rick Siow Mong Goh

Abstract: Edge devices demand low energy consumption, low cost, and a small form factor. To efficiently deploy convolutional neural network (CNN) models on edge devices, energy-aware model compression becomes extremely important. However, existing work has not studied this problem well because it does not consider the diversity of dataflow types in hardware architectures. In this paper, we propose EDCompress, an Energy-aware model compression method for various Dataflows. It can effectively reduce the energy consumption of various edge devices with different dataflow types. Considering the very nature of model compression procedures, we recast the optimization process as a multi-step problem and solve it with reinforcement learning algorithms. Experiments show that EDCompress could improve energy efficiency by 20X, 17X, and 37X on the VGG-16, MobileNet, and LeNet-5 networks, respectively, with negligible loss of accuracy. EDCompress could also find the optimal dataflow type for specific neural networks in terms of energy consumption, which can guide the deployment of CNN models on hardware systems.

Submitted 11 July, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

arXiv:2005.08551 [pdf, other]

Omni-supervised Facial Expression Recognition via Distilled Data

Authors: Ping Liu, Yunchao Wei, Zibo Meng, Weihong Deng, Joey Tianyi Zhou, Yi Yang

Abstract: Facial expression plays an important role in understanding human emotions. Most recently, deep learning based methods have shown promise for facial expression recognition. However, the performance of the current state-of-the-art facial expression recognition (FER) approaches is directly related to the labeled data for training. To solve this issue, prior works employ the pretrain-and-finetune strategy, i.e., utilize a large amount of unlabeled data to pretrain the network and then finetune it with the labeled data. As the labeled data is small in amount, the final network performance is still restricted. From a different perspective, we propose to perform omni-supervised learning to directly exploit reliable samples in a large amount of unlabeled data for network training. Particularly, a new dataset is first constructed using a primitive model trained on a small number of labeled samples to select samples with high confidence scores from a face dataset, i.e., MS-Celeb-1M, based on feature-wise similarity. We experimentally verify that the new dataset created in such an omni-supervised manner can significantly improve the generalization ability of the learned FER model. However, as the number of training samples grows, computational cost and training time increase dramatically. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images, significantly improving the training efficiency. We have conducted extensive experiments on widely used benchmarks, where consistent performance gains can be achieved under various settings using the proposed framework. More importantly, the distilled dataset has shown its capability of boosting the performance of FER with negligible additional computational costs.

Submitted 8 December, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

arXiv:2005.07902 [pdf, other]

The Power of Triply Complementary Priors for Image Compressive Sensing

Authors: Zhiyuan Zha, Xin Yuan, Joey Tianyi Zhou, Jiantao Zhou, Bihan Wen, Ce Zhu

Abstract: Recent works that utilized deep models have achieved superior results in various image restoration applications. Such approaches are typically supervised, requiring a corpus of training images with a distribution similar to the images to be recovered. On the other hand, shallow methods, which are usually unsupervised, retain promising performance in many inverse problems, e.g., image compressive sensing (CS), as they can effectively leverage non-local self-similarity priors of natural images. However, most such methods are patch-based, leading to restored images with various ringing artifacts due to naive patch aggregation. Using either approach alone usually limits performance and generalizability in image restoration tasks. In this paper, we propose a joint low-rank and deep (LRD) image model, which contains a pair of triply complementary priors, namely external and internal, deep and shallow, and local and non-local priors. We then propose a novel hybrid plug-and-play (H-PnP) framework based on the LRD model for image CS. To make the optimization tractable, a simple yet effective algorithm is proposed to solve the proposed H-PnP based image CS problem. Extensive experimental results demonstrate that the proposed H-PnP algorithm significantly outperforms the state-of-the-art techniques for image CS recovery such as SCSNet and WNNM.

Submitted 16 May, 2020; originally announced May 2020.

Journal ref: 2020 International Conference on Image Processing

arXiv:2005.05501 [pdf, other]

3DV: 3D Dynamic Voxel for Action Recognition in Depth Video

Authors: Yancheng Wang, Yang Xiao, Fu Xiong, Wenxiang Jiang, Zhiguo Cao, Joey Tianyi Zhou, Junsong Yuan

Abstract: To facilitate depth-based 3D action recognition, 3D dynamic voxel (3DV) is proposed as a novel 3D motion representation. With 3D space voxelization, the key idea of 3DV is to encode 3D motion information within depth video into a regular voxel set (i.e., 3DV) compactly, via temporal rank pooling. Each available 3DV voxel intrinsically involves 3D spatial and motion features jointly. 3DV is then abstracted as a point set and input into PointNet++ for 3D action recognition in an end-to-end manner. The intuition for transferring 3DV into the point set form is that PointNet++ is lightweight and effective for deep feature learning on point sets. Since 3DV may lose appearance cues, a multi-stream 3D action recognition approach is also proposed to learn motion and appearance features jointly. To extract richer temporal order information of actions, we also divide the depth video into temporal splits and encode this procedure in 3DV integrally. Extensive experiments on 4 well-established benchmark datasets demonstrate the superiority of our proposition. Impressively, we achieve accuracies of 82.4% and 93.5% on NTU RGB+D 120 [13] with the cross-subject and cross-setup test settings, respectively. 3DV's code is available at https://github.com/3huo/3DV-Action.

Submitted 11 May, 2020; originally announced May 2020.

Comments: Accepted by CVPR2020

arXiv:2004.14798 [pdf, other]

RAIN: A Simple Approach for Robust and Accurate Image Classification Networks

Authors: Jiawei Du, Hanshu Yan, Vincent Y. F. Tan, Joey Tianyi Zhou, Rick Siow Mong Goh, Jiashi Feng

Abstract: It has been shown that the majority of existing adversarial defense methods achieve robustness at the cost of sacrificing prediction accuracy. The undesirable severe drop in accuracy adversely affects the reliability of machine learning algorithms and prohibits their deployment in realistic applications. This paper aims to address this dilemma by proposing a novel preprocessing framework, which we term Robust and Accurate Image classificatioN (RAIN), to improve the robustness of given CNN classifiers and, at the same time, preserve their high prediction accuracies. RAIN introduces a new randomization-enhancement scheme. It applies randomization over inputs to break the ties between the model forward prediction path and the backward gradient path, thus improving the model robustness. However, similar to existing preprocessing-based methods, the randomized process will degrade the prediction accuracy. To understand why this is the case, we compare the original and processed images, and find that it is the loss of high-frequency components in the input image that leads to the accuracy drop of the classifier. Based on this finding, RAIN enhances the input's high-frequency details to retain the CNN's high prediction accuracy. Concretely, RAIN consists of two novel randomization modules: randomized small circular shift (RdmSCS) and randomized down-upsampling (RdmDU). The RdmDU module randomly downsamples the input image, and then the RdmSCS module circularly shifts the input image along a randomly chosen direction by a small but random number of pixels. Finally, the RdmDU module performs upsampling with a detail-enhancement model, such as deep super-resolution networks. We conduct extensive experiments on the STL10 and ImageNet datasets to verify the effectiveness of RAIN against various types of adversarial attacks.

Submitted 4 November, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

arXiv:2004.13931 [pdf, other]

Span-based Localizing Network for Natural Language Video Localization

Authors: Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou

Abstract: Given an untrimmed video and a text query, natural language video localization (NLVL) is to locate a matching span from the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task and apply a multimodal matching architecture, or as a regression task to directly regress the target video span. In this work, we address the NLVL task with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. The QGH guides VSLNet to search for the matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms the state-of-the-art methods, and that adopting a span-based QA framework is a promising direction to solve NLVL.

Submitted 14 June, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: To appear at ACL 2020

arXiv:2004.13303 [pdf, other]

Heterogeneous Representation Learning: A Review

Authors: Joey Tianyi Zhou, Xi Peng, Yew-Soon Ong

Abstract: Real-world data usually exhibits heterogeneous properties such as modalities, views, or resources, which brings some unique challenges wherein the key is Heterogeneous Representation Learning (HRL), as termed in this paper. This brief survey covers the topic of HRL, centered around several major learning settings and real-world applications. First of all, from the mathematical perspective, we present a unified learning framework which is able to model most existing learning settings with heterogeneous inputs. After that, we conduct a comprehensive discussion on the HRL framework by reviewing some selected learning problems along with their mathematical perspectives, including multi-view learning, heterogeneous transfer learning, learning using privileged information, and heterogeneous multi-task learning. For each learning task, we also discuss some applications under these learning problems and instantiate the terms in the mathematical framework. Finally, we highlight the challenges that are less-touched in HRL and present future research directions. To the best of our knowledge, there is no such framework to unify these heterogeneous problems, and this survey would benefit the community.

Submitted 30 April, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

arXiv:2004.01980 [pdf, other]

Hooks in the Headline: Learning to Generate Headlines with Controlled Styles

Authors: Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, Peter Szolovits

Abstract: Current summarization systems only produce plain, factual headlines, but do not meet the practical needs of creating memorable titles to increase exposure. We propose a new task, Stylistic Headline Generation (SHG), to enrich the headlines with three style options (humor, romance and clickbait), in order to attract more readers. With no style-specific article-headline pair (only a standard headline summarization dataset and mono-style corpora), our method TitleStylist generates style-specific headlines by combining the summarization and reconstruction tasks into a multitasking framework. We also introduced a novel parameter sharing scheme to further disentangle the style from the text. Through both automatic and human evaluation, we demonstrate that TitleStylist can generate relevant, fluent headlines with three target styles: humor, romance, and clickbait. The attraction score of our model generated headlines surpasses that of the state-of-the-art summarization model by 9.68%, and even outperforms human-written references.

Submitted 28 May, 2020; v1 submitted 4 April, 2020; originally announced April 2020.

Comments: ACL 2020

Report number: 12 pages

arXiv:2001.08140 [pdf, other]

A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation

Authors: Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits

Abstract: State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data. As data collection is expensive and infeasible in many cases, domain adaptation methods are needed. In this work, we propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT, where the aim is to improve the performance of a translation model on the target domain consisting of only non-parallel data with the help of supervised source domain data. This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation. We evaluate this method on two adaptation settings: adaptation between specific domains and adaptation from a general domain to specific domains, and on two language pairs: German to English and Romanian to English. With substantial performance improvement achieved (up to +19.31 BLEU over the strongest baseline, and +47.69 BLEU over the unadapted model), we present this method as a simple but tough-to-beat baseline in the field of semi-supervised domain adaptation for NMT.

Submitted 5 June, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: Under review

arXiv:1912.11236 [pdf, other]

Ordered or Orderless: A Revisit for Video based Person Re-Identification

Authors: Le Zhang, Zenglin Shi, Joey Tianyi Zhou, Ming-Ming Cheng, Yun Liu, Jia-Wang Bian, Zeng Zeng, Chunhua Shen

Abstract: Is a recurrent network really necessary for learning a good visual representation for video based person re-identification (VPRe-id)? In this paper, we first show that the common practice of employing recurrent neural networks (RNNs) to aggregate temporal spatial features may not be optimal. Specifically, with a diagnostic analysis, we show that the recurrent structure may be less effective at learning temporal dependencies than expected and implicitly yields an orderless representation. Based on this observation, we then present a simple yet surprisingly powerful approach for VPRe-id, where we treat VPRe-id as an efficient orderless ensemble of image based person re-identification problems. More specifically, we divide videos into individual images and re-identify persons with an ensemble of image based rankers. Under the i.i.d. assumption, we provide an error bound that sheds light upon how we could improve VPRe-id. Our work also presents a promising way to bridge the gap between video and image based person re-identification. Comprehensive experimental evaluations demonstrate that the proposed solution achieves state-of-the-art performance on multiple widely used datasets (iLIDS-VID, PRID 2011, and MARS).

Submitted 24 December, 2019; originally announced December 2019.

Comments: Under Minor Revision in IEEE TPAMI

arXiv:1911.06137 [pdf, other]

Unsupervised Domain Adaptation on Reading Comprehension

Authors: Yu Cao, Meng Fang, Baosheng Yu, Joey Tianyi Zhou

Abstract: Reading comprehension (RC) has been studied in a variety of datasets with the boosted performance brought by deep neural networks. However, the generalization capability of these models across different domains remains unclear. To alleviate this issue, we investigate unsupervised domain adaptation on RC, wherein a model is trained on a labeled source domain and then applied to a target domain with only unlabeled samples. We first show that even with the powerful BERT contextual representation, the performance is still unsatisfactory when the model trained on one dataset is directly applied to another target dataset. To solve this, we provide a novel conditional adversarial self-training method (CASe). Specifically, our approach leverages a BERT model fine-tuned on the source dataset along with confidence filtering to generate reliable pseudo-labeled samples in the target domain for self-training. On the other hand, it further reduces domain distribution discrepancy through conditional adversarial learning across domains. Extensive experiments show our approach achieves comparable accuracy to supervised models on multiple large-scale benchmark datasets.

Submitted 26 July, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: 8 pages, 6 figures, 5 tables, Accepted by AAAI 2020

arXiv:1909.06940 [pdf, other]

Multi-graph Fusion for Multi-view Spectral Clustering

Authors: Zhao Kang, Guoxin Shi, Shudong Huang, Wenyu Chen, Xiaorong Pu, Joey Tianyi Zhou, Zenglin Xu

Abstract: A panoply of multi-view clustering algorithms has been developed to deal with prevalent multi-view data. Among them, spectral clustering-based methods have drawn much attention and demonstrated promising results recently. Despite progress, there are still two fundamental questions that stay unanswered to date. First, how to fuse different views into one graph. More often than not, the similarities between samples may be manifested differently by different views. Many existing algorithms either simply take the average of multiple views or just learn a common graph. These simple approaches fail to consider the flexible local manifold structures of all views. Hence, the rich heterogeneous information is not fully exploited. Second, how to learn the explicit cluster structure. Most existing methods don't pay attention to the quality of the graphs and perform graph learning and spectral clustering separately. Those unreliable graphs might lead to suboptimal clustering results. To fill these gaps, in this paper, we propose a novel multi-view spectral clustering model which performs graph fusion and spectral clustering simultaneously. The fusion graph approximates the original graph of each individual view but maintains an explicit cluster structure. Experiments on four widely used data sets confirm the superiority of the proposed method.

Submitted 15 September, 2019; originally announced September 2019.

Comments: submitted to Knowledge-based Systems

arXiv:1908.09999 [pdf, other]

A2J: Anchor-to-Joint Regression Network for 3D Articulated Pose Estimation from a Single Depth Image

Authors: Fu Xiong, Boshen Zhang, Yang Xiao, Zhiguo Cao, Taidong Yu, Joey Tianyi Zhou, Junsong Yuan

Abstract: For the task of 3D hand and body pose estimation from a depth image, a novel anchor-based approach termed Anchor-to-Joint regression network (A2J) with end-to-end learning ability is proposed. Within A2J, anchor points able to capture global-local spatial context information are densely set on the depth image as local regressors for the joints. They contribute to predicting the positions of the joints in an ensemble way to enhance generalization ability. The proposed 3D articulated pose estimation paradigm is different from the state-of-the-art encoder-decoder based FCN, 3D CNN, and point-set based approaches. To discover informative anchor points for a certain joint, an anchor proposal procedure is also proposed for A2J. Meanwhile, a 2D CNN (i.e., ResNet-50) is used as the backbone network to drive A2J, without using time-consuming 3D convolutional or deconvolutional layers. The experiments on 3 hand datasets and 2 body datasets verify A2J's superiority. Meanwhile, A2J runs at a high speed of around 100 FPS on a single NVIDIA 1080Ti GPU.

Submitted 26 August, 2019; originally announced August 2019.

Comments: Accepted by ICCV2019

arXiv:1908.09066 [pdf, other]

Robust Regression via Deep Negative Correlation Learning

Authors: Le Zhang, Zenglin Shi, Ming-Ming Cheng, Yun Liu, Jia-Wang Bian, Joey Tianyi Zhou, Guoyan Zheng, Zeng Zeng

Abstract: Nonlinear regression has been extensively employed in many computer vision problems (e.g., crowd counting, age estimation, affective computing). Under the umbrella of deep learning, two common solutions exist: i) transforming nonlinear regression to a robust loss function which is jointly optimizable with the deep convolutional network, and ii) utilizing an ensemble of deep networks. Although some improved performance is achieved, the former may be lacking due to the intrinsic limitation of choosing a single hypothesis, and the latter usually suffers from much larger computational complexity. To cope with those issues, we propose to regress in an efficient "divide and conquer" manner. The core of our approach is the generalization of negative correlation learning that has been shown, both theoretically and empirically, to work well for non-deep regression problems. Without extra parameters, the proposed method controls the bias-variance-covariance trade-off systematically and usually yields a deep regression ensemble where each base model is both "accurate" and "diversified". Moreover, we show that each sub-problem in the proposed method has lower Rademacher complexity and thus is easier to optimize. Extensive experiments on several diverse and challenging tasks including crowd counting, personality analysis, age estimation, and image super-resolution demonstrate the superiority over challenging baselines as well as the versatility of the proposed method.

Submitted 23 August, 2019; originally announced August 2019.

arXiv:1907.11932 [pdf, other]

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

Authors: Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits

Abstract: Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate natural adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate the advantages of this framework in three ways: (1) effective: it outperforms state-of-the-art attacks in terms of success rate and perturbation rate, (2) utility-preserving: it preserves semantic content and grammaticality, and remains correctly classified by humans, and (3) efficient: it generates adversarial text with computational complexity linear to the text length. The code, pre-trained target models, and test examples are available at https://github.com/jind11/TextFooler.

Submitted 8 April, 2020; v1 submitted 27 July, 2019; originally announced July 2019.

Comments: AAAI 2020 (Oral)

arXiv:1906.02398 [pdf, other]

Query-efficient Meta Attack to Deep Neural Networks

Authors: Jiawei Du, Hu Zhang, Joey Tianyi Zhou, Yi Yang, Jiashi Feng

Abstract: Black-box attack methods aim to infer suitable attack patterns for targeted DNN models using only the output feedback of the models and the corresponding input queries. However, due to the lack of a prior and the inefficient use of query and feedback information, existing methods are mostly query-intensive when obtaining effective attack patterns. In this work, we propose a meta attack approach that is capable of attacking a targeted model with far fewer queries. Its high query efficiency stems from the effective use of meta learning to learn a generalizable prior abstraction from previously observed attack patterns and to exploit this prior to help infer attack patterns from only a few queries and outputs. Extensive experiments on MNIST, CIFAR10 and tiny-Imagenet demonstrate that our meta-attack method can remarkably reduce the number of model queries without sacrificing attack performance. Moreover, the obtained meta attacker is not restricted to a particular model and can be adapted quickly to attack a variety of models. The code of our work is available at https://github.com/dydjw9/MetaAttack_ICLR2020/.

Submitted 14 February, 2020; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1902.07891 [pdf, other]

doi 10.1109/TIFS.2019.29599778

Towards Real-time Eyeblink Detection in The Wild: Dataset, Theory and Practices

Authors: Guilei Hu, Yang Xiao, Zhiguo Cao, Lubin Meng, Zhiwen Fang, Joey Tianyi Zhou, Junsong Yuan

Abstract: Effective, real-time eyeblink detection has a wide range of applications, such as deception detection, driver fatigue detection, and face anti-spoofing. Although numerous efforts have already been made, most of them address eyeblink detection under constrained indoor conditions with a relatively consistent subject and environment setup. For practical applications, however, eyeblink detection in the wild is more in demand and more challenging, and to our knowledge it has not been well studied before. In this paper, we shed light on this research topic. We first establish a labelled eyeblink in the wild dataset (i.e., HUST-LEBW) of 673 eyeblink video samples (i.e., 381 positives and 292 negatives). These samples are captured from unconstrained movies, with dramatic variation in human attributes, human pose, illumination conditions, imaging configuration, etc. Then, we formulate the eyeblink detection task as a spatial-temporal pattern recognition problem. After locating and tracking the human eye using the SeetaFace engine and the KCF tracker, respectively, a modified LSTM model able to capture multi-scale temporal information is proposed to perform eyeblink verification. A feature extraction approach that reveals appearance and motion characteristics simultaneously is also proposed. The experiments on HUST-LEBW reveal the superiority and efficiency of our approach. They also verify that existing eyeblink detection methods cannot achieve satisfactory performance in the wild.

Submitted 18 December, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

Journal ref: IEEE Transactions on Information Forensics and Security 2019

arXiv:1808.07292 [pdf, other]

XAI Beyond Classification: Interpretable Neural Clustering

Authors: Xi Peng, Yunnan Li, Ivor W. Tsang, Hongyuan Zhu, Jiancheng Lv, Joey Tianyi Zhou

Abstract: In this paper, we study two challenging problems in explainable AI (XAI) and data clustering. The first is how to directly design a neural network with inherent interpretability, rather than giving post-hoc explanations of a black-box model. The second is implementing discrete $k$-means with a differentiable neural network that embraces the advantages of parallel computing, online clustering, and clustering-favorable representation learning. To address these two challenges, we design a novel neural network, which is a differentiable reformulation of the vanilla $k$-means, called inTerpretable nEuraL cLustering (TELL). Our contributions are threefold. First, to the best of our knowledge, most existing XAI works focus on supervised learning paradigms. This work is one of the few XAI studies on unsupervised learning, in particular, data clustering. Second, TELL is an interpretable model, that is, an intrinsically explainable and transparent one. In contrast, most existing XAI studies resort to various means of understanding a black-box model with post-hoc explanations. Third, from the view of data clustering, TELL possesses many properties highly desired for $k$-means, including but not limited to online clustering, plug-and-play modularity, parallel computing, and provable convergence. Extensive experiments show that our method achieves superior performance compared with 14 clustering approaches on three challenging data sets. The source code can be accessed at www.pengxi.me.

Submitted 22 April, 2022; v1 submitted 22 August, 2018; originally announced August 2018.

Comments: 28 pages

Journal ref: Journal of Machine Learning Research, 2022

arXiv:1807.11042 [pdf, other]

Towards Good Practices on Building Effective CNN Baseline Model for Person Re-identification

Authors: Fu Xiong, Yang Xiao, Zhiguo Cao, Kaicheng Gong, Zhiwen Fang, Joey Tianyi Zhou

Abstract: Person re-identification is a challenging visual recognition task due to the critical issues of human pose variation, human body occlusion, camera view variation, etc. To address this, most state-of-the-art approaches are based on deep convolutional neural networks (CNNs), leveraging their strong feature learning power and classification boundary fitting capacity. Despite its vital role in person re-identification, how to build an effective CNN baseline model has not yet been well studied. To answer this open question, we propose 3 good practices in this paper from the perspectives of adjusting the CNN architecture and the training procedure. In particular, they are adding batch normalization after the global pooling layer, executing identity categorization directly using only one fully-connected layer, and using Adam as the optimizer. Extensive experiments on 3 widely-used benchmark datasets demonstrate that our propositions enable the CNN baseline model to achieve state-of-the-art performance without any other high-level domain knowledge or low-level technical trick.

Submitted 29 July, 2018; originally announced July 2018.

arXiv:1806.11269 [pdf, other]

Action Recognition for Depth Video using Multi-view Dynamic Images

Authors: Yang Xiao, Jun Chen, Yancheng Wang, Zhiguo Cao, Joey Tianyi Zhou, Xiang Bai

Abstract: Dynamic imaging is a recently proposed action description paradigm for simultaneously capturing motion and temporal evolution information, particularly in the context of deep convolutional neural networks (CNNs). Compared with optical flow for motion characterization, dynamic imaging exhibits superior efficiency and compactness. Inspired by the success of dynamic imaging in RGB video, this study extends it to the depth domain. To better exploit three-dimensional (3D) characteristics, multi-view dynamic images are proposed. In particular, the raw depth video is densely projected with respect to different virtual imaging viewpoints by rotating the virtual camera within the 3D space. Subsequently, dynamic images are extracted from the obtained multi-view depth videos and multi-view dynamic images are thus constructed from these images. Accordingly, more view-tolerant visual cues can be involved. A novel CNN model is then proposed to perform feature learning on multi-view dynamic images. Particularly, the dynamic images from different views share the same convolutional layers but correspond to different fully connected layers. This is aimed at enhancing the tuning effectiveness on shallow convolutional layers by alleviating the gradient vanishing problem. Moreover, as the spatial occurrence variation of the actions may impair the CNN, an action proposal approach is also put forth. In experiments, the proposed approach can achieve state-of-the-art performance on three challenging datasets.

Submitted 27 December, 2018; v1 submitted 29 June, 2018; originally announced June 2018.

Comments: accepted by Information Sciences

arXiv:1702.08681 [pdf, other]

MIML-FCN+: Multi-instance Multi-label Learning via Fully Convolutional Networks with Privileged Information

Authors: Hao Yang, Joey Tianyi Zhou, Jianfei Cai, Yew Soon Ong

Abstract: Multi-instance multi-label (MIML) learning has many interesting applications in computer vision, including multi-object recognition and automatic image tagging. In these applications, additional information such as bounding boxes, image captions and descriptions is often available during the training phase; this is referred to as privileged information (PI). However, as existing works on learning using PI only consider instance-level PI (privileged instances), they fail to make use of the bag-level PI (privileged bags) available in MIML learning. Therefore, in this paper, we propose a two-stream fully convolutional network, named MIML-FCN+, unified by a novel PI loss to solve the problem of MIML learning with privileged bags. Compared to previous works on PI, the proposed MIML-FCN+ utilizes the readily available privileged bags instead of hard-to-obtain privileged instances, making the system more general and practical in real-world applications. As the proposed PI loss is convex and SGD-compatible and the framework itself is a fully convolutional network, MIML-FCN+ can be easily integrated with state-of-the-art deep learning networks. Moreover, the flexibility of convolutional layers allows us to exploit structured correlations among instances to facilitate more effective training and testing. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed MIML-FCN+, which outperforms state-of-the-art methods in the application of multi-object recognition.

Submitted 28 February, 2017; originally announced February 2017.

Comments: Accepted in CVPR 2017

arXiv:1608.01441 [pdf, other]

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations

Authors: Hao Yang, Joey Tianyi Zhou, Jianfei Cai

Abstract: Multi-label learning has attracted significant interest in computer vision recently, finding applications in many vision tasks such as multiple object recognition and automatic image annotation. Associating multiple labels with a complex image is very difficult, not only due to the intricacy of describing the image, but also because of the incomplete nature of the observed labels. Existing works on the problem either ignore the label-label and instance-instance correlations or just assume these correlations are linear and unstructured. Considering that semantic correlations between images are actually structured, in this paper we propose to incorporate structured semantic correlations to solve the missing label problem of multi-label learning. Specifically, we project images to the semantic space with an effective semantic descriptor. A semantic graph is then constructed on these images to capture the structured correlations between them. We utilize the semantic graph Laplacian as a smooth term in the multi-label learning formulation to incorporate the structured semantic correlations. Experimental results demonstrate the effectiveness of the proposed semantic descriptor and the usefulness of incorporating the structured semantic correlations. We achieve better results than state-of-the-art multi-label learning methods on four benchmark datasets.

Submitted 4 August, 2016; originally announced August 2016.

Comments: Accepted in ECCV 2016

arXiv:1605.04034 [pdf, other]

Transfer Hashing with Privileged Information

Authors: Joey Tianyi Zhou, Xinxing Xu, Sinno Jialin Pan, Ivor W. Tsang, Zheng Qin, Rick Siow Mong Goh

Abstract: Most existing learning to hash methods assume that there are sufficient data, either labeled or unlabeled, on the domain of interest (i.e., the target domain) for training. However, this assumption cannot be satisfied in some real-world applications. To address this data sparsity issue in hashing, inspired by transfer learning, we propose a new framework named Transfer Hashing with Privileged Information (THPI). Specifically, we extend the standard learning to hash method, Iterative Quantization (ITQ), in a transfer learning manner, namely ITQ+. In ITQ+, a new slack function is learned from auxiliary data to approximate the quantization error in ITQ. We develop an alternating optimization approach to solve the resultant optimization problem for ITQ+. We further extend ITQ+ to LapITQ+ by utilizing the geometric structure among the auxiliary data to learn more precise binary codes in the target domain. Extensive experiments on several benchmark datasets verify the effectiveness of our proposed approaches through comparisons with several state-of-the-art baselines.

Submitted 12 May, 2016; originally announced May 2016.

Comments: Accepted by IJCAI-2016

arXiv:1604.01518 [pdf, ps, other]

Simple and Efficient Learning using Privileged Information

Authors: Xinxing Xu, Joey Tianyi Zhou, Ivor W. Tsang, Zheng Qin, Rick Siow Mong Goh, Yong Liu

Abstract: The Support Vector Machine using Privileged Information (SVM+) has been proposed to train a classifier that utilizes additional privileged information available only in the training phase and not in the test phase. In this work, we propose an efficient solution for SVM+ by simply utilizing the squared hinge loss instead of the hinge loss used in the existing SVM+ formulation, which interestingly leads to a dual form with fewer variables and of the same form as the dual of the standard SVM. The proposed algorithm is used to leverage additional web knowledge that is only available during training for image categorization tasks. Extensive experimental results on both the Caltech101 and WebQueries datasets show that our proposed method can achieve up to a hundredfold speedup with comparable accuracy when compared with the existing SVM+ method.

Submitted 6 April, 2016; originally announced April 2016.

arXiv:1603.05850 [pdf, other]

N-ary Error Correcting Coding Scheme

Authors: Joey Tianyi Zhou, Ivor W. Tsang, Shen-Shyang Ho, Klaus-Robert Muller

Abstract: The coding matrix design plays a fundamental role in the prediction performance of the error correcting output codes (ECOC)-based multi-class task. In many-class classification problems, e.g., fine-grained categorization, it is difficult to distinguish subtle between-class differences under existing coding schemes due to a limited choice of coding values. In this paper, we investigate whether one can relax existing binary and ternary code design to $N$-ary code design to achieve better classification performance. In particular, we present a novel $N$-ary coding scheme that decomposes the original multi-class problem into simpler multi-class subproblems, which is similar to applying a divide-and-conquer method. The two main advantages of such a coding scheme are as follows: (i) the ability to construct more discriminative codes and (ii) the flexibility for the user to select the best $N$ for ECOC-based classification. We show empirically that the optimal $N$ (based on classification performance) lies in $[3, 10]$ with some trade-off in computational cost. Moreover, we provide theoretical insights on the dependency of the generalization error bound of an $N$-ary ECOC on the average base classifier generalization error and the minimum distance between any two codes constructed. Extensive experimental results on benchmark multi-class datasets show that the proposed coding scheme achieves superior prediction performance over the state-of-the-art coding methods.

Submitted 18 March, 2016; originally announced March 2016.

Comments: Under submission to IEEE Transactions on Information Theory

arXiv:1507.01101 [pdf, other]

Utility Optimal Thread Assignment and Resource Allocation in Multi-Server Systems

Authors: Pan Lai, Rui Fan, Xiao Zhang, Wei Zhang, Fang Liu, Joey Tianyi Zhou

Abstract: Achieving high performance in many multi-server systems requires finding a good assignment of worker threads to servers and also effectively allocating each server's resources to its assigned threads. The assignment and allocation components of this problem have been studied extensively but largely separately in the literature. In this paper, we introduce the assign and allocate (AA) problem, which seeks to simultaneously find an assignment and allocation that maximizes the total utility of the threads. Assigning and allocating the threads together can result in substantially better overall utility than performing the steps separately, as is traditionally done. We model each thread by a utility function giving its performance as a function of its assigned resources. We first prove that the AA problem is NP-hard. We then present a $2 (\sqrt{2}-1) > 0.828$ factor approximation algorithm for concave utility functions, which runs in $O(mn^2 + n (\log mC)^2)$ time for $n$ threads and $m$ servers with $C$ amount of resources each. We also give a faster algorithm with the same approximation ratio and $O(n (\log mC)^2)$ time complexity. We then extend the problem to two more general settings. First, we consider threads with nonconcave utility functions, and give a 1/2 factor approximation algorithm. Next, we give an algorithm for threads using multiple types of resources, and show that the algorithm achieves good empirical performance. We conduct extensive experiments to test the performance of our algorithms on threads with both synthetic and realistic utility functions, and find that they achieve over 92% of the optimal utility on average. We also compare our algorithms with a number of practical heuristics, and find that our algorithms achieve up to 9 times higher total utility.

Submitted 9 June, 2021; v1 submitted 4 July, 2015; originally announced July 2015.

Comments: 17 pages

ACM Class: C.1.4; D.4.2; F.2.1

arXiv:1504.05843 [pdf, other]

Exploit Bounding Box Annotations for Multi-label Object Recognition

Authors: Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, Jianfei Cai

Abstract: Convolutional neural networks (CNNs) have shown great performance as general feature representations for object recognition applications. However, for multi-label images that contain multiple objects from different categories, scales and locations, global CNN features are not optimal. In this paper, we incorporate local information to enhance the discriminative power of the features. In particular, we first extract object proposals from each image. With each image treated as a bag and the object proposals extracted from it treated as instances, we transform the multi-label recognition problem into a multi-class multi-instance learning problem. Then, in addition to extracting the typical CNN feature representation from each proposal, we propose to make use of ground-truth bounding box annotations (strong labels) to add another level of local information by using nearest-neighbor relationships of local regions to form a multi-view pipeline. The proposed multi-view multi-instance framework utilizes both weak and strong labels effectively, and more importantly, it generalizes well enough to boost the performance on unseen categories using partial strong labels from other categories. Our framework is extensively compared with state-of-the-art hand-crafted feature based methods and CNN based methods on two multi-label benchmark datasets. The experimental results validate the discriminative power and the generalization ability of the proposed framework. With strong labels, our framework is able to achieve state-of-the-art results on both datasets.

Submitted 3 June, 2016; v1 submitted 22 April, 2015; originally announced April 2015.

Comments: Accepted in CVPR 2016

Showing 51–94 of 94 results for author: Zhou, J T