-
Visualizing and Understanding Patch Interactions in Vision Transformer
Authors:
Jie Ma,
Yalong Bai,
Bineng Zhong,
Wei Zhang,
Ting Yao,
Tao Mei
Abstract:
Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite having good success, the literature seldom explores the explainability of vision transformer, and there is no clear picture of how the attention mechanism with respect to…
▽ More
Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite having good success, the literature seldom explores the explainability of vision transformer, and there is no clear picture of how the attention mechanism with respect to the correlation across comprehensive patches will impact the performance and what is the further potential. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer. Specifically, we first introduce a quantification indicator to measure the impact of patch interaction and verify such quantification on attention window design and indiscriminative patches removal. Then, we exploit the effective responsive field of each patch in ViT and devise a window-free transformer architecture accordingly. Extensive experiments on ImageNet demonstrate that the exquisitely designed quantitative method is shown able to facilitate ViT model learning, leading the top-1 accuracy by 4.28% at most. Moreover, the results on downstream fine-grained recognition tasks further validate the generalization of our proposal.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Holistic Fine-grained GGS Characterization: From Detection to Unbalanced Classification
Authors:
Yuzhe Lu,
Haichun Yang,
Zuhayr Asad,
Zheyu Zhu,
Tianyuan Yao,
Jiachen Xu,
Agnes B. Fogo,
Yuankai Huo
Abstract:
Recent studies have demonstrated the diagnostic and prognostic values of global glomerulosclerosis (GGS) in IgA nephropathy, aging, and end-stage renal disease. However, the fine-grained quantitative analysis of multiple GGS subtypes (e.g., obsolescent, solidified, and disappearing glomerulosclerosis) is typically a resource extensive manual process. Very few automatic methods, if any, have been d…
▽ More
Recent studies have demonstrated the diagnostic and prognostic values of global glomerulosclerosis (GGS) in IgA nephropathy, aging, and end-stage renal disease. However, the fine-grained quantitative analysis of multiple GGS subtypes (e.g., obsolescent, solidified, and disappearing glomerulosclerosis) is typically a resource extensive manual process. Very few automatic methods, if any, have been developed to bridge this gap for such analytics. In this paper, we present a holistic pipeline to quantify GGS (with both detection and classification) from a whole slide image in a fully automatic manner. In addition, we conduct the fine-grained classification for the sub-types of GGS. Our study releases the open-source quantitative analytical tool for fine-grained GGS characterization while tackling the technical challenges in unbalanced classification and integrating detection and classification.
△ Less
Submitted 31 January, 2022;
originally announced February 2022.
-
Probabilistic Learning of Treatment Trees in Cancer
Authors:
Tsung-Hung Yao,
Zhenke Wu,
Karthik Bharath,
**ju Li,
Veerabhadran Baladandayuthapan
Abstract:
Accurate identification of synergistic treatment combinations and their underlying biological mechanisms is critical across many disease domains, especially cancer. In translational oncology research, preclinical systems such as patient-derived xenografts (PDX) have emerged as a unique study design evaluating multiple treatments administered to samples from the same human tumor implanted into gene…
▽ More
Accurate identification of synergistic treatment combinations and their underlying biological mechanisms is critical across many disease domains, especially cancer. In translational oncology research, preclinical systems such as patient-derived xenografts (PDX) have emerged as a unique study design evaluating multiple treatments administered to samples from the same human tumor implanted into genetically identical mice. In this paper, we propose a novel Bayesian probabilistic tree-based framework for PDX data to investigate the hierarchical relationships between treatments by inferring treatment cluster trees, referred to as treatment trees (Rx-tree). The framework motivates a new metric of mechanistic similarity between two or more treatments accounting for inherent uncertainty in tree estimation; treatments with a high estimated similarity have potentially high mechanistic synergy. Building upon Dirichlet Diffusion Trees, we derive a closed-form marginal likelihood encoding the tree structure, which facilitates computationally efficient posterior inference via a new two-stage algorithm. Simulation studies demonstrate superior performance of the proposed method in recovering the tree structure and treatment similarities. Our analyses of a recently collated PDX dataset produce treatment similarity estimates that show a high degree of concordance with known biological mechanisms across treatments in five different cancers. More importantly, we uncover new and potentially effective combination therapies that confer synergistic regulation of specific downstream biological pathways for future clinical investigations. Our accompanying code, data, and shiny application for visualization of results are available at: https://github.com/bayesrx/RxTree.
△ Less
Submitted 23 January, 2022;
originally announced January 2022.
-
Cross $t$-intersecting families for symplectic polar spaces
Authors:
Tian Yao,
Kaishun Wang
Abstract:
Let $\mathscr{P}$ be a symplectic polar space over a finite field $\mathbb{F}_q$, and $\mathscr{P}_m$ denote the collection of all $k$-dimensional totally isotropic subspace in $\mathscr{P}$. Let $\mathscr{F}_1\subset\mathscr{P}_{m_1}$ and $\mathscr{F}_2\subset\mathscr{P}_{m_2}$ satisfy $\dim(F_1\cap F_2)\ge t$ for any $F_1\in\mathscr{F}_1$ and $F_2\in\mathscr{F}_2$. We say they are cross $t$-inte…
▽ More
Let $\mathscr{P}$ be a symplectic polar space over a finite field $\mathbb{F}_q$, and $\mathscr{P}_m$ denote the collection of all $k$-dimensional totally isotropic subspace in $\mathscr{P}$. Let $\mathscr{F}_1\subset\mathscr{P}_{m_1}$ and $\mathscr{F}_2\subset\mathscr{P}_{m_2}$ satisfy $\dim(F_1\cap F_2)\ge t$ for any $F_1\in\mathscr{F}_1$ and $F_2\in\mathscr{F}_2$. We say they are cross $t$-intersecting families. Moreover, we say they are trivial if each member of them contains a fixed $t$-dimensional totally isotropic subspace. In this paper, we show that cross $t$-intersecting families with maximum product of sizes are trivial. We also describe the structure of non-trivial $t$-intersecting families with maximum product of sizes.
△ Less
Submitted 24 February, 2022; v1 submitted 21 January, 2022;
originally announced January 2022.
-
Cross $t$-intersecting families for finite affine spaces
Authors:
Tian Yao,
Kaishun Wang
Abstract:
Denote the collection of all $k$-flats in $AG(n,\mathbb{F}_q)$ by $\mathscr{M}(k,n)$. Let $\mathscr{F}_1\subset\mathscr{M}(k_1,n)$ and $\mathscr{F}_2\subset\mathscr{M}(k_2,n)$ satisfy $\dim(F_1\cap F_2)\ge t$ for any $F_1\in\mathscr{F}_1$ and $F_2\in\mathscr{F}_2$. We say they are cross $t$-intersecting families. Moreover, we say they are trivial if each member of them contains a fixed $t$-flats i…
▽ More
Denote the collection of all $k$-flats in $AG(n,\mathbb{F}_q)$ by $\mathscr{M}(k,n)$. Let $\mathscr{F}_1\subset\mathscr{M}(k_1,n)$ and $\mathscr{F}_2\subset\mathscr{M}(k_2,n)$ satisfy $\dim(F_1\cap F_2)\ge t$ for any $F_1\in\mathscr{F}_1$ and $F_2\in\mathscr{F}_2$. We say they are cross $t$-intersecting families. Moreover, we say they are trivial if each member of them contains a fixed $t$-flats in $AG(n,\mathbb{F}_q)$. In this paper, we show that cross $t$-intersecting families with maximum product of sizes are trivial. We also describe the structure of non-trivial $t$-intersecting families with maximum product of sizes.
△ Less
Submitted 15 February, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Motion-Focused Contrastive Learning of Video Representations
Authors:
Rui Li,
Yiheng Zhang,
Zhaofan Qiu,
Ting Yao,
Dong Liu,
Tao Mei
Abstract:
Motion, as the most distinct phenomenon in a video to involve the changes over time, has been unique and critical to the development of video representation learning. In this paper, we ask the question: how important is the motion particularly for self-supervised video representation learning. To this end, we compose a duet of exploiting the motion for data augmentation and feature learning in the…
▽ More
Motion, as the most distinct phenomenon in a video to involve the changes over time, has been unique and critical to the development of video representation learning. In this paper, we ask the question: how important is the motion particularly for self-supervised video representation learning. To this end, we compose a duet of exploiting the motion for data augmentation and feature learning in the regime of contrastive learning. Specifically, we present a Motion-focused Contrastive Learning (MCL) method that regards such duet as the foundation. On one hand, MCL capitalizes on optical flow of each frame in a video to temporally and spatially sample the tubelets (i.e., sequences of associated frame patches across time) as data augmentations. On the other hand, MCL further aligns gradient maps of the convolutional layers to optical flow maps from spatial, temporal and spatio-temporal perspectives, in order to ground motion information in feature learning. Extensive experiments conducted on R(2+1)D backbone demonstrate the effectiveness of our MCL. On UCF101, the linear classifier trained on the representations learnt by MCL achieves 81.91% top-1 accuracy, outperforming ImageNet supervised pre-training by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear protocol. Code is available at https://github.com/YihengZhang-CV/MCL-Motion-Focused-Contrastive-Learning.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Representing Videos as Discriminative Sub-graphs for Action Recognition
Authors:
Dong Li,
Zhaofan Qiu,
Yingwei Pan,
Ting Yao,
Houqiang Li,
Tao Mei
Abstract:
Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactions and recognize the actions. In this paper, we introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the vi…
▽ More
Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactions and recognize the actions. In this paper, we introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos. Specifically, we present MUlti-scale Sub-graph LEarning (MUSLE) framework that novelly builds space-time graphs and clusters the graphs into compact sub-graphs on each scale with respect to the number of nodes. Technically, MUSLE produces 3D bounding boxes, i.e., tubelets, in each video clip, as graph nodes and takes dense connectivity as graph edges between tubelets. For each action category, we execute online clustering to decompose the graph into sub-graphs on each scale through learning Gaussian Mixture Layer and select the discriminative sub-graphs as action prototypes for recognition. Extensive experiments are conducted on both Something-Something V1 & V2 and Kinetics-400 datasets, and superior results are reported when comparing to state-of-the-art methods. More remarkably, our MUSLE achieves to-date the best reported accuracy of 65.0% on Something-Something V2 validation set.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Authors:
Yehao Li,
Jiahao Fan,
Yingwei Pan,
Ting Yao,
Weiyao Lin,
Tao Mei
Abstract:
Vision-language pre-training has been an emerging and fast-develo** research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., vis…
▽ More
Vision-language pre-training has been an emerging and fast-develo** research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer based structure, consisting of three modules: object and sentence encoders that separately learns the representations of each modality, and sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy including, from simple to comprehensive, individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Smart Director: An Event-Driven Directing System for Live Broadcasting
Authors:
Yingwei Pan,
Yue Chen,
Qian Bao,
Ning Zhang,
Ting Yao,
**gen Liu,
Tao Mei
Abstract:
Live video broadcasting normally requires a multitude of skills and expertise with domain knowledge to enable multi-camera productions. As the number of cameras keep increasing, directing a live sports broadcast has now become more complicated and challenging than ever before. The broadcast directors need to be much more concentrated, responsive, and knowledgeable, during the production. To reliev…
▽ More
Live video broadcasting normally requires a multitude of skills and expertise with domain knowledge to enable multi-camera productions. As the number of cameras keep increasing, directing a live sports broadcast has now become more complicated and challenging than ever before. The broadcast directors need to be much more concentrated, responsive, and knowledgeable, during the production. To relieve the directors from their intensive efforts, we develop an innovative automated sports broadcast directing system, called Smart Director, which aims at mimicking the typical human-in-the-loop broadcasting process to automatically create near-professional broadcasting programs in real-time by using a set of advanced multi-view video analysis algorithms. Inspired by the so-called "three-event" construction of sports broadcast, we build our system with an event-driven pipeline consisting of three consecutive novel components: 1) the Multi-view Event Localization to detect events by modeling multi-view correlations, 2) the Multi-view Highlight Detection to rank camera views by the visual importance for view selection, 3) the Auto-Broadcasting Scheduler to control the production of broadcasting videos. To our best knowledge, our system is the first end-to-end automated directing system for multi-camera sports broadcasting, completely driven by the semantic understanding of sports events. It is also the first system to solve the novel problem of multi-view joint event detection by cross-view relation modeling. We conduct both objective and subjective evaluations on a real-world multi-camera soccer dataset, which demonstrate the quality of our auto-generated videos is comparable to that of the human-directed. Thanks to its faster response, our system is able to capture more fast-passing and short-duration events which are usually missed by human directors.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Boosting Video Representation Learning with Multi-Faceted Integration
Authors:
Zhaofan Qiu,
Ting Yao,
Chong-Wah Ngo,
Xiao-** Zhang,
Dong Wu,
Tao Mei
Abstract:
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpf…
▽ More
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps video representation into a rich semantic embedding space, and jointly optimizes video representation from two perspectives. One is to capitalize on the intra-facet supervision between each video and its own label descriptions, and the second predicts the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The pre-learnt 3D CNN with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Condensing a Sequence to One Informative Frame for Video Recognition
Authors:
Zhaofan Qiu,
Ting Yao,
Yan Shu,
Chong-Wah Ngo,
Tao Mei
Abstract:
Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition system on the synthetic frame. A…
▽ More
Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition system on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, motion estimation, and two regularizers, i.e., adversarial learning, color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. With these, by jointly learning the frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the required spatio-temporal information useful for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. When comparing to baseline methods that map video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves comparable performance with the state-of-the-art methods with less computational cost.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Optimization Planning for 3D ConvNets
Authors:
Zhaofan Qiu,
Ting Yao,
Chong-Wah Ngo,
Tao Mei
Abstract:
It is not trivial to optimally learn a 3D Convolutional Neural Networks (3D ConvNets) due to high complexity and various options of the training scheme. The most common hand-tuning process starts from learning 3D ConvNets using short video clips and then is followed by learning long-term temporal dependency using lengthy clips, while gradually decaying the learning rate from high to low as trainin…
▽ More
It is not trivial to optimally learn a 3D Convolutional Neural Networks (3D ConvNets) due to high complexity and various options of the training scheme. The most common hand-tuning process starts from learning 3D ConvNets using short video clips and then is followed by learning long-term temporal dependency using lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such process comes along with several heuristic settings motivates the study to seek an optimal "path" to automate the entire training. In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path. Furthermore, we devise a new 3D ConvNets with a unique design of dual-head classifier to improve spatial and temporal discrimination. Extensive experiments on seven public video recognition benchmarks demonstrate the advantages of our proposal. With the optimization planning, our 3D ConvNets achieves superior results when comparing to the state-of-the-art recognition methods. More remarkably, we obtain the top-1 accuracy of 80.5% and 82.7% on Kinetics-400 and Kinetics-600 datasets, respectively. Source code is available at https://github.com/ZhaofanQiu/Optimization-Planning-for-3D-ConvNets.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Exploiting Fine-grained Face Forgery Clues via Progressive Enhancement Learning
Authors:
Qiqi Gu,
Shen Chen,
Tai** Yao,
Yang Chen,
Shouhong Ding,
Ran Yi
Abstract:
With the rapid development of facial forgery techniques, forgery detection has attracted more and more attention due to security concerns. Existing approaches attempt to use frequency information to mine subtle artifacts under high-quality forged faces. However, the exploitation of frequency information is coarse-grained, and more importantly, their vanilla learning process struggles to extract fi…
▽ More
With the rapid development of facial forgery techniques, forgery detection has attracted more and more attention due to security concerns. Existing approaches attempt to use frequency information to mine subtle artifacts under high-quality forged faces. However, the exploitation of frequency information is coarse-grained, and more importantly, their vanilla learning process struggles to extract fine-grained forgery traces. To address this issue, we propose a progressive enhancement learning framework to exploit both the RGB and fine-grained frequency clues. Specifically, we perform a fine-grained decomposition of RGB images to completely decouple the real and fake traces in the frequency space. Subsequently, we propose a progressive enhancement learning framework based on a two-branch network, combined with self-enhancement and mutual-enhancement modules. The self-enhancement module captures the traces in different input spaces based on spatial noise enhancement and channel attention. The Mutual-enhancement module concurrently enhances RGB and frequency features by communicating in the shared spatial dimension. The progressive enhancement process facilitates the learning of discriminative features with fine-grained face forgery clues. Extensive experiments on several datasets show that our method outperforms the state-of-the-art face forgery detection methods.
△ Less
Submitted 27 December, 2021;
originally announced December 2021.
-
Responsive Listening Head Generation: A Benchmark Dataset and Baseline
Authors:
Mohan Zhou,
Yalong Bai,
Wei Zhang,
Ting Yao,
Tiejun Zhao,
Tao Mei
Abstract:
We present a new listening head generation benchmark, for synthesizing responsive feedbacks of a listener (e.g., nod, smile) during a face-to-face conversation. As the indispensable complement to talking heads generation, listening head generation has seldomly been studied in literature. Automatically synthesizing listening behavior that actively responds to a talking head, is critical to applicat…
▽ More
We present a new listening head generation benchmark, for synthesizing responsive feedbacks of a listener (e.g., nod, smile) during a face-to-face conversation. As the indispensable complement to talking heads generation, listening head generation has seldomly been studied in literature. Automatically synthesizing listening behavior that actively responds to a talking head, is critical to applications such as digital human, virtual agents and social robots. In this work, we propose a novel dataset "ViCo", highlighting the listening head generation during a face-to-face conversation. A total number of 92 identities (67 speakers and 76 listeners) are involved in ViCo, featuring 483 clips in a paired "speaking-listening" pattern, where listeners show three listening styles based on their attitudes: positive, neutral, negative. Different from traditional speech-to-gesture or talking-head generation, listening head generation takes as input both the audio and visual signals from the speaker, and gives non-verbal feedbacks (e.g., head motions, facial expressions) in a real-time manner. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline, conditioning on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.
△ Less
Submitted 20 July, 2022; v1 submitted 27 December, 2021;
originally announced December 2021.
-
Dual Contrastive Learning for General Face Forgery Detection
Authors:
Ke Sun,
Tai** Yao,
Shen Chen,
Shouhong Ding,
Jilin L,
Rongrong Ji
Abstract:
With various facial manipulation techniques arising, face forgery detection has drawn growing attention due to security concerns. Previous works always formulate face forgery detection as a classification problem based on cross-entropy loss, which emphasizes category-level differences rather than the essential discrepancies between real and fake faces, limiting model generalization in unseen domai…
▽ More
With various facial manipulation techniques arising, face forgery detection has drawn growing attention due to security concerns. Previous works always formulate face forgery detection as a classification problem based on cross-entropy loss, which emphasizes category-level differences rather than the essential discrepancies between real and fake faces, limiting model generalization in unseen domains. To address this issue, we propose a novel face forgery detection framework, named Dual Contrastive Learning (DCL), which specially constructs positive and negative paired data and performs designed contrastive learning at different granularities to learn generalized feature representation. Concretely, combined with the hard sample selection strategy, Inter-Instance Contrastive Learning (Inter-ICL) is first proposed to promote task-related discriminative features learning by especially constructing instance pairs. Moreover, to further explore the essential discrepancies, Intra-Instance Contrastive Learning (Intra-ICL) is introduced to focus on the local content inconsistencies prevalent in the forged faces by constructing local-region pairs inside instances. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against the state-of-the-art competitors.
△ Less
Submitted 27 December, 2021;
originally announced December 2021.
-
Reciprocal Normalization for Domain Adaptation
Authors:
Zhiyong Huang,
Kekai Sheng,
Ke Li,
Jian Liang,
Tai** Yao,
Weiming Dong,
Dengwen Zhou,
Xing Sun
Abstract:
Batch normalization (BN) is widely used in modern deep neural networks, which has been shown to represent the domain-related knowledge, and thus is ineffective for cross-domain tasks like unsupervised domain adaptation (UDA). Existing BN variant methods aggregate source and target domain knowledge in the same channel in normalization module. However, the misalignment between the features of corres…
▽ More
Batch normalization (BN) is widely used in modern deep neural networks, which has been shown to represent the domain-related knowledge, and thus is ineffective for cross-domain tasks like unsupervised domain adaptation (UDA). Existing BN variant methods aggregate source and target domain knowledge in the same channel in normalization module. However, the misalignment between the features of corresponding channels across domains often leads to a sub-optimal transferability. In this paper, we exploit the cross-domain relation and propose a novel normalization method, Reciprocal Normalization (RN). Specifically, RN first presents a Reciprocal Compensation (RC) module to acquire the compensatory for each channel in both domains based on the cross-domain channel-wise correlation. Then RN develops a Reciprocal Aggregation (RA) module to adaptively aggregate the feature with its cross-domain compensatory components. As an alternative to BN, RN is more suitable for UDA problems and can be easily integrated into popular domain adaptation methods. Experiments show that the proposed RN outperforms existing normalization counterparts by a large margin and helps state-of-the-art adaptation approaches achieve better results. The source code is available on https://github.com/Openning07/reciprocal-normalization-for-DA.
△ Less
Submitted 20 December, 2021;
originally announced December 2021.
-
Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration
Authors:
Yu Wang,
**gyang Lin,
**g**g Zou,
Yingwei Pan,
Ting Yao,
Tao Mei
Abstract:
Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods. Whereas self-supervised learning frameworks usually take the prevailing perfect instance level invariance hypothesis for granted, we carefully investigate the pitfalls behind. Particularly, we argue that the existing augmentation pipeline for generating multiple positive views naturally introduce…
▽ More
Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods. Whereas self-supervised learning frameworks usually take the prevailing perfect instance level invariance hypothesis for granted, we carefully investigate the pitfalls behind. Particularly, we argue that the existing augmentation pipeline for generating multiple positive views naturally introduces out-of-distribution (OOD) samples that undermine the learning of the downstream tasks. Generating diverse positive augmentations on the input does not always pay off in benefiting downstream tasks. To overcome this inherent deficiency, we introduce a lightweight latent variable model UOTA, targeting the view sampling issue for self-supervised learning. UOTA adaptively searches for the most important sampling region to produce views, and provides viable choice for outlier-robust self-supervised learning approaches. Our method directly generalizes to many mainstream self-supervised learning approaches, regardless of the loss's nature contrastive or not. We empirically show UOTA's advantage over the state-of-the-art self-supervised paradigms with evident margin, which well justifies the existence of the OOD sample issue embedded in the existing approaches. Especially, we theoretically prove that the merits of the proposal boil down to guaranteed estimator variance and bias reduction. Code is available: at https://github.com/ssl-codelab/uota.
△ Less
Submitted 15 December, 2021;
originally announced December 2021.
-
A Style and Semantic Memory Mechanism for Domain Generalization
Authors:
Yang Chen,
Yu Wang,
Yingwei Pan,
Ting Yao,
Xinmei Tian,
Tao Mei
Abstract:
Mainstream state-of-the-art domain generalization algorithms tend to prioritize the assumption on semantic invariance across domains. Meanwhile, the inherent intra-domain style invariance is usually underappreciated and put on the shelf. In this paper, we reveal that leveraging intra-domain style invariance is also of pivotal importance in improving the efficiency of domain generalization. We veri…
▽ More
Mainstream state-of-the-art domain generalization algorithms tend to prioritize the assumption on semantic invariance across domains. Meanwhile, the inherent intra-domain style invariance is usually underappreciated and put on the shelf. In this paper, we reveal that leveraging intra-domain style invariance is also of pivotal importance in improving the efficiency of domain generalization. We verify that it is critical for the network to be informative on what domain features are invariant and shared among instances, so that the network sharpens its understanding and improves its semantic discriminative ability. Correspondingly, we also propose a novel "jury" mechanism, which is particularly effective in learning useful semantic feature commonalities among domains. Our complete model called STEAM can be interpreted as a novel probabilistic graphical model, for which the implementation requires convenient constructions of two kinds of memory banks: semantic feature bank and style feature bank. Empirical results show that our proposed framework surpasses the state-of-the-art methods by clear margins.
△ Less
Submitted 14 December, 2021;
originally announced December 2021.
-
Transferrable Contrastive Learning for Visual Domain Adaptation
Authors:
Yang Chen,
Yingwei Pan,
Yu Wang,
Ting Yao,
Xinmei Tian,
Tao Mei
Abstract:
Self-supervised learning (SSL) has recently become the favorite among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes somehow invariant across domains. However, most existing SSL methods in the regime of domain adaptation usually a…
▽ More
Self-supervised learning (SSL) has recently become the favorite among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes somehow invariant across domains. However, most existing SSL methods in the regime of domain adaptation usually are treated as standalone auxiliary components, leaving the signatures of domain adaptation unattended. Actually, the optimal region where the domain gap vanishes and the instance level constraint that SSL peruses may not coincide at all. From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive Learning (TCL), which links the SSL and the desired cross-domain transferability congruently. We find contrastive learning intrinsically a suitable candidate for domain adaptation, as its instance invariance assumption can be conveniently promoted to cross-domain class-level invariance favored by domain adaptation tasks. Based on particular memory bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class domain discrepancy between source and target through a clean and novel contrastive loss. The free lunch is, thanks to the incorporation of contrastive learning, TCL relies on a moving-averaged key encoder that naturally achieves a temporally ensembled version of pseudo labels for target data, which avoids pseudo label error propagation at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet) for both single-source and multi-source domain adaptation tasks, TCL has demonstrated state-of-the-art performances.
△ Less
Submitted 14 December, 2021;
originally announced December 2021.
-
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Authors:
Jianjie Luo,
Yehao Li,
Yingwei Pan,
Ting Yao,
Hongyang Chao,
Tao Mei
Abstract:
BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that…
▽ More
BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs would inevitably introduce noise for cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
△ Less
Submitted 14 December, 2021;
originally announced December 2021.
-
CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning
Authors:
**gyang Lin,
Yingwei Pan,
Rongfeng Lai,
Xuehang Yang,
Hongyang Chao,
Ting Yao
Abstract:
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem that only localizes the fragments of text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text probl…
▽ More
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem that only localizes the fragments of text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module, to mitigate that issue. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. Such way naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text. Code is available: \url{https://github.com/jylins/CORE-Text}.
△ Less
Submitted 14 December, 2021;
originally announced December 2021.
-
A Riemannian Inexact Newton Dogleg Method for Constructing a Symmetric Nonnegative Matrix with Prescribed Spectrum
Authors:
Zhi Zhao,
Teng-Teng Yao,
Zheng-Jian Bai,
Xiao-Qing **
Abstract:
This paper is concerned with the inverse problem of constructing a symmetric nonnegative matrix from realizable spectrum. We reformulate the inverse problem as an underdetermined nonlinear matrix equation over a Riemannian product manifold. To solve it, we develop a Riemannian underdetermined inexact Newton dogleg method for solving a general underdetermined nonlinear equation defined between Riem…
▽ More
This paper is concerned with the inverse problem of constructing a symmetric nonnegative matrix from realizable spectrum. We reformulate the inverse problem as an underdetermined nonlinear matrix equation over a Riemannian product manifold. To solve it, we develop a Riemannian underdetermined inexact Newton dogleg method for solving a general underdetermined nonlinear equation defined between Riemannian manifolds and Euclidean spaces. The global and quadratic convergence of the proposed method is established under some mild assumptions. Then we solve the inverse problem by applying the proposed method to its equivalent nonlinear matrix equation and a preconditioner for the perturbed normal Riemannian Newton equation is also constructed. Numerical tests show the efficiency of the proposed method for solving the inverse problem.
△ Less
Submitted 29 October, 2021;
originally announced October 2021.
-
Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles
Authors:
Tianyi Yao,
Minjie Wang,
Genevera I. Allen
Abstract:
Gaussian graphical models provide a powerful framework for uncovering conditional dependence relationships between sets of nodes; they have found applications in a wide variety of fields including sensor and communication networks, physics, finance, and computational biology. Often, one observes data on the nodes and the task is to learn the graph structure, or perform graphical model selection. W…
▽ More
Gaussian graphical models provide a powerful framework for uncovering conditional dependence relationships between sets of nodes; they have found applications in a wide variety of fields including sensor and communication networks, physics, finance, and computational biology. Often, one observes data on the nodes and the task is to learn the graph structure, or perform graphical model selection. While this is a well-studied problem with many popular techniques, there are typically three major practical challenges: i) many existing algorithms become computationally intractable in huge-data settings with tens of thousands of nodes; ii) the need for separate data-driven hyperparameter tuning considerably adds to the computational burden; iii) the statistical accuracy of selected edges often deteriorates as the dimension and/or the complexity of the underlying graph structures increase. We tackle these problems by develo** the novel Minipatch Graph (MPGraph) estimator. Our approach breaks up the huge graph learning problem into many smaller problems by creating an ensemble of tiny random subsets of both the observations and the nodes, termed minipatches. We then leverage recent advances that use hard thresholding to solve the latent variable graphical model problem to consistently learn the graph on each minipatch. Our approach is computationally fast, embarrassingly parallelizable, memory efficient, and has integrated stability-based hyperparamter tuning. Additionally, we prove that under weaker assumptions than that of the Graphical Lasso, our MPGraph estimator achieves graph selection consistency. We compare our approach to state-of-the-art computational approaches for Gaussian graphical model selection including the BigQUIC algorithm, and empirically demonstrate that our approach is not only more statistically accurate but also extensively faster for huge graph learning problems.
△ Less
Submitted 2 January, 2023; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Exploration of interacting dynamical dark energy model with interaction term including the equation-of-state parameter: alleviation of the $H_{0}$ tension
Authors:
Rui-Yun Guo,
Lu Feng,
Tian-Ying Yao,
Xing-Yu Chen
Abstract:
We explore a scenario of interacting dynamical dark energy model with the interaction term $Q$ including the varying equation-of-state parameter $w$. Using the data combination of the cosmic microwave background, the baryon acoustic oscillation, and the type Ia supernovae, to global fit the interacting dynamical dark energy model, we find that adding a factor of the varying $w$ in the function of…
▽ More
We explore a scenario of interacting dynamical dark energy model with the interaction term $Q$ including the varying equation-of-state parameter $w$. Using the data combination of the cosmic microwave background, the baryon acoustic oscillation, and the type Ia supernovae, to global fit the interacting dynamical dark energy model, we find that adding a factor of the varying $w$ in the function of $Q$ can change correlations between the coupling constant $β$ and other parameters, and then has a huge impact on the fitting result of $β$. In this model, the fitting value of $H_{0}$ is lower at the $3.54 σ$ level than the direct measurement value of $H_{0}$ . Comparing to the case of interacting dynamical dark energy model with $Q$ excluding $w$, the model with $Q$ including the constant $w$ is more favored by the current mainstream observation. To obtain higher fitting values of $H_{0}$ and narrow the discrepancy of $H_{0}$ between different observations, additional parameters including the effective number of relativistic species, the total neutrino mass, and massive sterile neutrinos are considered in the interacting dynamical dark energy cosmology. We find that the $H_{0}$ tension can be further reduced in these models, but is still at the about $3 σ$ level.
△ Less
Submitted 20 December, 2021; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Topological defect-propelled swimming of nematic colloids
Authors:
Tianyi Yao,
Žiga Kos,
Yimin Luo,
Edward B. Steager,
Miha Ravnik,
Kathleen J. Stebe
Abstract:
Non-equilibrium dynamics of topological defects can be used as a fundamental propulsion mechanism in microscopic active matter. Here, we demonstrate swimming of topological defect-propelled colloidal particles in (passive) nematic fluids through experiments and numerical simulations. Dynamic swim strokes of the topological defects are driven by colloidal rotation in an external magnetic field, cau…
▽ More
Non-equilibrium dynamics of topological defects can be used as a fundamental propulsion mechanism in microscopic active matter. Here, we demonstrate swimming of topological defect-propelled colloidal particles in (passive) nematic fluids through experiments and numerical simulations. Dynamic swim strokes of the topological defects are driven by colloidal rotation in an external magnetic field, causing periodic defect rearrangement which propels the particles. The swimming velocity is determined by the colloid's angular velocity, sense of rotation and defect polarity. By controlling them we can locomote the particles along different trajectories. We demonstrate scattering -- that is, the effective pair interactions -- of two of our defect-propelled swimmers, which we show is highly anisotropic and depends on the microscopic structure of the defect stroke, including the local defect topology and polarity. More generally, this work aims to develop biomimetic active matter based on the underlying relevance of topology.
△ Less
Submitted 29 September, 2021;
originally announced September 2021.
-
Distributed Estimation of Sparse Inverse Covariances
Authors:
Tong Yao,
Shreyas Sundaram
Abstract:
Learning the relationships between various entities from time-series data is essential in many applications. Gaussian graphical models have been studied to infer these relationships. However, existing algorithms process data in a batch at a central location, limiting their applications in scenarios where data is gathered by different agents. In this paper, we propose a distributed sparse inverse c…
▽ More
Learning the relationships between various entities from time-series data is essential in many applications. Gaussian graphical models have been studied to infer these relationships. However, existing algorithms process data in a batch at a central location, limiting their applications in scenarios where data is gathered by different agents. In this paper, we propose a distributed sparse inverse covariance algorithm to learn the network structure (i.e., dependencies among observed entities) in real-time from data collected by distributed agents. Our approach is built on an online graphical alternating minimization algorithm, augmented with a consensus term that allows agents to learn the desired structure cooperatively. We allow the system designer to select the number of communication rounds and optimization steps per data point. We characterize the rate of convergence of our algorithm and provide simulations on synthetic datasets.
△ Less
Submitted 30 September, 2021; v1 submitted 24 September, 2021;
originally announced September 2021.
-
Spatiotemporal Inconsistency Learning for DeepFake Video Detection
Authors:
Zhihao Gu,
Yang Chen,
Tai** Yao,
Shouhong Ding,
Jilin Li,
Feiyue Huang,
Lizhuang Ma
Abstract:
The rapid development of facial manipulation techniques has aroused public concerns in recent years. Following the success of deep learning, existing methods always formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To addre…
▽ More
The rapid development of facial manipulation techniques has aroused public concerns in recent years. Following the success of deep learning, existing methods always formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we term this task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions. And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and could be plugged into existing 2D CNNs. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against the state-of-the-art competitors.
△ Less
Submitted 11 October, 2021; v1 submitted 4 September, 2021;
originally announced September 2021.
-
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Authors:
Yehao Li,
Yingwei Pan,
**gwen Chen,
Ting Yao,
Tao Mei
Abstract:
With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analyti…
▽ More
With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler -- a versatile and high-performance codebase that encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with the functionality that covers a series of modules widely adopted in state-of-the-arts and allows seamless switching in between. This way naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source codes, sample projects and pre-trained models are available on-line: https://github.com/YehLi/xmodaler.
△ Less
Submitted 18 August, 2021;
originally announced August 2021.
-
A Low Rank Promoting Prior for Unsupervised Contrastive Learning
Authors:
Yu Wang,
**gyang Lin,
Qi Cai,
Yingwei Pan,
Ting Yao,
Hongyang Chao,
Tao Mei
Abstract:
Unsupervised learning is just at a tip** point where it could really take off. Among these approaches, contrastive learning has seen tremendous progress and led to state-of-the-art performance. In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC. In contrast t…
▽ More
Unsupervised learning is just at a tip** point where it could really take off. Among these approaches, contrastive learning has seen tremendous progress and led to state-of-the-art performance. In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC. In contrast to the existing conventional self-supervised approaches that only considers independent learning, our hypothesis explicitly requires that all the samples belonging to the same instance class lie on the same subspace with small dimension. This heuristic poses particular joint learning constraints to reduce the degree of freedom of the problem during the search of the optimal network parameterization. Most importantly, we argue that the low rank prior employed here is not unique, and many different priors can be invoked in a similar probabilistic way, corresponding to different hypotheses about underlying truth behind the contrastive features. Empirical evidences show that the proposed algorithm clearly surpasses the state-of-the-art approaches on multiple benchmarks, including image classification, object detection, instance segmentation and keypoint detection.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
Adaptive Normalized Representation Learning for Generalizable Face Anti-Spoofing
Authors:
Shubao Liu,
Ke-Yue Zhang,
Tai** Yao,
Mingwei Bi,
Shouhong Ding,
Jilin Li,
Feiyue Huang,
Lizhuang Ma
Abstract:
With various face presentation attacks arising under unseen scenarios, face anti-spoofing (FAS) based on domain generalization (DG) has drawn growing attention due to its robustness. Most existing methods utilize DG frameworks to align the features to seek a compact and generalized feature space. However, little attention has been paid to the feature extraction process for the FAS task, especially…
▽ More
With various face presentation attacks arising under unseen scenarios, face anti-spoofing (FAS) based on domain generalization (DG) has drawn growing attention due to its robustness. Most existing methods utilize DG frameworks to align the features to seek a compact and generalized feature space. However, little attention has been paid to the feature extraction process for the FAS task, especially the influence of normalization, which also has a great impact on the generalization of the learned representation. To address this issue, we propose a novel perspective of face anti-spoofing that focuses on the normalization selection in the feature extraction process. Concretely, an Adaptive Normalized Representation Learning (ANRL) framework is devised, which adaptively selects feature normalization methods according to the inputs, aiming to learn domain-agnostic and discriminative representation. Moreover, to facilitate the representation learning, Dual Calibration Constraints are designed, including Inter-Domain Compatible loss and Inter-Class Separable loss, which provide a better optimization direction for generalizable representation. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against the SOTA competitors.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
Contextual Transformer Networks for Visual Recognition
Authors:
Yehao Li,
Ting Yao,
Yingwei Pan,
Tao Mei
Abstract:
Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks. Nevertheless, most of existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and…
▽ More
Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks. Nevertheless, most of existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys at each spatial location, but leave the rich contexts among neighbor keys under-exploited. In this work, we design a novel Transformer-style module, i.e., Contextual Transformer (CoT) block, for visual recognition. Such design fully capitalizes on the contextual information among input keys to guide the learning of dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, CoT block first contextually encodes input keys via a $3\times3$ convolution, leading to a static contextual representation of inputs. We further concatenate the encoded keys with input queries to learn the dynamic multi-head attention matrix through two consecutive $1\times1$ convolutions. The learnt attention matrix is multiplied by input values to achieve the dynamic contextual representation of inputs. The fusion of the static and dynamic contextual representations are finally taken as outputs. Our CoT block is appealing in the view that it can readily replace each $3\times3$ convolution in ResNet architectures, yielding a Transformer-style backbone named as Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection and instance segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at \url{https://github.com/JDAI-CV/CoTNet}.
△ Less
Submitted 26 July, 2021;
originally announced July 2021.
-
Structure Destruction and Content Combination for Face Anti-Spoofing
Authors:
Ke-Yue Zhang,
Tai** Yao,
Jian Zhang,
Shice Liu,
Bangjie Yin,
Shouhong Ding,
Jilin Li
Abstract:
In pursuit of consolidating the face verification systems, prior face anti-spoofing studies excavate the hidden cues in original images to discriminate real persons and diverse attack types with the assistance of auxiliary supervision. However, limited by the following two inherent disturbances in their training process: 1) Complete facial structure in a single image. 2) Implicit subdomains in the…
▽ More
In pursuit of consolidating the face verification systems, prior face anti-spoofing studies excavate the hidden cues in original images to discriminate real persons and diverse attack types with the assistance of auxiliary supervision. However, limited by the following two inherent disturbances in their training process: 1) Complete facial structure in a single image. 2) Implicit subdomains in the whole dataset, these methods are prone to stick on memorization of the entire training dataset and show sensitivity to nonhomologous domain distribution. In this paper, we propose Structure Destruction Module and Content Combination Module to address these two imitations separately. The former mechanism destroys images into patches to construct a non-structural input, while the latter mechanism recombines patches from different subdomains or classes into a mixup construct. Based on this splitting-and-splicing operation, Local Relation Modeling Module is further proposed to model the second-order relationship between patches. We evaluate our method on extensive public datasets and promising experimental results to demonstrate the reliability of our method against state-of-the-art competitors.
△ Less
Submitted 22 July, 2021;
originally announced July 2021.
-
Compound Figure Separation of Biomedical Images with Side Loss
Authors:
Tianyuan Yao,
Chang Qu,
Quan Liu,
Ruining Deng,
Yuanhan Tian,
Jiachen Xu,
Aadarsh Jha,
Shunxing Bao,
Mengyang Zhao,
Agnes B. Fogo,
Bennett A. Landman,
Catie Chang,
Haichun Yang,
Yuankai Huo
Abstract:
Unsupervised learning algorithms (e.g., self-supervised learning, auto-encoder, contrastive learning) allow deep learning models to learn effective image representations from large-scale unlabeled data. In medical image analysis, even unannotated data can be difficult to obtain for individual labs. Fortunately, national-level efforts have been made to provide efficient access to obtain biomedical…
▽ More
Unsupervised learning algorithms (e.g., self-supervised learning, auto-encoder, contrastive learning) allow deep learning models to learn effective image representations from large-scale unlabeled data. In medical image analysis, even unannotated data can be difficult to obtain for individual labs. Fortunately, national-level efforts have been made to provide efficient access to obtain biomedical image data from previous scientific publications. For instance, NIH has launched the Open-i search engine that provides a large-scale image database with free access. However, the images in scientific publications consist of a considerable amount of compound figures with subplots. To extract and curate individual subplots, many different compound figure separation approaches have been developed, especially with the recent advances in deep learning. However, previous approaches typically required resource extensive bounding box annotation to train detection models. In this paper, we propose a simple compound figure separation (SimCFS) framework that uses weak classification annotations from individual images. Our technical contribution is three-fold: (1) we introduce a new side loss that is designed for compound figure separation; (2) we introduce an intra-class image augmentation method to simulate hard cases; (3) the proposed framework enables an efficient deployment to new classes of images, without requiring resource extensive bounding box annotations. From the results, the SimCFS achieved a new state-of-the-art performance on the ImageCLEF 2016 Compound Figure Separation Database. The source code of SimCFS is made publicly available at https://github.com/hrlblab/ImageSeperation.
△ Less
Submitted 19 July, 2021;
originally announced July 2021.
-
Dual Reweighting Domain Generalization for Face Presentation Attack Detection
Authors:
Shubao Liu,
Ke-Yue Zhang,
Tai** Yao,
Kekai Sheng,
Shouhong Ding,
Ying Tai,
Jilin Li,
Yuan Xie,
Lizhuang Ma
Abstract:
Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness for unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process, and endeavor to extract a common feature space to improve the generalization. However, due to complex and biased data distribution, directly treating them e…
▽ More
Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness for unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process, and endeavor to extract a common feature space to improve the generalization. However, due to complex and biased data distribution, directly treating them equally will corrupt the generalization ability. To settle the issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve the generalization. Concretely, Sample Reweighting Module is first proposed to identify samples with relatively large domain bias, and reduce their impact on the overall optimization. Afterwards, Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.
△ Less
Submitted 30 June, 2021;
originally announced June 2021.
-
VoxelEmbed: 3D Instance Segmentation and Tracking with Voxel Embedding based Deep Learning
Authors:
Mengyang Zhao,
Quan Liu,
Aadarsh Jha,
Ruining Deng,
Tianyuan Yao,
Anita Mahadevan-Jansen,
Matthew J. Tyska,
Bryan A. Millis,
Yuankai Huo
Abstract:
Recent advances in bioimaging have provided scientists a superior high spatial-temporal resolution to observe dynamics of living cells as 3D volumetric videos. Unfortunately, the 3D biomedical video analysis is lagging, impeded by resource insensitive human curation using off-the-shelf 3D analytic tools. Herein, biologists often need to discard a considerable amount of rich 3D spatial information…
▽ More
Recent advances in bioimaging have provided scientists a superior high spatial-temporal resolution to observe dynamics of living cells as 3D volumetric videos. Unfortunately, the 3D biomedical video analysis is lagging, impeded by resource insensitive human curation using off-the-shelf 3D analytic tools. Herein, biologists often need to discard a considerable amount of rich 3D spatial information by compromising on 2D analysis via maximum intensity projection. Recently, pixel embedding-based cell instance segmentation and tracking provided a neat and generalizable computing paradigm for understanding cellular dynamics. In this work, we propose a novel spatial-temporal voxel-embedding (VoxelEmbed) based learning method to perform simultaneous cell instance segmenting and tracking on 3D volumetric video sequences. Our contribution is in four-fold: (1) The proposed voxel embedding generalizes the pixel embedding with 3D context information; (2) Present a simple multi-stream learning approach that allows effective spatial-temporal embedding; (3) Accomplished an end-to-end framework for one-stage 3D cell instance segmentation and tracking without heavy parameter tuning; (4) The proposed 3D quantification is memory efficient via a single GPU with 12 GB memory. We evaluate our VoxelEmbed method on four 3D datasets (with different cell types) from the ISBI Cell Tracking Challenge. The proposed VoxelEmbed method achieved consistent superior overall performance (OP) on two densely annotated datasets. The performance is also competitive on two sparsely annotated cohorts with 20.6% and 2% of data-set having segmentation annotations. The results demonstrate that the VoxelEmbed method is a generalizable and memory-efficient solution.
△ Less
Submitted 21 June, 2021;
originally announced June 2021.
-
Adv-Makeup: A New Imperceptible and Transferable Attack on Face Recognition
Authors:
Bangjie Yin,
Wenxuan Wang,
Tai** Yao,
Junfeng Guo,
Zelun Kong,
Shouhong Ding,
Jilin Li,
Cong Liu
Abstract:
Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models, or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method - Adv-Makeup, which can r…
▽ More
Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models, or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method - Adv-Makeup, which can realize imperceptible and transferable attack under black-box setting. Adv-Makeup develops a task-driven makeup generation method with the blending module to synthesize imperceptible eye shadow over the orbital region on faces. And to achieve transferability, Adv-Makeup implements a fine-grained meta-learning adversarial attack strategy to learn more general attack features from various models. Compared to existing techniques, sufficient visualization results demonstrate that Adv-Makeup is capable to generate much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under black-box setting, even attacking commercial systems.
△ Less
Submitted 7 May, 2021;
originally announced May 2021.
-
Local Relation Learning for Face Forgery Detection
Authors:
Shen Chen,
Tai** Yao,
Yang Chen,
Shouhong Ding,
Jilin Li,
Rongrong Ji
Abstract:
With the rapid development of facial manipulation techniques, face forgery detection has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these…
▽ More
With the rapid development of facial manipulation techniques, face forgery detection has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these global supervisions are insufficient to learn a generalized feature and prone to overfitting. To address this issue, we propose a novel perspective of face forgery detection via local relation learning. Specifically, we propose a Multi-scale Patch Similarity Module (MPSM), which measures the similarity between features of local regions and forms a robust and generalized similarity pattern. Moreover, we propose an RGB-Frequency Attention Module (RFAM) to fuse information in both RGB and frequency domains for more comprehensive local feature representation, which further improves the reliability of the similarity pattern. Extensive experiments show that the proposed method consistently outperforms the state-of-the-arts on widely-used benchmarks. Furthermore, detailed visualization shows the robustness and interpretability of our method.
△ Less
Submitted 6 May, 2021;
originally announced May 2021.
-
Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing
Authors:
Zhihong Chen,
Tai** Yao,
Kekai Sheng,
Shouhong Ding,
Ying Tai,
Jilin Li,
Feiyue Huang,
Xinyu **
Abstract:
Face anti-spoofing approach based on domain generalization(DG) has drawn growing attention due to its robustness forunseen scenarios. Existing DG methods assume that the do-main label is known.However, in real-world applications, thecollected dataset always contains mixture domains, where thedomain label is unknown. In this case, most of existing meth-ods may not work. Further, even if we can obta…
▽ More
Face anti-spoofing approach based on domain generalization(DG) has drawn growing attention due to its robustness forunseen scenarios. Existing DG methods assume that the do-main label is known.However, in real-world applications, thecollected dataset always contains mixture domains, where thedomain label is unknown. In this case, most of existing meth-ods may not work. Further, even if we can obtain the domainlabel as existing methods, we think this is just a sub-optimalpartition. To overcome the limitation, we propose domain dy-namic adjustment meta-learning (D2AM) without using do-main labels, which iteratively divides mixture domains viadiscriminative domain representation and trains a generaliz-able face anti-spoofing with meta-learning. Specifically, wedesign a domain feature based on Instance Normalization(IN) and propose a domain representation learning module(DRLM) to extract discriminative domain features for cluster-ing. Moreover, to reduce the side effect of outliers on cluster-ing performance, we additionally utilize maximum mean dis-crepancy (MMD) to align the distribution of sample featuresto a prior distribution, which improves the reliability of clus tering. Extensive experiments show that the proposed methodoutperforms conventional DG-based face anti-spoofing meth-ods, including those utilizing domain labels. Furthermore, weenhance the interpretability through visualizatio
△ Less
Submitted 6 May, 2021;
originally announced May 2021.
-
Large non-trivial $t$-intersecting families for signed sets
Authors:
Tian Yao,
Benjian Lv,
Kaishun Wang
Abstract:
For positive integers $n,r,k$ with $n\ge r$ and $k\ge2$, a set $\{(x_1,y_1),(x_2,y_2),\dots,(x_r,y_r)\}$ is called a $k$-signed $r$-set on $[n]$ if $x_1,\dots,x_r$ are distinct elements of $[n]$ and $y_1\dots,y_r\in[k]$. We say a $t$-intersecting family consisting of $k$-signed $r$-sets on $[n]$ is trivial if each member of this family contains a fixed $k$-signed $t$-set. In this paper, we determi…
▽ More
For positive integers $n,r,k$ with $n\ge r$ and $k\ge2$, a set $\{(x_1,y_1),(x_2,y_2),\dots,(x_r,y_r)\}$ is called a $k$-signed $r$-set on $[n]$ if $x_1,\dots,x_r$ are distinct elements of $[n]$ and $y_1\dots,y_r\in[k]$. We say a $t$-intersecting family consisting of $k$-signed $r$-sets on $[n]$ is trivial if each member of this family contains a fixed $k$-signed $t$-set. In this paper, we determine the structure of large maximal non-trivial $t$-intersecting families. In particular, we characterize the non-trivial $t$-intersecting families with maximum size for $t\ge2$, extending a Hilton-Milner-type result for signed sets given by Borg.
△ Less
Submitted 27 April, 2021;
originally announced April 2021.
-
Delving into Data: Effectively Substitute Training for Black-box Attack
Authors:
Wenxuan Wang,
Bangjie Yin,
Tai** Yao,
Li Zhang,
Yanwei Fu,
Shouhong Ding,
Jilin Li,
Feiyue Huang,
Xiangyang Xue
Abstract:
Deep models have shown their vulnerability when processing adversarial samples. As for the black-box attack, without access to the architecture and weights of the attacked model, training a substitute model for adversarial attacks has attracted wide attention. Previous substitute training approaches focus on stealing the knowledge of the target model based on real training data or synthetic data,…
▽ More
Deep models have shown their vulnerability when processing adversarial samples. As for the black-box attack, without access to the architecture and weights of the attacked model, training a substitute model for adversarial attacks has attracted wide attention. Previous substitute training approaches focus on stealing the knowledge of the target model based on real training data or synthetic data, without exploring what kind of data can further improve the transferability between the substitute and target models. In this paper, we propose a novel perspective substitute training that focuses on designing the distribution of data used in the knowledge stealing process. More specifically, a diverse data generation module is proposed to synthesize large-scale data with wide distribution. And adversarial substitute training strategy is introduced to focus on the data distributed near the decision boundary. The combination of these two modules can further boost the consistency of the substitute model and target model, which greatly improves the effectiveness of adversarial attack. Extensive experiments demonstrate the efficacy of our method against state-of-the-art competitors under non-target and target attack settings. Detailed visualization and analysis are also provided to help understand the advantage of our method.
△ Less
Submitted 26 April, 2021;
originally announced April 2021.
-
Understanding Fission Gas Bubble Distribution, Lanthanide Transportation, and Thermal Conductivity Degradation in Neutron-irradiated α-U Using Machine Learning
Authors:
Lu Cai,
Fei Xu,
Fidelma Dilemma,
Daniel J. Murray,
Cynthia A. Adkins,
Larry K Aagesen Jr,
Min Xian,
Luca Caprriot,
Tiankai Yao
Abstract:
UZr based metallic nuclear fuel is the leading candidate for next-generation sodium-cooled fast reactors in the United States. US research reactors have been using and testing this fuel type since the 1960s and accumulated considerable experience and knowledge about the fuel performance. However, most of knowledge remains empirical. The lack of mechanistic understanding of fuel performance is prev…
▽ More
UZr based metallic nuclear fuel is the leading candidate for next-generation sodium-cooled fast reactors in the United States. US research reactors have been using and testing this fuel type since the 1960s and accumulated considerable experience and knowledge about the fuel performance. However, most of knowledge remains empirical. The lack of mechanistic understanding of fuel performance is preventing the qualification of UZr fuel for commercial use. This paper proposes a data-driven approach, coupled with advanced post irradiation examination, powered by machine learning algorithms, to facilitate the development of such understandings by providing unpreceded quantified new insights into fission gas bubbles. Specifically, based on the advanced postirradiation examination data collected on a neutron-irradiated U-10Zr annular fuel, we developed a method to automatically detect, classify ~19,000 fission gas bubbles into different categories, and quantitatively link the data to lanthanide transpiration along the radial temperature gradient. The approach is versatile and can be modified to study different coupled irradiation effects, such as secondary phase redistribution and degradation of thermal conductivity, in irradiated nuclear fuel.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
Deep Learning Techniques for In-Crop Weed Identification: A Review
Authors:
Kun Hu,
Zhiyong Wang,
Guy Coleman,
Asher Bender,
Tingting Yao,
Shan Zeng,
Dezhen Song,
Arnold Schumann,
Michael Walsh
Abstract:
Weeds are a significant threat to the agricultural productivity and the environment. The increasing demand for sustainable agriculture has driven innovations in accurate weed control technologies aimed at reducing the reliance on herbicides. With the great success of deep learning in various vision tasks, many promising image-based weed detection algorithms have been developed. This paper reviews…
▽ More
Weeds are a significant threat to the agricultural productivity and the environment. The increasing demand for sustainable agriculture has driven innovations in accurate weed control technologies aimed at reducing the reliance on herbicides. With the great success of deep learning in various vision tasks, many promising image-based weed detection algorithms have been developed. This paper reviews recent developments of deep learning techniques in the field of image-based weed detection. The review begins with an introduction to the fundamentals of deep learning related to weed detection. Next, recent progresses on deep weed detection are reviewed with the discussion of the research materials including public weed datasets. Finally, the challenges of develo** practically deployable weed detection methods are summarized, together with the discussions of the opportunities for future research.We hope that this review will provide a timely survey of the field and attract more researchers to address this inter-disciplinary research problem.
△ Less
Submitted 4 February, 2024; v1 submitted 27 March, 2021;
originally announced March 2021.
-
SimTriplet: Simple Triplet Representation Learning with a Single GPU
Authors:
Quan Liu,
Peter C. Louis,
Yuzhe Lu,
Aadarsh Jha,
Mengyang Zhao,
Ruining Deng,
Tianyuan Yao,
Joseph T. Roland,
Haichun Yang,
Shilin Zhao,
Lee E. Wheless,
Yuankai Huo
Abstract:
Contrastive learning is a key technique of modern self-supervised learning. The broader accessibility of earlier approaches is hindered by the need of heavy computational resources (e.g., at least 8 GPUs or 32 TPU cores), which accommodate for large-scale negative samples or momentum. The more recent SimSiam approach addresses such key limitations via stop-gradient without momentum encoders. In me…
▽ More
Contrastive learning is a key technique of modern self-supervised learning. The broader accessibility of earlier approaches is hindered by the need of heavy computational resources (e.g., at least 8 GPUs or 32 TPU cores), which accommodate for large-scale negative samples or momentum. The more recent SimSiam approach addresses such key limitations via stop-gradient without momentum encoders. In medical image analysis, multiple instances can be achieved from the same patient or tissue. Inspired by these advances, we propose a simple triplet representation learning (SimTriplet) approach on pathological images. The contribution of the paper is three-fold: (1) The proposed SimTriplet method takes advantage of the multi-view nature of medical images beyond self-augmentation; (2) The method maximizes both intra-sample and inter-sample similarities via triplets from positive pairs, without using negative samples; and (3) The recent mix precision training is employed to advance the training by only using a single GPU with 16GB memory. By learning from 79,000 unlabeled pathological patch images, SimTriplet achieved 10.58% better performance compared with supervised learning. It also achieved 2.13% better performance compared with SimSiam. Our proposed SimTriplet can achieve decent performance using only 1% labeled data. The code and data are available at https://github.com/hrlblab/SimTriple.
△ Less
Submitted 9 March, 2021;
originally announced March 2021.
-
DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results
Authors:
Liming Jiang,
Zhengkui Guo,
Wayne Wu,
Zhaoyang Liu,
Ziwei Liu,
Chen Change Loy,
Shuo Yang,
Yuanjun Xiong,
Wei Xia,
Baoying Chen,
Peiyu Zhuang,
Sili Li,
Shen Chen,
Tai** Yao,
Shouhong Ding,
Jilin Li,
Feiyue Huang,
Liujuan Cao,
Rongrong Ji,
Changlei Lu,
Ganchao Tan
Abstract:
This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set…
▽ More
This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions.
△ Less
Submitted 18 February, 2021;
originally announced February 2021.
-
Aurora Guard: Reliable Face Anti-Spoofing via Mobile Lighting System
Authors:
Jian Zhang,
Ying Tai,
Tai** Yao,
Jia Meng,
Shouhong Ding,
Chengjie Wang,
Jilin Li,
Feiyue Huang,
Rongrong Ji
Abstract:
Face authentication on mobile end has been widely applied in various scenarios. Despite the increasing reliability of cutting-edge face authentication/verification systems to variations like blinking eye and subtle facial expression, anti-spoofing against high-resolution rendering replay of paper photos or digital videos retains as an open problem. In this paper, we propose a simple yet effective…
▽ More
Face authentication on mobile end has been widely applied in various scenarios. Despite the increasing reliability of cutting-edge face authentication/verification systems to variations like blinking eye and subtle facial expression, anti-spoofing against high-resolution rendering replay of paper photos or digital videos retains as an open problem. In this paper, we propose a simple yet effective face anti-spoofing system, termed Aurora Guard (AG). Our system firstly extracts the normal cues via light reflection analysis, and then adopts an end-to-end trainable multi-task Convolutional Neural Network (CNN) to accurately recover subjects' intrinsic depth and material map to assist liveness classification, along with the light CAPTCHA checking mechanism in the regression branch to further improve the system reliability. Experiments on public Replay-Attack and CASIA datasets demonstrate the merits of our proposed method over the state-of-the-arts. We also conduct extensive experiments on a large-scale dataset containing 12,000 live and diverse spoofing samples, which further validates the generalization ability of our method in the wild.
△ Less
Submitted 1 February, 2021;
originally announced February 2021.
-
Towards Speeding up Adversarial Training in Latent Spaces
Authors:
Yaguan Qian,
Qiqi Shao,
Tengteng Yao,
Bin Wang,
Shouling Ji,
Shaoning Zeng,
Zhaoquan Gu,
Wassim Swaileh
Abstract:
Adversarial training is wildly considered as one of the most effective way to defend against adversarial examples. However, existing adversarial training methods consume unbearable time, due to the fact that they need to generate adversarial examples in the large input space. To speed up adversarial training, we propose a novel adversarial training method that does not need to generate real advers…
▽ More
Adversarial training is wildly considered as one of the most effective way to defend against adversarial examples. However, existing adversarial training methods consume unbearable time, due to the fact that they need to generate adversarial examples in the large input space. To speed up adversarial training, we propose a novel adversarial training method that does not need to generate real adversarial examples. By adding perturbations to logits to generate Endogenous Adversarial Examples (EAEs) -- the adversarial examples in the latent space, the time consuming gradient calculation can be avoided. Extensive experiments are conducted on CIFAR-10 and ImageNet, and the results show that comparing to state-of-the-art methods, our EAE adversarial training not only shortens the training time, but also enhances the robustness of the model and has less impact on the accuracy of clean examples than the existing methods.
△ Less
Submitted 8 March, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
Authors:
Yehao Li,
Yingwei Pan,
Ting Yao,
**gwen Chen,
Tao Mei
Abstract:
Despite having impressive vision-language (VL) pretraining with BERT-based encoder for VL understanding, the pretraining of a universal encoder-decoder for both VL understanding and generation remains challenging. The difficulty originates from the inherently different peculiarities of the two disciplines, e.g., VL understanding tasks capitalize on the unrestricted message passing across modalitie…
▽ More
Despite having impressive vision-language (VL) pretraining with BERT-based encoder for VL understanding, the pretraining of a universal encoder-decoder for both VL understanding and generation remains challenging. The difficulty originates from the inherently different peculiarities of the two disciplines, e.g., VL understanding tasks capitalize on the unrestricted message passing across modalities, while generation tasks only employ visual-to-textual message passing. In this paper, we start with a two-stream decoupled design of encoder-decoder structure, in which two decoupled cross-modal encoder and decoder are involved to separately perform each type of proxy tasks, for simultaneous VL understanding and generation pretraining. Moreover, for VL pretraining, the dominant way is to replace some input visual/word tokens with mask tokens and enforce the multi-modal encoder/decoder to reconstruct the original tokens, but no mask token is involved when fine-tuning on downstream tasks. As an alternative, we propose a primary scheduled sampling strategy that elegantly mitigates such discrepancy via pretraining encoder-decoder in a two-pass manner. Extensive experiments demonstrate the compelling generalizability of our pretrained encoder-decoder by fine-tuning on four VL understanding and generation downstream tasks. Source code is available at \url{https://github.com/YehLi/TDEN}.
△ Less
Submitted 27 January, 2021;
originally announced January 2021.
-
ComQA:Compositional Question Answering via Hierarchical Graph Neural Networks
Authors:
Bingning Wang,
Ting Yao,
Weipeng Chen,
**gfang Xu,
Xiaochuan Wang
Abstract:
With the development of deep learning techniques and large scale datasets, the question answering (QA) systems have been quickly improved, providing more accurate and satisfying answers. However, current QA systems either focus on the sentence-level answer, i.e., answer selection, or phrase-level answer, i.e., machine reading comprehension. How to produce compositional answers has not been through…
▽ More
With the development of deep learning techniques and large scale datasets, the question answering (QA) systems have been quickly improved, providing more accurate and satisfying answers. However, current QA systems either focus on the sentence-level answer, i.e., answer selection, or phrase-level answer, i.e., machine reading comprehension. How to produce compositional answers has not been throughout investigated. In compositional question answering, the systems should assemble several supporting evidence from the document to generate the final answer, which is more difficult than sentence-level or phrase-level QA. In this paper, we present a large-scale compositional question answering dataset containing more than 120k human-labeled questions. The answer in this dataset is composed of discontiguous sentences in the corresponding document. To tackle the ComQA problem, we proposed a hierarchical graph neural networks, which represents the document from the low-level word to the high-level sentence. We also devise a question selection and node selection task for pre-training. Our proposed model achieves a significant improvement over previous machine reading comprehension methods and pre-training methods. Codes and dataset can be found at \url{https://github.com/benywon/ComQA}.
△ Less
Submitted 16 January, 2021;
originally announced January 2021.
-
A Model of Two Tales: Dual Transfer Learning Framework for Improved Long-tail Item Recommendation
Authors:
Yin Zhang,
Derek Zhiyuan Cheng,
Tiansheng Yao,
Xinyang Yi,
Lichan Hong,
Ed H. Chi
Abstract:
Highly skewed long-tail item distribution is very common in recommendation systems. It significantly hurts model performance on tail items. To improve tail-item recommendation, we conduct research to transfer knowledge from head items to tail items, leveraging the rich user feedback in head items and the semantic connections between head and tail items. Specifically, we propose a novel dual transf…
▽ More
Highly skewed long-tail item distribution is very common in recommendation systems. It significantly hurts model performance on tail items. To improve tail-item recommendation, we conduct research to transfer knowledge from head items to tail items, leveraging the rich user feedback in head items and the semantic connections between head and tail items. Specifically, we propose a novel dual transfer learning framework that jointly learns the knowledge transfer from both model-level and item-level: 1. The model-level knowledge transfer builds a generic meta-map** of model parameters from few-shot to many-shot model. It captures the implicit data augmentation on the model-level to improve the representation learning of tail items. 2. The item-level transfer connects head and tail items through item-level features, to ensure a smooth transfer of meta-map** from head items to tail items. The two types of transfers are incorporated to ensure the learned knowledge from head items can be well applied for tail item representation learning in the long-tail distribution settings. Through extensive experiments on two benchmark datasets, results show that our proposed dual transfer learning framework significantly outperforms other state-of-the-art methods for tail item recommendation in hit ratio and NDCG. It is also very encouraging that our framework further improves head items and overall performance on top of the gains on tail items.
△ Less
Submitted 7 March, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
Develo** Univariate Neurodegeneration Biomarkers with Low-Rank and Sparse Subspace Decomposition
Authors:
Gang Wang,
Qunxi Dong,
Jianfeng Wu,
Yi Su,
Kewei Chen,
Qingtang Su,
Xiaofeng Zhang,
**guang Hao,
Tao Yao,
Li Liu,
Caiming Zhang,
Richard J Caselli,
Eric M Reiman,
Yalin Wang
Abstract:
Cognitive decline due to Alzheimer's disease (AD) is closely associated with brain structure alterations captured by structural magnetic resonance imaging (sMRI). It supports the validity to develop sMRI-based univariate neurodegeneration biomarkers (UNB). However, existing UNB work either fails to model large group variances or does not capture AD dementia (ADD) induced changes. We propose a nove…
▽ More
Cognitive decline due to Alzheimer's disease (AD) is closely associated with brain structure alterations captured by structural magnetic resonance imaging (sMRI). It supports the validity to develop sMRI-based univariate neurodegeneration biomarkers (UNB). However, existing UNB work either fails to model large group variances or does not capture AD dementia (ADD) induced changes. We propose a novel low-rank and sparse subspace decomposition method capable of stably quantifying the morphological changes induced by ADD. Specifically, we propose a numerically efficient rank minimization mechanism to extract group common structure and impose regularization constraints to encode the original 3D morphometry connectivity. Further, we generate regions-of-interest (ROI) with group difference study between common subspaces of $Aβ+$ AD and $Aβ-$ cognitively unimpaired (CU) groups. A univariate morphometry index (UMI) is constructed from these ROIs by summarizing individual morphological characteristics weighted by normalized difference between $Aβ+$ AD and $Aβ-$ CU groups. We use hippocampal surface radial distance feature to compute the UMIs and validate our work in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. With hippocampal UMIs, the estimated minimum sample sizes needed to detect a 25$\%$ reduction in the mean annual change with 80$\%$ power and two-tailed $P=0.05$ are 116, 279 and 387 for the longitudinal $Aβ+$ AD, $Aβ+$ mild cognitive impairment (MCI) and $Aβ+$ CU groups, respectively. Additionally, for MCI patients, UMIs well correlate with hazard ratio of conversion to AD ($4.3$, $95\%$ CI=$2.3-8.2$) within 18 months. Our experimental results outperform traditional hippocampal volume measures and suggest the application of UMI as a potential UNB.
△ Less
Submitted 26 October, 2020;
originally announced October 2020.