Search | arXiv e-print repository

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens

Authors: Xikang Yang, Xuehai Tang, Fuqing Zhu, Jizhong Han, Songlin Hu

Abstract: Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in these images tends to favor the semantics of the o… ▽ More Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in these images tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts, thereby improving the probability distribution of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images.Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 13 pages

arXiv:2406.13275 [pdf, other]

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)… ▽ More Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A. △ Less

Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.07362 [pdf, other]

AI.vs.Clinician: Unveiling Intricate Interactions Between AI and Clinicians through an Open-Access Database

Authors: Wanling Gao, Yuan Liu, Zhuoming Yu, Dandan Cui, Wen**g Liu, Xiaoshuang Liang, Jiahui Zhao, Jiyue Xie, Hao Li, Li Ma, Ning Ye, Yumiao Kang, Dingfeng Luo, Peng Pan, Wei Huang, Zhongmou Liu, Jizhong Hu, Fan Huang, Gangyuan Zhao, Chongrong Jiang, Tianyi Wei, Zhifei Zhang, Yunyou Huang, Jianfeng Zhan

Abstract: Artificial Intelligence (AI) plays a crucial role in medical field and has the potential to revolutionize healthcare practices. However, the success of AI models and their impacts hinge on the synergy between AI and medical specialists, with clinicians assuming a dominant role. Unfortunately, the intricate dynamics and interactions between AI and clinicians remain undiscovered and thus hinder AI f… ▽ More Artificial Intelligence (AI) plays a crucial role in medical field and has the potential to revolutionize healthcare practices. However, the success of AI models and their impacts hinge on the synergy between AI and medical specialists, with clinicians assuming a dominant role. Unfortunately, the intricate dynamics and interactions between AI and clinicians remain undiscovered and thus hinder AI from being translated into medical practice. To address this gap, we have curated a groundbreaking database called AI.vs.Clinician. This database is the first of its kind for studying the interactions between AI and clinicians. It derives from 7,500 collaborative diagnosis records on a life-threatening medical emergency -- Sepsis -- from 14 medical centers across China. For the patient cohorts well-chosen from MIMIC databases, the AI-related information comprises the model property, feature input, diagnosis decision, and inferred probabilities of sepsis onset presently and within next three hours. The clinician-related information includes the viewed examination data and sequence, viewed time, preliminary and final diagnosis decisions with or without AI assistance, and recommended treatment. △ Less

Submitted 15 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: 12 pages

arXiv:2406.07012 [pdf, other]

Bridging Language Gaps in Audio-Text Retrieval

Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi… ▽ More Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available https://github.com/zyyan4/ml-clap. △ Less

Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: interspeech2024

arXiv:2405.05610 [pdf, other]

Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM

Authors: Xikang Yang, Xuehai Tang, Songlin Hu, Jizhong Han

Abstract: Large language models (LLMs) have achieved remarkable performance in various natural language processing tasks, especially in dialogue systems. However, LLM may also pose security and moral threats, especially in multi round conversations where large models are more easily guided by contextual content, resulting in harmful or biased responses. In this paper, we present a novel method to attack LLM… ▽ More Large language models (LLMs) have achieved remarkable performance in various natural language processing tasks, especially in dialogue systems. However, LLM may also pose security and moral threats, especially in multi round conversations where large models are more easily guided by contextual content, resulting in harmful or biased responses. In this paper, we present a novel method to attack LLMs in multi-turn dialogues, called CoA (Chain of Attack). CoA is a semantic-driven contextual multi-turn attack method that adaptively adjusts the attack policy through contextual feedback and semantic relevance during multi-turn of dialogue with a large model, resulting in the model producing unreasonable or harmful content. We evaluate CoA on different LLMs and datasets, and show that it can effectively expose the vulnerabilities of LLMs, and outperform existing attack methods. Our work provides a new perspective and tool for attacking and defending LLMs, and contributes to the security and ethical assessment of dialogue systems. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.19171 [pdf, other]

Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection

Authors: Cai Yu, Shan Jia, Xiaomeng Fu, ** Liu, Jiahe Tian, Jiao Dai, Xi Wang, Siwei Lyu, Jizhong Han

Abstract: With the rising prevalence of deepfakes, there is a growing interest in develo** generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance… ▽ More With the rising prevalence of deepfakes, there is a growing interest in develo** generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance deepfake detection towards various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps to prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes. The experimental results on CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of our method over existing state-of-the-art methods. Our code and data can be found at \url{https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection}. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: accepted by ICME 2024

arXiv:2403.10512 [pdf, other]

doi 10.1145/3613904.3642615

Surveyor: Facilitating Discovery Within Video Games for Blind and Low Vision Players

Authors: Vishnu Nair, Hanxiu 'Hazel' Zhu, Peize Song, Jizhong Wang, Brian A. Smith

Abstract: Video games are increasingly accessible to blind and low vision (BLV) players, yet many aspects remain inaccessible. One aspect is the joy players feel when they explore environments and make new discoveries, which is integral to many games. Sighted players experience discovery by surveying environments and identifying unexplored areas. Current accessibility tools, however, guide BLV players direc… ▽ More Video games are increasingly accessible to blind and low vision (BLV) players, yet many aspects remain inaccessible. One aspect is the joy players feel when they explore environments and make new discoveries, which is integral to many games. Sighted players experience discovery by surveying environments and identifying unexplored areas. Current accessibility tools, however, guide BLV players directly to items and places, robbing them of that experience. Thus, a crucial challenge is to develop navigation assistance tools that also foster exploration and discovery. To address this challenge, we propose the concept of exploration assistance in games and design Surveyor, an in-game exploration assistance tool that enhances discovery by tracking where BLV players look and highlighting unexplored areas. We designed Surveyor using insights from a formative study and compared Surveyor's effectiveness to approaches found in existing accessible games. Our findings reveal implications for facilitating richer play experiences for BLV users within games. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Journal ref: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24), May 2024

arXiv:2403.08487 [pdf, other]

Model Will Tell: Training Membership Inference for Diffusion Models

Authors: Xiaomeng Fu, Xi Wang, Qiao Li, ** Liu, Jiao Dai, Jizhong Han

Abstract: Diffusion models pose risks of privacy breaches and copyright disputes, primarily stemming from the potential utilization of unauthorized data during the training phase. The Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model, representing a critical tool for privacy violation verification. However, the increa… ▽ More Diffusion models pose risks of privacy breaches and copyright disputes, primarily stemming from the potential utilization of unauthorized data during the training phase. The Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model, representing a critical tool for privacy violation verification. However, the increased stochasticity inherent in diffusion renders traditional shadow-model-based or metric-based methods ineffective when applied to diffusion models. Moreover, existing methods only yield binary classification labels which lack necessary comprehensibility in practical applications. In this paper, we explore a novel perspective for the TMI task by leveraging the intrinsic generative priors within the diffusion model. Compared with unseen samples, training samples exhibit stronger generative priors within the diffusion model, enabling the successful reconstruction of substantially degraded training images. Consequently, we propose the Degrade Restore Compare (DRC) framework. In this framework, an image undergoes sequential degradation and restoration, and its membership is determined by comparing it with the restored counterpart. Experimental results verify that our approach not only significantly outperforms existing methods in terms of accuracy but also provides comprehensible decision criteria, offering evidence for potential privacy violations. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 18 pages, 6 figures, 7 tables

arXiv:2312.01663 [pdf, other]

Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training

Authors: Runze He, Shaofei Huang, Xuecheng Nie, Tianrui Hui, Luoqi Liu, Jiao Dai, Jizhong Han, Guanbin Li, Si Liu

Abstract: In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency… ▽ More In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing, aimed at foreground-only manipulation while preserving the background. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: 14 pages, 13 figures, project website: https://customnerf.github.io/

arXiv:2311.15570 [pdf, other]

UFDA: Universal Federated Domain Adaptation with Practical Assumptions

Authors: Xinhui Liu, Zhenghao Chen, Lu** Zhou, Dong Xu, Wei Xi, Gairui Bai, Yihan Zhao, Jizhong Zhao

Abstract: Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions from previous FDAs and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label… ▽ More Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions from previous FDAs and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent, and the target-domain label set is totally blind. Towards a more effective solution for our newly proposed UFDA scenario, we propose a corresponding methodology called Hot-Learning with Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain shifts and category gaps problems by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. Extensive experiments on three benchmark datasets demonstrate that our method achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to previous methodologies with comprehensive additional assumptions. △ Less

Submitted 19 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: Accepted by AAAI2024

Journal ref: AAAI2024

arXiv:2311.01091 [pdf, other]

Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding

Authors: Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu

Abstract: Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmen… ▽ More Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a PhraseObject Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins. △ Less

Submitted 10 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted by IJCAI 2023. Since the PNG benchmark adopts a different data partition manner from ours, we update the experimental results on the things/stuff/singulars/plurals subsets based on the PNG's code

arXiv:2309.16148 [pdf, other]

OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions

Authors: ** Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation has no explicit head movement reference, thus it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unreal talking head performance. Other works construct one-to-one map** between audio signal and head motion sequences, introducing ambiguity correspondences into the m… ▽ More One-shot talking head generation has no explicit head movement reference, thus it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unreal talking head performance. Other works construct one-to-one map** between audio signal and head motion sequences, introducing ambiguity correspondences into the map** since people can behave differently in head motions when speaking the same content. This unreasonable map** form fails to model the diversity and produces either nearly static or even exaggerated head motions, which are unnatural and strange. Therefore, the one-shot talking head generation task is actually a one-to-many ill-posed problem and people present diverse head motions when speaking. Based on the above observation, we propose OSM-Net, a \textit{one-to-many} one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space that contains rich and various clip-level head motion features. Each basis of the space represents a feature of meaningful head motion in a clip rather than just a frame, thus providing more coherent and natural motion changes in talking heads. The driving audio is mapped into the motion space, around which various motion features can be sampled within a reasonable range to achieve the one-to-many map**. Besides, the landmark constraint and time window feature input improve the accurate expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural realistic head motions under reasonable one-to-many map** paradigm compared with other methods. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: Paper Under Review

arXiv:2309.09501 [pdf, other]

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu

Abstract: Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames sim… ▽ More Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding object-relevant information among multiple frames with the bridge of audio features. Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting. △ Less

Submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted by IJCAI 2023

arXiv:2308.16635 [pdf, other]

MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

Authors: ** Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: Face-to-face communication is a common scenario including roles of speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and… ▽ More Face-to-face communication is a common scenario including roles of speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker with attitude or viewpoint expressing while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the \textbf{M}ulti-\textbf{F}aceted \textbf{R}esponsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising diffusion model to predict diverse head pose and expression features. In order to perform multi-faceted response to the speaker video, while maintaining accurate listener identity preservation, we design the Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer finetuned with identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net not only achieves multi-faceted responses in diversity and speaker identity information but also in attitude and viewpoint expression. △ Less

Submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted by ACM MM 2023

arXiv:2307.14491 [pdf, other]

A Unified Framework for Modality-Agnostic Deepfakes Detection

Authors: Cai Yu, Peng Chen, Jiahe Tian, ** Liu, Jiao Dai, Xi Wang, Yesheng Chai, Shan Jia, Siwei Lyu, Jizhong Han

Abstract: As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence betw… ▽ More As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and require the co-occurrence of both modalities. However, in real-world multi-modal applications, missing modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector can not always obtain the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we introduce a comprehensive framework that is agnostic to fake modalities, which facilitates the identification of multimodal deepfakes and handles situations with missing modalities, regardless of the manipulations embedded in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as a preliminary task. This efficiently extracts speech correlations across modalities, a feature challenging for deepfakes to replicate. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments on three audio-visual datasets show that our scheme outperforms state-of-the-art detection methods with promising performance on modality-agnostic audio/video deepfakes. △ Less

Submitted 24 October, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2303.17789 [pdf, other]

FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions

Authors: ** Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation has received growing attention in recent years, with various creative and practical applications. An ideal natural and vivid generated talking head video should contain natural head pose changes. However, it is challenging to map head pose sequences from driving audio since there exists a natural gap between audio-visual modalities. In this work, we propose a Flow-… ▽ More One-shot talking head generation has received growing attention in recent years, with various creative and practical applications. An ideal natural and vivid generated talking head video should contain natural head pose changes. However, it is challenging to map head pose sequences from driving audio since there exists a natural gap between audio-visual modalities. In this work, we propose a Flow-guided One-shot model that achieves NaTural head motions(FONT) over generated talking heads. Specifically, the head pose prediction module is designed to generate head pose sequences from the source face and driving audio. We add the random sampling operation and the structural similarity constraint to model the diversity in the one-to-many map** between audio-visual modality, thus predicting natural head poses. Then we develop a keypoint predictor that produces unsupervised keypoints from the source face, driving audio and pose sequences to describe the facial structure information. Finally, a flow-guided occlusion-aware generator is employed to produce photo-realistic talking head videos from the estimated keypoints and source face. Extensive experimental results prove that FONT generates talking heads with natural head poses and synchronized mouth shapes, outperforming other compared methods. △ Less

Submitted 30 March, 2023; originally announced March 2023.

Comments: Accepted by ICME2023

arXiv:2303.06868 [pdf, other]

Deep Learning-based Eye-Tracking Analysis for Diagnosis of Alzheimer's Disease Using 3D Comprehensive Visual Stimuli

Authors: Fangyu Zuo, Peiguang **g, **glin Sun, Jizhong, Duan, Yong Ji, Yu Liu

Abstract: Alzheimer's Disease (AD) causes a continuous decline in memory, thinking, and judgment. Traditional diagnoses are usually based on clinical experience, which is limited by some realistic factors. In this paper, we focus on exploiting deep learning techniques to diagnose AD based on eye-tracking behaviors. Visual attention, as typical eye-tracking behavior, is of great clinical value to detect cogn… ▽ More Alzheimer's Disease (AD) causes a continuous decline in memory, thinking, and judgment. Traditional diagnoses are usually based on clinical experience, which is limited by some realistic factors. In this paper, we focus on exploiting deep learning techniques to diagnose AD based on eye-tracking behaviors. Visual attention, as typical eye-tracking behavior, is of great clinical value to detect cognitive abnormalities in AD patients. To better analyze the differences in visual attention between AD patients and normals, we first conduct a 3D comprehensive visual task on a non-invasive eye-tracking system to collect visual attention heatmaps. We then propose a multi-layered comparison convolution neural network (MC-CNN) to distinguish the visual attention differences between AD patients and normals. In MC-CNN, the multi-layered representations of heatmaps are obtained by hierarchical convolution to better encode eye-movement behaviors, which are further integrated into a distance vector to benefit the comprehensive visual task. Extensive experimental results on the collected dataset demonstrate that MC-CNN achieves consistent validity in classifying AD patients and normals with eye-tracking data. △ Less

Submitted 13 March, 2023; originally announced March 2023.

arXiv:2302.08197 [pdf, other]

OPT: One-shot Pose-Controllable Talking Head Generation

Authors: ** Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation produces lip-sync talking heads based on arbitrary audio and one source face. To guarantee the naturalness and realness, recent methods propose to achieve free pose control instead of simply editing mouth areas. However, existing methods do not preserve accurate identity of source face when generating head motions. To solve the identity mismatch problem and achieve… ▽ More One-shot talking head generation produces lip-sync talking heads based on arbitrary audio and one source face. To guarantee the naturalness and realness, recent methods propose to achieve free pose control instead of simply editing mouth areas. However, existing methods do not preserve accurate identity of source face when generating head motions. To solve the identity mismatch problem and achieve high-quality free pose control, we present One-shot Pose-controllable Talking head generation network (OPT). Specifically, the Audio Feature Disentanglement Module separates content features from audios, eliminating the influence of speaker-specific information contained in arbitrary driving audios. Later, the mouth expression feature is extracted from the content feature and source face, during which the landmark loss is designed to enhance the accuracy of facial structure and identity preserving quality. Finally, to achieve free pose control, controllable head pose features from reference videos are fed into the Video Generator along with the expression feature and source face to generate new talking heads. Extensive quantitative and qualitative experimental results verify that OPT generates high-quality pose-controllable talking heads with no identity mismatch problem, outperforming previous SOTA methods. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted by ICASSP2023

arXiv:2301.02371 [pdf, other]

Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection

Authors: Shaofei Huang, Zhenwei Shen, Zehao Huang, Zi-han Ding, Jiao Dai, Jizhong Han, Naiyan Wang, Si Liu

Abstract: Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution is to first transform the front-viewed (FV) images or features into the bird-eye-view (BEV) space with inverse perspective map** (IPM) and detect lanes from BEV features. However, the reliance of IPM on flat ground assumption and loss of context information make it inaccurate to restore 3D… ▽ More Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution is to first transform the front-viewed (FV) images or features into the bird-eye-view (BEV) space with inverse perspective map** (IPM) and detect lanes from BEV features. However, the reliance of IPM on flat ground assumption and loss of context information make it inaccurate to restore 3D information from BEV representations. An attempt has been made to get rid of BEV and predict 3D lanes from FV representations directly, while it still underperforms other BEV-based methods given its lack of structured representation for 3D lanes. In this paper, we define 3D lane anchors in the 3D space and propose a BEV-free method named Anchor3DLane to predict 3D lanes directly from FV representations. 3D lane anchors are projected to the FV features to extract their features which contain both good structural and context information to make accurate predictions. In addition, we also develop a global optimization method that makes use of the equal-width property between lanes to reduce the lateral error of predictions. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performances. The code is available at: https://github.com/tusen-ai/Anchor3DLane. △ Less

Submitted 28 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: Accepted by CVPR 2023

arXiv:2210.06881 [pdf, other]

RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Authors: Xing Wu, Chaochen Gao, Zijia Lin, Zhongyuan Wang, Jizhong Han, Songlin Hu

Abstract: Video language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also… ▽ More Video language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also likely to miss important frames corresponding to some text portions, resulting in textual redundancy. Inter-modal redundancy leads to a mismatch of video and text information, hindering the model from better learning the shared semantics across modalities. To alleviate it, we propose Redundancy-aware Video-language Pre-training. We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dis-similarity. Then, we penalize the highredundant video patches and text tokens through a proposed redundancy-aware contrastive learning. We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC, achieving a significant improvement over the previous stateof-the-art results. Our code are available at https://github.com/caskcsg/VLP/tree/main/RaP. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2210.06432 [pdf, other]

InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings

Authors: Xing Wu, Chaochen Gao, Zijia Lin, Jizhong Han, Zhongyuan Wang, Songlin Hu

Abstract: Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence are closer. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper proposes an information-aggregated contrastive learning… ▽ More Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence are closer. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. InfoCSE forces the representation of [CLS] positions to aggregate denser sentence information by introducing an additional Masked language model task and a well-designed network. We evaluate the proposed InfoCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large, achieving state-of-the-art results among unsupervised sentence representation learning methods. Our code are available at https://github.com/caskcsg/sentemb/tree/main/InfoCSE. △ Less

Submitted 13 October, 2022; v1 submitted 8 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2208.06305 [pdf]

doi 10.12783/shm2021/36339

Robotic Inspection and Characterization of Subsurface Defects on Concrete Structures Using Impact Sounding

Authors: Ejup Hoxha, **glun Feng, Diar Sanakov, Ardian G**ofci, Jizhong Xiao

Abstract: Impact-sounding (IS) and impact-echo (IE) are well-developed non-destructive evaluation (NDE) methods that are widely used for inspections of concrete structures to ensure the safety and sustainability. However, it is a tedious work to collect IS and IE data along grid lines covering a large target area for characterization of subsurface defects. On the other hand, data processing is very complica… ▽ More Impact-sounding (IS) and impact-echo (IE) are well-developed non-destructive evaluation (NDE) methods that are widely used for inspections of concrete structures to ensure the safety and sustainability. However, it is a tedious work to collect IS and IE data along grid lines covering a large target area for characterization of subsurface defects. On the other hand, data processing is very complicated that requires domain experts to interpret the results. To address the above problems, we present a novel robotic inspection system named as Impact-Rover to automate the data collection process and introduce data analytics software to visualize the inspection result allowing regular non-professional people to understand. The system consists of three modules: 1) a robotic platform with vertical mobility to collect IS and IE data in hard-to-reach locations, 2) vision-based positioning module that fuses the RGB-D camera, IMU and wheel encoder to estimate the 6-DOF pose of the robot, 3) a data analytics software module for processing the IS data to generate defect maps. The Impact-Rover hosts both IE and IS devices on a sliding mechanism and can perform move-stop-sample operations to collect multiple IS and IE data at adjustable spacing. The robot takes samples much faster than the manual data collection method because it automatically takes the multiple measurements along a straight line and records the locations. This paper focuses on reporting experimental results on IS. We calculate features and use unsupervised learning methods for analyzing the data. By combining the pose generated by our vision-based localization module and the position of the head of the sliding mechanism we can generate maps of possible defects. The results on concrete slabs demonstrate that our impact-sounding system can effectively reveal shallow defects. △ Less

Submitted 12 August, 2022; originally announced August 2022.

Journal ref: Structural Health Monitorign 2021

arXiv:2206.12628 [pdf, other]

FreSCo: Frequency-Domain Scan Context for LiDAR-based Place Recognition with Translation and Rotation Invariance

Authors: Yongzhi Fan, Xin Du, Lun Luo, Jizhong Shen

Abstract: Place recognition plays a crucial role in re-localization and loop closure detection tasks for robots and vehicles. This paper seeks a well-defined global descriptor for LiDAR-based place recognition. Compared to local descriptors, global descriptors show remarkable performance in urban road scenes but are usually viewpoint-dependent. To this end, we propose a simple yet robust global descriptor d… ▽ More Place recognition plays a crucial role in re-localization and loop closure detection tasks for robots and vehicles. This paper seeks a well-defined global descriptor for LiDAR-based place recognition. Compared to local descriptors, global descriptors show remarkable performance in urban road scenes but are usually viewpoint-dependent. To this end, we propose a simple yet robust global descriptor dubbed FreSCo that decomposes the viewpoint difference during revisit and achieves both translation and rotation invariance by leveraging Fourier Transform and circular shift technique. Besides, a fast two-stage pose estimation method is proposed to estimate the relative pose after place retrieval by utilizing the compact 2D point clouds extracted from the original data. Experiments show that FreSCo exhibited superior performance than contemporaneous methods on sequences of different scenes from multiple datasets. Code will be publicly available at https://github.com/soytony/FreSCo. △ Less

Submitted 27 September, 2022; v1 submitted 25 June, 2022; originally announced June 2022.

Comments: 8 pages, 10 figures. Accepted for ICARCV 2022

arXiv:2206.03789 [pdf, other]

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Authors: Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, Si Liu

Abstract: Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to delayed and implicit spatial-temporal inte… ▽ More Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to delayed and implicit spatial-temporal interaction occurring in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, referring words and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we also propose a Bilateral Channel Activation (BCA) module in the decoding phase for further denoising and highlighting the spatial-temporal consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performances on four popular benchmarks with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming around 7x less computational overhead. △ Less

Submitted 8 June, 2022; originally announced June 2022.

Comments: Accepted by CVPR 2022

arXiv:2201.10792 [pdf, other]

On the Effectiveness of Pinyin-Character Dual-Decoding for End-to-End Mandarin Chinese ASR

Authors: Zhao Yang, Dianwen Ng, Xiao Fu, Li** Han, Wei Xi, Rui Wang, Rui Jiang, Jizhong Zhao

Abstract: End-to-end automatic speech recognition (ASR) has achieved promising results. However, most existing end-to-end ASR methods neglect the use of specific language characteristics. For Mandarin Chinese ASR tasks, there exist mutual promotion relationship between Pinyin and Character where Chinese characters can be romanized by Pinyin. Based on the above intuition, we first investigate types of end-to… ▽ More End-to-end automatic speech recognition (ASR) has achieved promising results. However, most existing end-to-end ASR methods neglect the use of specific language characteristics. For Mandarin Chinese ASR tasks, there exist mutual promotion relationship between Pinyin and Character where Chinese characters can be romanized by Pinyin. Based on the above intuition, we first investigate types of end-to-end encoder-decoder based models in the single-input dual-output (SIDO) multi-task framework, after which a novel asynchronous decoding with fuzzy Pinyin sampling method is proposed according to the one-to-one correspondence characteristics between Pinyin and Character. Furthermore, we proposed a two-stage training strategy to make training more stable and converge faster. The results on the test sets of AISHELL-1 dataset show that the proposed enhanced dual-decoder model without a language model is improved by a big margin compared to strong baseline models. △ Less

Submitted 30 March, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

Comments: submitted to INTERSPEECH 2022

arXiv:2110.13125 [pdf]

Automatic Impact-sounding Acoustic Inspection of Concrete Structure

Authors: **glun Feng, Hua Xiao, Ejup Hoxha, Yifeng Song, Liang Yang, Jizhong Xiao

Abstract: Impact sounding signal has been shown to contain information about structural integrity flaws and subsurface objects from previous research. As non-destructive testing (NDT) method, one of the biggest challenges in impact sounding based inspection is the subsurface targets detection and reconstruction. This paper presents the importance and practicability of using solenoids to trigger impact sound… ▽ More Impact sounding signal has been shown to contain information about structural integrity flaws and subsurface objects from previous research. As non-destructive testing (NDT) method, one of the biggest challenges in impact sounding based inspection is the subsurface targets detection and reconstruction. This paper presents the importance and practicability of using solenoids to trigger impact sounding signal and using acoustic data to reconstruct subsurface objects to address this issue. First, by taking advantage of Visual Simultaneous Localization and Map** (V-SLAM), we could obtain the 3D position of the robot during the inspection. Second, our NDE method is based on Frequency Density (FD) analysis for the Fast Fourier Transform (FFT) of the impact sounding signal. At last, by combining the 3D position data and acoustic data, this paper creates a 3D map to highlight the possible subsurface objects. The experimental results demonstrate the feasibility of the method. △ Less

Submitted 25 October, 2021; originally announced October 2021.

Journal ref: 10th International Conference on Structural Health Monitoring of Intelligent Infrastructure, SHMII 10, 2021

arXiv:2109.04380 [pdf, other]

ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding

Authors: Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang, Songlin Hu

Abstract: Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as a minimal data augmentation method, and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embe… ▽ More Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as a minimal data augmentation method, and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embeddings to build a positive pair. As the length information of a sentence will generally be encoded into the sentence embeddings due to the usage of position embedding in Transformer, each positive pair in unsup-SimCSE actually contains the same length information. And thus unsup-SimCSE trained with these positive pairs is probably biased, which would tend to consider that sentences of the same or similar length are more similar in semantics. Through statistical observations, we find that unsup-SimCSE does have such a problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, we draw inspiration from the community of computer vision and introduce a momentum contrast, enlarging the number of negative pairs without additional calculations. The proposed two modifications are applied on positive and negative pairs separately, and build a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base. △ Less

Submitted 11 September, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: COLING 2022

arXiv:2109.04321 [pdf, other]

Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Authors: Xing Wu, Chaochen Gao, Yipeng Su, Jizhong Han, Zhongyuan Wang, Songlin Hu

Abstract: Contrastive learning has been gradually applied to learn high-quality unsupervised sentence embedding. Among the previous un-supervised methods, the latest state-of-the-art method, as far as we know, is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses the InfoNCE1loss function in the training stage by pulling semantically similar sentences together and pushing apart dis-similar ones.Theoretic… ▽ More Contrastive learning has been gradually applied to learn high-quality unsupervised sentence embedding. Among the previous un-supervised methods, the latest state-of-the-art method, as far as we know, is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses the InfoNCE1loss function in the training stage by pulling semantically similar sentences together and pushing apart dis-similar ones.Theoretically, we expect to use larger batches in unsup-SimCSE to get more adequate comparisons among samples and avoid overfitting. However, increasing the batch size does not always lead to improvements, but instead even lead to performance degradation when the batch size exceeds a threshold. Through statistical observation, we find that this is probably due to the introduction of low-confidence negative pairs after in-creasing the batch size. To alleviate this problem, we introduce a simple smoothing strategy upon the InfoNCE loss function, termedGaussian Smoothing InfoNCE (GS-InfoNCE).Specifically, we add random Gaussian noise vectors as negative samples, which act asa smoothing of the negative sample space.Though being simple, the proposed smooth-ing strategy brings substantial improvements to unsup-SimCSE. We evaluate GS-InfoNCEon the standard semantic text similarity (STS)task. GS-InfoNCE outperforms the state-of-the-art unsup-SimCSE by an average Spear-man correlation of 1.38%, 0.72%, 1.17% and0.28% on the base of BERT-base, BERT-large,RoBERTa-base and RoBERTa-large, respectively. △ Less

Submitted 11 September, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: COLING 2022

arXiv:2106.01907 [pdf, other]

Robotic Inspection of Underground Utilities for Construction Survey Using a Ground Penetrating Radar

Authors: **glun Feng, Liang Yang, Ejup Hoxha, Jiang Biao, Jizhong Xiao

Abstract: Ground Penetrating Radar (GPR) is a very useful non-destructive evaluation (NDE) device for locating and map** underground assets prior to digging and trenching efforts in construction. This paper presents a novel robotic system to automate the GPR data collection process, localize the underground utilities, interpret and reconstruct the underground objects for better visualization allowing regu… ▽ More Ground Penetrating Radar (GPR) is a very useful non-destructive evaluation (NDE) device for locating and map** underground assets prior to digging and trenching efforts in construction. This paper presents a novel robotic system to automate the GPR data collection process, localize the underground utilities, interpret and reconstruct the underground objects for better visualization allowing regular non-professional users to understand the survey results. This system is composed of three modules: 1) an Omni-directional robotic data collection platform, that carries an RGB-D camera with an Inertial Measurement Unit (IMU) and a GPR antenna to perform automatic GPR data collection, and tag each GPR measurement with visual positioning information at every sampling step; 2) a learning-based migration module to interpret the raw GPR B-scan image into a 2D cross-section model of objects; 3) a 3D reconstruction module, i.e., GPRNet, to generate underground utility model represented as fine 3D point cloud. Comparative studies are performed on synthetic data and field GPR raw data with various incompleteness and noise. Experimental results demonstrate that our proposed method achieves a $30.0\%$ higher GPR imaging accuracy in mean Intersection Over Union (IoU) than the conventional back projection (BP) migration approach and $6.9\%$-$7.2\%$ less loss in Chamfer Distance (CD) than baseline methods regarding point cloud model reconstruction. The GPR-based robotic inspection provides an effective tool for civil engineers to detect and survey underground utilities before construction. △ Less

Submitted 19 April, 2022; v1 submitted 3 June, 2021; originally announced June 2021.

arXiv:2105.06818 [pdf, other]

Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

Authors: Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, Fei Wang

Abstract: Language-queried video actor segmentation aims to predict the pixel-level mask of the actor which performs the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are amenable to recognizing which actor is performing the que… ▽ More Language-queried video actor segmentation aims to predict the pixel-level mask of the actor which performs the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, it also inevitably introduces misaligned spatial information from adjacent frames, which confuses features of the target frame and yields inaccurate segmentation. Therefore, we propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors. In the decoder, a Language-Guided Feature Selection (LGFS) module is proposed to flexibly integrate spatial and temporal features from the two encoders. We also propose a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine spatial- and temporal-relevant linguistic features for multimodal feature interaction in each stage of the two encoders. Our method achieves new state-of-the-art performance on two popular benchmarks with less computational overhead than previous approaches. △ Less

Submitted 14 May, 2021; originally announced May 2021.

Comments: Accepted by CVPR 2021

arXiv:2104.02850 [pdf, other]

LI-Net: Large-Pose Identity-Preserving Face Reenactment Network

Authors: ** Liu, Peng Chen, Tao Liang, Zhaoxing Li, Cai Yu, Shuqiao Zou, Jiao Dai, Jizhong Han

Abstract: Face reenactment is a challenging task, as it is difficult to maintain accurate expression, pose and identity simultaneously. Most existing methods directly apply driving facial landmarks to reenact source faces and ignore the intrinsic gap between two identities, resulting in the identity mismatch issue. Besides, they neglect the entanglement of expression and pose features when encoding driving… ▽ More Face reenactment is a challenging task, as it is difficult to maintain accurate expression, pose and identity simultaneously. Most existing methods directly apply driving facial landmarks to reenact source faces and ignore the intrinsic gap between two identities, resulting in the identity mismatch issue. Besides, they neglect the entanglement of expression and pose features when encoding driving faces, leading to inaccurate expressions and visual artifacts on large-pose reenacted faces. To address these problems, we propose a Large-pose Identity-preserving face reenactment network, LI-Net. Specifically, the Landmark Transformer is adopted to adjust driving landmark images, which aims to narrow the identity gap between driving and source landmark images. Then the Face Rotation Module and the Expression Enhancing Generator decouple the transformed landmark image into pose and expression features, and reenact those attributes separately to generate identity-preserving faces with accurate expressions and poses. Both qualitative and quantitative experimental results demonstrate the superiority of our method. △ Less

Submitted 6 April, 2021; originally announced April 2021.

Comments: IEEE International Conference on Multimedia and Expo(ICME) 2021 Oral

arXiv:2101.03929 [pdf, other]

doi 10.1109/TIP.2020.3013142

ORDNet: Capturing Omni-Range Dependencies for Scene Parsing

Authors: Shaofei Huang, Si Liu, Tianrui Hui, Jizhong Han, Bo Li, Jiashi Feng, Shuicheng Yan

Abstract: Learning to capture dependencies between spatial positions is essential to many visual tasks, especially the dense labeling problems like scene parsing. Existing methods can effectively capture long-range dependencies with self-attention mechanism while short ones by local convolution. However, there is still much gap between long-range and short-range dependencies, which largely reduces the model… ▽ More Learning to capture dependencies between spatial positions is essential to many visual tasks, especially the dense labeling problems like scene parsing. Existing methods can effectively capture long-range dependencies with self-attention mechanism while short ones by local convolution. However, there is still much gap between long-range and short-range dependencies, which largely reduces the models' flexibility in application to diverse spatial scales and relationships in complicated natural scene images. To fill such a gap, we develop a Middle-Range (MR) branch to capture middle-range dependencies by restricting self-attention into local patches. Also, we observe that the spatial regions which have large correlations with others can be emphasized to exploit long-range dependencies more accurately, and thus propose a Reweighed Long-Range (RLR) branch. Based on the proposed MR and RLR branches, we build an Omni-Range Dependencies Network (ORDNet) which can effectively capture short-, middle- and long-range dependencies. Our ORDNet is able to extract more comprehensive context information and well adapt to complex spatial variance in scene images. Extensive experiments show that our proposed ORDNet outperforms previous state-of-the-art methods on three scene parsing benchmarks including PASCAL Context, COCO Stuff and ADE20K, demonstrating the superiority of capturing omni-range dependencies in deep models for scene parsing task. △ Less

Submitted 11 January, 2021; originally announced January 2021.

Comments: Published at TIP

Journal ref: IEEE Transactions on Image Processing, 2020, 29: 8251-8263

arXiv:2012.04233 [pdf, other]

Early Detection of Fake News by Utilizing the Credibility of News, Publishers, and Users Based on Weakly Supervised Learning

Authors: Chunyuan Yuan, Qianwen Ma, Wei Zhou, Jizhong Han, Songlin Hu

Abstract: The dissemination of fake news significantly affects personal reputation and public trust. Recently, fake news detection has attracted tremendous attention, and previous studies mainly focused on finding clues from news content or diffusion path. However, the required features of previous models are often unavailable or insufficient in early detection scenarios, resulting in poor performance. Thus… ▽ More The dissemination of fake news significantly affects personal reputation and public trust. Recently, fake news detection has attracted tremendous attention, and previous studies mainly focused on finding clues from news content or diffusion path. However, the required features of previous models are often unavailable or insufficient in early detection scenarios, resulting in poor performance. Thus, early fake news detection remains a tough challenge. Intuitively, the news from trusted and authoritative sources or shared by many users with a good reputation is more reliable than other news. Using the credibility of publishers and users as prior weakly supervised information, we can quickly locate fake news in massive news and detect them in the early stages of dissemination. In this paper, we propose a novel Structure-aware Multi-head Attention Network (SMAN), which combines the news content, publishing, and reposting relations of publishers and users, to jointly optimize the fake news detection and credibility prediction tasks. In this way, we can explicitly exploit the credibility of publishers and users for early fake news detection. We conducted experiments on three real-world datasets, and the results show that SMAN can detect fake news in 4 hours with an accuracy of over 91%, which is much faster than the state-of-the-art models. △ Less

Submitted 13 December, 2020; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: Accepted as a long paper at COLING 2020

arXiv:2011.02635 [pdf, other]

GPR-based Model Reconstruction System for Underground Utilities Using GPRNet

Authors: **glun Feng, Liang Yang, Ejup Hoxha, Diar Sanakov, Stanislav Sotnikov, Jizhong Xiao

Abstract: Ground Penetrating Radar (GPR) is one of the most important non-destructive evaluation (NDE) instruments to detect and locate underground objects (i.e., rebars, utility pipes). Many previous researches focus on GPR image-based feature detection only, and none can process sparse GPR measurements to successfully reconstruct a very fine and detailed 3D model of underground objects for better visualiz… ▽ More Ground Penetrating Radar (GPR) is one of the most important non-destructive evaluation (NDE) instruments to detect and locate underground objects (i.e., rebars, utility pipes). Many previous researches focus on GPR image-based feature detection only, and none can process sparse GPR measurements to successfully reconstruct a very fine and detailed 3D model of underground objects for better visualization. To address this problem, this paper presents a novel robotic system to collect GPR data, localize the underground utilities, and reconstruct the underground objects' dense point cloud model. This system is composed of three modules: 1) visual-inertial-based GPR data collection module, which tags the GPR measurements with positioning information provided by an omnidirectional robot; 2) a deep neural network (DNN) migration module to interpret the raw GPR B-scan image into a cross-section of object model; 3) a DNN-based 3D reconstruction module, i.e., GPRNet, to generate underground utility model with the fine 3D point cloud. In this paper, both the quantitative and qualitative experiment results verify our method that can generate a dense and complete point cloud model of pipe-shaped utilities based on a sparse input, i.e., GPR raw data incompleteness and various noise. The experiment results on synthetic data and field test data further support the effectiveness of our approach. △ Less

Submitted 18 May, 2021; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Accepted by ICRA 2021

arXiv:2010.00515 [pdf, other]

Linguistic Structure Guided Context Modeling for Referring Image Segmentation

Authors: Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, Jizhong Han

Abstract: Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence. Multimodal context of the sentence is crucial to distinguish the referent from the background. Existing methods either insufficiently or redundantly model the multimodal context. To tackle this problem, we propose a "gather-propagate-distribute" scheme to model multimodal context… ▽ More Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence. Multimodal context of the sentence is crucial to distinguish the referent from the background. Existing methods either insufficiently or redundantly model the multimodal context. To tackle this problem, we propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence while excluding disturbing ones through three steps over the multimodal feature, i.e., gathering, constrained propagation and distributing. Extensive experiments on four benchmarks demonstrate that our method outperforms all the previous state-of-the-arts. △ Less

Submitted 5 October, 2020; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: Accepted by ECCV 2020. Code is available at https://github.com/spyflying/LSCM-Refseg

arXiv:2010.00514 [pdf, other]

Referring Image Segmentation via Cross-Modal Progressive Comprehension

Authors: Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li

Abstract: Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to well align features from the two modalitie… ▽ More Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to well align features from the two modalities for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones by multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information. In this way, features from multi-levels could communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performances. △ Less

Submitted 1 October, 2020; originally announced October 2020.

Comments: Accepted by CVPR 2020. Code is available at https://github.com/spyflying/CMPC-Refseg

arXiv:2007.08445 [pdf, other]

Hierarchical Interaction Networks with Rethinking Mechanism for Document-level Sentiment Analysis

Authors: Lingwei Wei, Dou Hu, Wei Zhou, Xuehai Tang, Xiaodan Zhang, Xin Wang, Jizhong Han, Songlin Hu

Abstract: Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicate sentiment information. Recent works have been devoted to leveraging text summarization and have achieved promising results. However, these summarization-based methods did not take full advantage of the summary including ignoring the inherent interactions between the summary and document. As a res… ▽ More Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicate sentiment information. Recent works have been devoted to leveraging text summarization and have achieved promising results. However, these summarization-based methods did not take full advantage of the summary including ignoring the inherent interactions between the summary and document. As a result, they limited the representation to express major points in the document, which is highly indicative of the key sentiment. In this paper, we study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA. A Hierarchical Interaction Networks (HIN) is proposed to explore bidirectional interactions between the summary and document at multiple granularities and learn subject-oriented document representations for sentiment classification. Furthermore, we design a Sentiment-based Rethinking mechanism (SR) by refining the HIN with sentiment label information to learn a more sentiment-aware document representation. We extensively evaluate our proposed models on three public datasets. The experimental results consistently demonstrate the effectiveness of our proposed models and show that HIN-SR outperforms various state-of-the-art methods. △ Less

Submitted 7 September, 2022; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: 17 pages, accepted by ECML-PKDD 2020

arXiv:2004.12498 [pdf, other]

Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes

Authors: Haiyan Wang, Xuejian Rong, Liang Yang, **glun Feng, Jizhong Xiao, Yingli Tian

Abstract: The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation, especially for scenes in the wild with varieties of different objects. To alleviate this issue, we propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds with sole 2D supervision. Different with numerous preceding multi-v… ▽ More The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation, especially for scenes in the wild with varieties of different objects. To alleviate this issue, we propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds with sole 2D supervision. Different with numerous preceding multi-view supervised approaches focusing on single object point clouds, we argue that 2D supervision is capable of providing sufficient guidance information for training 3D semantic segmentation models of natural scene point clouds while not explicitly capturing their inherent structures, even with only single view per training sample. Specifically, a Graph-based Pyramid Feature Network (GPFN) is designed to implicitly infer both global and local features of point sets and an Observability Network (OBSNet) is introduced to further solve object occlusion problem caused by complicated spatial relations of objects in 3D scenes. During the projection process, perspective rendering and semantic fusion modules are proposed to provide refined 2D supervision signals for training along with a 2D-3D joint optimization strategy. Extensive experimental results demonstrate the effectiveness of our 2D supervised framework, which achieves comparable results with the state-of-the-art approaches trained with full 3D labels, for semantic point cloud segmentation on the popular SUNCG synthetic dataset and S3DIS real-world dataset. △ Less

Submitted 17 May, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: 13 pages, 8 figures, Under review as a journal paper at CVIU

arXiv:2004.02234 [pdf, other]

Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Faces

Authors: Wei **g, Feng Tian, Jizhong Zhang, Kuo-Ming Chao, Zhenxin Hong, Xu Liu

Abstract: Facial Expressions Recognition(FER) on low-resolution images is necessary for applications like group expression recognition in crowd scenarios(station, classroom etc.). Classifying a small size facial image into the right expression category is still a challenging task. The main cause of this problem is the loss of discriminative feature due to reduced resolution. Super-resolution method is often… ▽ More Facial Expressions Recognition(FER) on low-resolution images is necessary for applications like group expression recognition in crowd scenarios(station, classroom etc.). Classifying a small size facial image into the right expression category is still a challenging task. The main cause of this problem is the loss of discriminative feature due to reduced resolution. Super-resolution method is often used to enhance low-resolution images, but the performance on FER task is limited when on images of very low resolution. In this work, inspired by feature super-resolution methods for object detection, we proposed a novel generative adversary network-based feature level super-resolution method for robust facial expression recognition(FSR-FER). In particular, a pre-trained FER model was employed as feature extractor, and a generator network G and a discriminator network D are trained with features extracted from images of low resolution and original high resolution. Generator network G tries to transform features of low-resolution images to more discriminative ones by making them closer to the ones of corresponding high-resolution images. For better classification performance, we also proposed an effective classification-aware loss re-weighting strategy based on the classification probability calculated by a fixed FER model to make our model focus more on samples that are easily misclassified. Experiment results on Real-World Affective Faces (RAF) Database demonstrate that our method achieves satisfying results on various down-sample factors with a single model and has better performance on low-resolution images compared with methods using image super-resolution and expression recognition separately. △ Less

Submitted 5 April, 2020; originally announced April 2020.

Comments: 13 pages, 5 figures

arXiv:2002.09634 [pdf, other]

Data Augmentation for Copy-Mechanism in Dialogue State Tracking

Authors: Xiaohui Song, Liangjun Zang, Yipeng Su, Xing Wu, Jizhong Han, Songlin Hu

Abstract: While several state-of-the-art approaches to dialogue state tracking (DST) have shown promising performances on several benchmarks, there is still a significant performance gap between seen slot values (i.e., values that occur in both training set and test set) and unseen ones (values that occur in training set but not in test set). Recently, the copy-mechanism has been widely used in DST models t… ▽ More While several state-of-the-art approaches to dialogue state tracking (DST) have shown promising performances on several benchmarks, there is still a significant performance gap between seen slot values (i.e., values that occur in both training set and test set) and unseen ones (values that occur in training set but not in test set). Recently, the copy-mechanism has been widely used in DST models to handle unseen slot values, which copies slot values from user utterance directly. In this paper, we aim to find out the factors that influence the generalization ability of a common copy-mechanism model for DST. Our key observations include: 1) the copy-mechanism tends to memorize values rather than infer them from contexts, which is the primary reason for unsatisfactory generalization performance; 2) greater diversity of slot values in the training set increase the performance on unseen values but slightly decrease the performance on seen values. Moreover, we propose a simple but effective algorithm of data augmentation to train copy-mechanism models, which augments the input dataset by copying user utterances and replacing the real slot values with randomly generated strings. Users could use two hyper-parameters to realize a trade-off between the performances on seen values and unseen ones, as well as a trade-off between overall performance and computational cost. Experimental results on three widely used datasets (WoZ 2.0, DSTC2, and Multi-WoZ 2.0) show the effectiveness of our approach. △ Less

Submitted 22 February, 2020; originally announced February 2020.

arXiv:1911.03626 [pdf, other]

Beyond Statistical Relations: Integrating Knowledge Relations into Style Correlations for Multi-Label Music Style Classification

Authors: Qianwen Ma, Chunyuan Yuan, Wei Zhou, Jizhong Han, Songlin Hu

Abstract: Automatically labeling multiple styles for every song is a comprehensive application in all kinds of music websites. Recently, some researches explore review-driven multi-label music style classification and exploit style correlations for this task. However, their methods focus on mining the statistical relations between different music styles and only consider shallow style relations. Moreover, t… ▽ More Automatically labeling multiple styles for every song is a comprehensive application in all kinds of music websites. Recently, some researches explore review-driven multi-label music style classification and exploit style correlations for this task. However, their methods focus on mining the statistical relations between different music styles and only consider shallow style relations. Moreover, these statistical relations suffer from the underfitting problem because some music styles have little training data. To tackle these problems, we propose a novel knowledge relations integrated framework (KRF) to capture the complete style correlations, which jointly exploits the inherent relations between music styles according to external knowledge and their statistical relations. Based on the two types of relations, we use a graph convolutional network to learn the deep correlations between styles automatically. Experimental results show that our framework significantly outperforms state-of-the-art methods. Further studies demonstrate that our framework can effectively alleviate the underfitting problem and learn meaningful style correlations. The source code can be available at https://github.com/Makwen1995/MusicGenre. △ Less

Submitted 11 January, 2021; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: Accepted as WSDM 2020 Regular Paper

arXiv:1909.05364 [pdf, other]

TransSent: Towards Generation of Structured Sentences with Discourse Marker

Authors: Xing Wu, Dongjun Wei, Liangjun Zang, Jizhong Han, Songlin Hu

Abstract: Structured sentences are important expressions in human writings and dialogues. Previous works on neural text generation fused semantic and structural information by encoding the entire sentence into a mixed hidden representation. However, when a generated sentence becomes complicated, the structure is difficult to be properly maintained. To alleviate this problem, we explicitly separate the model… ▽ More Structured sentences are important expressions in human writings and dialogues. Previous works on neural text generation fused semantic and structural information by encoding the entire sentence into a mixed hidden representation. However, when a generated sentence becomes complicated, the structure is difficult to be properly maintained. To alleviate this problem, we explicitly separate the modeling process of semantic and structural information. Intuitively, humans generate structured sentences by directly connecting discourses with discourse markers (such as and, but, etc.). Therefore, we propose a task that mimics this process, called discourse transfer. This task represents a structured sentence as (head discourse, discourse marker, tail discourse), and aims at tail discourse generation based on head discourse and discourse marker. We also propose a corresponding model called TransSent, which interprets the relationship between two discourses as a translation1 from the head discourse to the tail discourse in the embedding space. We experiment TransSent not only in discourse transfer task but also in free text generation and dialogue generation tasks. Automatic and human evaluation results show that TransSent can generate structured sentences with high quality, and has certain scalability in different tasks. △ Less

Submitted 8 May, 2020; v1 submitted 5 September, 2019; originally announced September 2019.

Comments: 5 figures

arXiv:1909.04465 [pdf, other]

Jointly embedding the local and global relations of heterogeneous graph for rumor detection

Authors: Chunyuan Yuan, Qianwen Ma, Wei Zhou, Jizhong Han, Songlin Hu

Abstract: The development of social media has revolutionized the way people communicate, share information and make decisions, but it also provides an ideal platform for publishing and spreading rumors. Existing rumor detection methods focus on finding clues from text content, user profiles, and propagation patterns. However, the local semantic relation and global structural information in the message propa… ▽ More The development of social media has revolutionized the way people communicate, share information and make decisions, but it also provides an ideal platform for publishing and spreading rumors. Existing rumor detection methods focus on finding clues from text content, user profiles, and propagation patterns. However, the local semantic relation and global structural information in the message propagation graph have not been well utilized by previous works. In this paper, we present a novel global-local attention network (GLAN) for rumor detection, which jointly encodes the local semantic and global structural information. We first generate a better integrated representation for each source tweet by fusing the semantic information of related retweets with the attention mechanism. Then, we model the global relationships among all source tweets, retweets, and users as a heterogeneous graph to capture the rich structural information for rumor detection. We conduct experiments on three real-world datasets, and the results demonstrate that GLAN significantly outperforms the state-of-the-art models in both rumor detection and early detection scenarios. △ Less

Submitted 11 September, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: 10 pages, Accepted to the IEEE International Conference on Data Mining 2019

arXiv:1909.04455 [pdf, other]

Learning review representations from user and product level information for spam detection

Authors: Chunyuan Yuan, Wei Zhou, Qianwen Ma, Shangwen Lv, Jizhong Han, Songlin Hu

Abstract: Opinion spam has become a widespread problem in social media, where hired spammers write deceptive reviews to promote or demote products to mislead the consumers for profit or fame. Existing works mainly focus on manually designing discrete textual or behavior features, which cannot capture complex semantics of reviews. Although recent works apply deep learning methods to learn review-level semant… ▽ More Opinion spam has become a widespread problem in social media, where hired spammers write deceptive reviews to promote or demote products to mislead the consumers for profit or fame. Existing works mainly focus on manually designing discrete textual or behavior features, which cannot capture complex semantics of reviews. Although recent works apply deep learning methods to learn review-level semantic features, their models ignore the impact of the user-level and product-level information on learning review semantics and the inherent user-review-product relationship information. In this paper, we propose a Hierarchical Fusion Attention Network (HFAN) to automatically learn the semantics of reviews from the user and product level. Specifically, we design a multi-attention unit to extract user(product)-related review information. Then, we use orthogonal decomposition and fusion attention to learn a user, review, and product representation from the review information. Finally, we take the review as a relation between user and product entity and apply TransH to jointly encode this relationship into review representation. Experimental results obtained more than 10\% absolute precision improvement over the state-of-the-art performances on four real-world datasets, which show the effectiveness and versatility of the model. △ Less

Submitted 10 September, 2019; originally announced September 2019.

Comments: 6 pages. Accepted as IEEE ICDM 2019, Short Paper

arXiv:1908.08039 [pdf, other]

"Mask and Infill" : Applying Masked Language Model to Sentiment Transfer

Authors: Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, Songlin Hu

Abstract: This paper focuses on the task of sentiment transfer on non-parallel text, which modifies sentiment attributes (e.g., positive or negative) of sentences while preserving their attribute-independent content. Due to the limited capability of RNNbased encoder-decoder structure to capture deep and long-range dependencies among words, previous works can hardly generate satisfactory sentences from scrat… ▽ More This paper focuses on the task of sentiment transfer on non-parallel text, which modifies sentiment attributes (e.g., positive or negative) of sentences while preserving their attribute-independent content. Due to the limited capability of RNNbased encoder-decoder structure to capture deep and long-range dependencies among words, previous works can hardly generate satisfactory sentences from scratch. When humans convert the sentiment attribute of a sentence, a simple but effective approach is to only replace the original sentimental tokens in the sentence with target sentimental expressions, instead of building a new sentence from scratch. Such a process is very similar to the task of Text Infilling or Cloze, which could be handled by a deep bidirectional Masked Language Model (e.g. BERT). So we propose a two step approach "Mask and Infill". In the mask step, we separate style from content by masking the positions of sentimental tokens. In the infill step, we retrofit MLM to Attribute Conditional MLM, to infill the masked positions by predicting words or phrases conditioned on the context1 and target sentiment. We evaluate our model on two review datasets with quantitative, qualitative, and human evaluations. Experimental results demonstrate that our models improve state-of-the-art performance. △ Less

Submitted 21 August, 2019; originally announced August 2019.

Comments: IJCAI 2019

arXiv:1905.10625 [pdf, other]

ESA: Entity Summarization with Attention

Authors: Dongjun Wei, Yaxin Liu, Fuqing Zhu, Liangjun Zang, Wei Zhou, Jizhong Han, Songlin Hu

Abstract: Entity summarization aims at creating brief but informative descriptions of entities from knowledge graphs. While previous work mostly focused on traditional techniques such as clustering algorithms and graph models, we ask how to apply deep learning methods into this task. In this paper we propose ESA, a neural network with supervised attention mechanisms for entity summarization. Specifically, w… ▽ More Entity summarization aims at creating brief but informative descriptions of entities from knowledge graphs. While previous work mostly focused on traditional techniques such as clustering algorithms and graph models, we ask how to apply deep learning methods into this task. In this paper we propose ESA, a neural network with supervised attention mechanisms for entity summarization. Specifically, we calculate attention weights for facts in each entity, and rank facts to generate reliable summaries. We explore techniques to solve difficult learning problems presented by the ESA, and demonstrate the effectiveness of our model in comparison with the state-of-the-art methods. Experimental results show that our model improves the quality of the entity summaries in both F-measure and MAP. △ Less

Submitted 25 May, 2020; v1 submitted 25 May, 2019; originally announced May 2019.

Comments: 12pages, accepted in EYRE@CIKM'2019

arXiv:1903.11919 [pdf, other]

Imbalanced Sentiment Classification Enhanced with Discourse Marker

Authors: Tao Zhang, Xing Wu, Meng Lin, Jizhong Han, Songlin Hu

Abstract: Imbalanced data commonly exists in real world, espacially in sentiment-related corpus, making it difficult to train a classifier to distinguish latent sentiment in text data. We observe that humans often express transitional emotion between two adjacent discourses with discourse markers like "but", "though", "while", etc, and the head discourse and the tail discourse 3 usually indicate opposite em… ▽ More Imbalanced data commonly exists in real world, espacially in sentiment-related corpus, making it difficult to train a classifier to distinguish latent sentiment in text data. We observe that humans often express transitional emotion between two adjacent discourses with discourse markers like "but", "though", "while", etc, and the head discourse and the tail discourse 3 usually indicate opposite emotional tendencies. Based on this observation, we propose a novel plug-and-play method, which first samples discourses according to transitional discourse markers and then validates sentimental polarities with the help of a pretrained attention-based model. Our method increases sample diversity in the first place, can serve as a upstream preprocessing part in data augmentation. We conduct experiments on three public sentiment datasets, with several frequently used algorithms. Results show that our method is found to be consistently effective, even in highly imbalanced scenario, and easily be integrated with oversampling method to boost the performance on imbalanced sentiment classification. △ Less

Submitted 28 March, 2019; originally announced March 2019.

Comments: 12 pages, 1 figures

arXiv:1901.06773 [pdf, other]

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Neural Networks

Authors: **rong Guo, Wantao Liu, Wang Wang, Qu Lu, Songlin Hu, Jizhong Han, Ruixuan Li

Abstract: Typically, Ultra-deep neural network(UDNN) tends to yield high-quality model, but its training process is usually resource intensive and time-consuming. Modern GPU's scarce DRAM capacity is the primary bottleneck that hinders the trainability and the training efficiency of UDNN. In this paper, we present "AccUDNN", an accelerator that aims to make the utmost use of finite GPU memory resources to s… ▽ More Typically, Ultra-deep neural network(UDNN) tends to yield high-quality model, but its training process is usually resource intensive and time-consuming. Modern GPU's scarce DRAM capacity is the primary bottleneck that hinders the trainability and the training efficiency of UDNN. In this paper, we present "AccUDNN", an accelerator that aims to make the utmost use of finite GPU memory resources to speed up the training process of UDNN. AccUDNN mainly includes two modules: memory optimizer and hyperparameter tuner. Memory optimizer develops a performance-model guided dynamic swap out/in strategy, by offloading appropriate data to host memory, GPU memory footprint can be significantly slashed to overcome the restriction of trainability of UDNN. After applying the memory optimization strategy, hyperparameter tuner is designed to explore the efficiency-optimal minibatch size and the matched learning rate. Evaluations demonstrate that AccUDNN cuts down the GPU memory requirement of ResNet-152 from more than 24GB to 8GB. In turn, given 12GB GPU memory budget, the efficiency-optimal minibatch size can reach 4.2x larger than original Caffe. Benefiting from better utilization of single GPU's computing resources and fewer parameter synchronization of large minibatch size, 7.7x speed-up is achieved by 8 GPUs' cluster without any communication optimization and no accuracy losses. △ Less

Submitted 20 June, 2019; v1 submitted 20 January, 2019; originally announced January 2019.

Comments: 12 pages,11 figures, 3 tables

arXiv:1812.06705 [pdf, other]

Conditional BERT Contextual Augmentation

Authors: Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, Songlin Hu

Abstract: We propose a novel data augmentation method for labeled sentences called conditional BERT contextual augmentation. Data augmentation methods are often applied to prevent overfitting and improve generalization of deep neural network models. Recently proposed contextual augmentation augments labeled sentences by randomly replacing words with more varied substitutions predicted by language model. BER… ▽ More We propose a novel data augmentation method for labeled sentences called conditional BERT contextual augmentation. Data augmentation methods are often applied to prevent overfitting and improve generalization of deep neural network models. Recently proposed contextual augmentation augments labeled sentences by randomly replacing words with more varied substitutions predicted by language model. BERT demonstrates that a deep bidirectional language model is more powerful than either an unidirectional language model or the shallow concatenation of a forward and backward model. We retrofit BERT to conditional BERT by introducing a new conditional masked language model\footnote{The term "conditional masked language model" appeared once in original BERT paper, which indicates context-conditional, is equivalent to term "masked language model". In our paper, "conditional masked language model" indicates we apply extra label-conditional constraint to the "masked language model".} task. The well trained conditional BERT can be applied to enhance contextual augmentation. Experiments on six various different text classification tasks show that our method can be easily applied to both convolutional or recurrent neural networks classifier to obtain obvious improvement. △ Less

Submitted 17 December, 2018; originally announced December 2018.

Comments: 9 pages, 1 figure

arXiv:1812.00477 [pdf, other]

Ego-Downward and Ambient Video based Person Location Association

Authors: Liang Yang, Hao Jiang, Jizhong Xiao, Zhouyuan Huo

Abstract: Using an ego-centric camera to do localization and tracking is highly needed for urban navigation and indoor assistive system when GPS is not available or not accurate enough. The traditional hand-designed feature tracking and estimation approach would fail without visible features. Recently, there are several works exploring to use context features to do localization. However, all of these suffer… ▽ More Using an ego-centric camera to do localization and tracking is highly needed for urban navigation and indoor assistive system when GPS is not available or not accurate enough. The traditional hand-designed feature tracking and estimation approach would fail without visible features. Recently, there are several works exploring to use context features to do localization. However, all of these suffer severe accuracy loss if given no visual context information. To provide a possible solution to this problem, this paper proposes a camera system with both ego-downward and third-static view to perform localization and tracking in a learning approach. Besides, we also proposed a novel action and motion verification model for cross-view verification and localization. We performed comparative experiments based on our collected dataset which considers the same dressing, gender, and background diversity. Results indicate that the proposed model can achieve $18.32 \%$ improvement in accuracy performance. Eventually, we tested the model on multi-people scenarios and obtained an average $67.767 \%$ accuracy. △ Less

Submitted 2 December, 2018; originally announced December 2018.

Showing 1–50 of 59 results for author: Jizhong