Search | arXiv e-print repository

SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp Segmentation

Authors: Qiang Hu, Zhenyu Yi, Ying Zhou, Fang Peng, Mei Liu, Qiang Li, Zhiwei Wang

Abstract: Colonoscopy videos provide richer information in polyp segmentation for rectal cancer diagnosis. However, the endoscope's fast movement and close-up observation make the current methods suffer from large spatial incoherence and continuous low-quality frames, and thus yield limited segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the adjacent feature consistency and rebuilding the reliable polyp representation. To achieve this goal, we propose the SALI network, a hybrid of Short-term Alignment Module (SAM) and Long-term Interaction Module (LIM). The SAM learns spatial-aligned features of adjacent frames via deformable convolution and further harmonizes them to capture more stable short-term polyp representation. In case of low-quality frames, the LIM stores the historical polyp representations as a long-term memory bank, and explores the retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combining SAM and LIM, the SALI network of video segmentation shows a great robustness to the spatial variations and low-visual cues. Benchmark on the large-scale SUNSEG verifies the superiority of SALI over the current state-of-the-art methods by improving Dice by 2.1%, 2.5%, 4.1% and 1.9%, for the four test sub-sets, respectively. Codes are at https://github.com/Scatteredrain/SALI.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted to MICCAI 2024. Code and models: https://github.com/Scatteredrain/SALI
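
Note on the metric: the SALI abstract reports gains in Dice, the standard overlap score between a predicted mask and a ground-truth mask. The following is a minimal sketch of that score, assuming binary masks flattened to 0/1 sequences; the function name `dice_score` and the empty-mask convention are illustrative assumptions and are not taken from the SALI repository.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction with two foreground pixels that overlaps a one-pixel ground truth in one pixel scores 2·1/(2+1) = 2/3.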

arXiv:2404.16687 [pdf, other]

NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, et al. (89 additional authors not shown)

Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.

Submitted 7 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.13400 [pdf, other]

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Authors: Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu

Abstract: Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.

Submitted 20 April, 2024; originally announced April 2024.

Comments: The project page: https://github.com/linhuixiao/HiVG

arXiv:2403.09611 [pdf, other]

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, et al. (7 additional authors not shown)

Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Submitted 18 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2310.14860 [pdf, other]

Adaptive Tuning of Robotic Polishing Skills based on Force Feedback Model

Authors: Yu Wang, Zhouyi Zheng, Chen Chen, Zezheng Wang, Zhitao Gao, Fangyu Peng, Xiaowei Tang, Rong Yan

Abstract: Acquiring human skills offers an efficient approach to tackle complex task planning challenges. When performing a learned skill model for a continuous contact task, such as robot polishing in an uncertain environment, the robot needs to be able to adaptively modify the skill model to suit the environment and perform the desired task. The environmental perturbation of the polishing task is mainly reflected in the variation of contact force. Therefore, adjusting the task skill model by providing feedback on the contact force deviation is an effective way to meet the task requirements. In this study, a phase-modulated diagonal recurrent neural network (PMDRNN) is proposed for force feedback model learning in the robotic polishing task. The contact between the tool and the workpiece in the polishing task can be considered a dynamic system. In comparison to the existing feedforward neural network phase-modulated neural network (PMNN), PMDRNN combines the diagonal recurrent network structure with the phase-modulated neural network layer to improve the learning performance of the feedback model for dynamic systems. Specifically, data from real-world robot polishing experiments are used to learn the feedback model. PMDRNN demonstrates a significant reduction in the training error of the feedback model when compared to PMNN. Building upon this, the combination of PMDRNN and dynamic movement primitives (DMPs) can be used for real-time adjustment of skills for polishing tasks and effectively improve the robustness of the task skill model. Finally, real-world robotic polishing experiments are conducted to demonstrate the effectiveness of the approach.

Submitted 22 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: This paper has been accepted by The 2023 IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO 2023)

arXiv:2305.08685 [pdf, other]

doi 10.1109/TMM.2023.3321501

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Authors: Linhui Xiao, Xiaoshan Yang, Fang Peng, Ming Yan, Yaowei Wang, Changsheng Xu

Abstract: Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to the visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. The results even outperform existing weakly supervised visual grounding methods. Furthermore, our method is also competitive in fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG.

Submitted 24 December, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: Accepted by IEEE Transactions on Multimedia (2023). Paper page: https://ieeexplore.ieee.org/abstract/document/10269126. Code is available at https://github.com/linhuixiao/CLIP-VG

arXiv:2211.16191 [pdf, other]

SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Authors: Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, Changsheng Xu

Abstract: Although significant progress has been made in few-shot learning, most existing few-shot image classification methods require supervised pre-training on a large number of samples of base classes, which limits their generalization ability in real-world applications. Recently, large-scale Vision-Language Pre-trained models (VLPs) have been gaining increasing attention in few-shot learning because they can provide a new paradigm for transferable visual representation learning with easily available text on the Web. However, the VLPs may neglect detailed visual information that is difficult to describe by language sentences, but important for learning an effective classifier to distinguish different images. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative adapted visual features by comprehensively using an implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss. The implicit knowledge distillation is designed to transfer the fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.

Submitted 20 January, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.12713 [pdf, other]

Reliable Robustness Evaluation via Automatically Constructed Attack Ensembles

Authors: Shengcai Liu, Fu Peng, Ke Tang

Abstract: Attack Ensemble (AE), which combines multiple attacks together, provides a reliable way to evaluate adversarial robustness. In practice, AEs are often constructed and tuned by human experts, which however tends to be sub-optimal and time-consuming. In this work, we present AutoAE, a conceptually simple approach for automatically constructing AEs. In brief, AutoAE repeatedly adds the attack and its iteration steps to the ensemble that maximizes ensemble improvement per additional iteration consumed. We show theoretically that AutoAE yields AEs provably within a constant factor of the optimal for a given defense. We then use AutoAE to construct two AEs for $l_{\infty}$ and $l_2$ attacks, and apply them without any tuning or adaptation to 45 top adversarial defenses on the RobustBench leaderboard. In all but one case we achieve equal or better (often the latter) robustness evaluation than existing AEs, and notably, in 29 cases we achieve better robustness evaluation than the best known one. Such performance of AutoAE shows itself as a reliable evaluation protocol for adversarial robustness, which further indicates the huge potential of automatic AE construction. Code is available at https://github.com/LeegerPENG/AutoAE.

Submitted 23 November, 2022; originally announced November 2022.
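
Note on the greedy rule: the AutoAE abstract describes repeatedly adding the (attack, iteration steps) pair that maximizes ensemble improvement per additional iteration. The sketch below illustrates that kind of gain-per-cost greedy selection on toy data. It is not the authors' implementation: the candidate names, the use of "set of examples broken" as the improvement measure, and the fixed iteration `budget` are all assumptions made for illustration.

```python
def greedy_ensemble(candidates, budget):
    """Greedy gain-per-cost selection.

    candidates: dict mapping a hypothetical attack name to (cost, broken),
                where cost is its iteration count and broken is the set of
                example ids it successfully attacks.
    budget:     assumed cap on total iterations.
    Returns (chosen names in order, set of examples broken by the ensemble).
    """
    chosen, covered, spent = [], set(), 0
    remaining = dict(candidates)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (cost, broken) in remaining.items():
            if spent + cost > budget:
                continue  # would exceed the iteration budget
            # Marginal improvement: examples not already broken by the ensemble.
            ratio = len(broken - covered) / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds any improvement
        cost, broken = remaining.pop(best_name)
        chosen.append(best_name)
        covered |= broken
        spent += cost
    return chosen, covered


# Toy usage with made-up candidates.
toy = {
    "pgd_10": (10, {1, 2, 3}),
    "pgd_50": (50, {1, 2, 3, 4, 5}),
    "fab_20": (20, {4, 6}),
}
order, broken = greedy_ensemble(toy, budget=40)
```

With this toy input the selection picks `pgd_10` first (0.3 new examples per iteration), then `fab_20`, and skips `pgd_50` because it does not fit the assumed budget.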

arXiv:2210.07749

LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge

Authors: Yan Jia, Mi Hong, **gyu Hou, Kailong Ren, Sifan Ma, ** Wang, Fangzhen Peng, Yinglin Ji, Lin Yang, Junjie Wang

Abstract: This paper describes LeVoice automatic speech recognition systems for track2 of the intelligent cockpit speech recognition challenge 2022. Track2 is a speech recognition task without limits on the scope of model size. Our main points include deep learning based speech enhancement, text-to-speech based speech generation, training data augmentation via various techniques and speech recognition model fusion. We compared and fused the hybrid architecture and two kinds of end-to-end architecture. For end-to-end modeling, we used models based on connectionist temporal classification/attention-based encoder-decoder architecture and recurrent neural network transducer/attention-based encoder-decoder architecture. The performance of these models is evaluated with an additional language model to improve word error rates. As a result, our system achieved 10.2% character error rate on the challenge test set data and ranked third place among the submitted systems in the challenge.

Submitted 16 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: There are experimental errors

arXiv:2210.06772 [pdf, ps, other]

Mitigating Unintended Memorization in Language Models via Alternating Teaching

Authors: Zhe Liu, Xuedong Zhang, Fuchun Peng

Abstract: Recent research has shown that language models have a tendency to memorize rare or unique sequences in the training corpora which can thus leak sensitive attributes of user data. We employ a teacher-student framework and propose a novel approach called alternating teaching to mitigate unintended memorization in sequential modeling. In our method, multiple teachers are trained on disjoint training sets whose privacy one wishes to protect, and teachers' predictions supervise the training of a student model in an alternating manner at each time step. Experiments on LibriSpeech datasets show that the proposed method achieves better privacy-preserving results than other counterparts. In comparison with no prevention for unintended memorization, the overall utility loss is small when training records are sufficient.

Submitted 13 October, 2022; originally announced October 2022.

arXiv:2210.01863 [pdf, other]

Group Personalized Federated Learning

Authors: Zhe Liu, Yue Hui, Fuchun Peng

Abstract: Federated learning (FL) can help promote data privacy by training a shared model in a de-centralized manner on the physical devices of clients. In the presence of highly heterogeneous distributions of local data, personalized FL strategy seeks to mitigate the potential client drift. In this paper, we present the group personalization approach for applications of FL in which there exist inherent partitions among clients that are significantly distinct. In our method, the global FL model is fine-tuned through another FL training process over each homogeneous group of clients, after which each group-specific FL model is further adapted and personalized for any client. The proposed method can be well interpreted from a Bayesian hierarchical modeling perspective. With experiments on two real-world datasets, we demonstrate this approach can achieve better personalization performance than other FL counterparts.

Submitted 11 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

arXiv:2209.05281 [pdf, other]

Modeling Dependent Structure for Utterances in ASR Evaluation

Authors: Zhe Liu, Fuchun Peng

Abstract: The bootstrap resampling method has been popular for performing significance analysis on word error rate (WER) in automatic speech recognition (ASR) evaluation. To deal with dependent speech data, the blockwise bootstrap approach is also introduced. By dividing utterances into uncorrelated blocks, this approach resamples these blocks instead of original data. However, it is typically nontrivial to uncover the dependent structure among utterances and identify the blocks, which might lead to subjective conclusions in statistical testing. In this paper, we present graphical lasso based methods to explicitly model such dependency and estimate uncorrelated blocks of utterances in a rigorous way, after which blockwise bootstrap is applied on top of the inferred blocks. We show the resulting variance estimator of WER in ASR evaluation is statistically consistent under mild conditions. We also demonstrate the validity of proposed approach on LibriSpeech dataset.

Submitted 8 October, 2022; v1 submitted 7 September, 2022; originally announced September 2022.
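
Note on blockwise bootstrap: the abstract above resamples whole blocks of utterances rather than individual utterances when estimating WER variability. The following is a minimal sketch of that resampling step only, assuming the blocks are already given; the paper's graphical-lasso block inference is not reproduced here. The function names, the `(errors, ref_words)` tuple layout, and the percentile interval are illustrative assumptions.

```python
import random


def block_bootstrap_wer(blocks, n_resamples=1000, seed=0):
    """Resample blocks with replacement and return one WER per resample.

    blocks: list of blocks; each block is a list of (errors, ref_words)
            tuples, one per utterance. Blocks are assumed to be given and
            to contain at least one reference word each.
    """
    rng = random.Random(seed)
    # Collapse each block to its total error and reference-word counts.
    totals = [(sum(e for e, _ in b), sum(n for _, n in b)) for b in blocks]
    wers = []
    for _ in range(n_resamples):
        sample = [rng.choice(totals) for _ in totals]
        errors = sum(e for e, _ in sample)
        words = sum(n for _, n in sample)
        wers.append(errors / words)
    return wers


def percentile_interval(values, alpha=0.05):
    """Simple percentile interval over the bootstrap replicates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int((alpha / 2) * last)], ordered[int((1 - alpha / 2) * last)]


# Toy usage: three blocks of made-up (errors, reference words) pairs.
toy_blocks = [[(1, 10), (2, 10)], [(0, 20)], [(5, 20)]]
replicates = block_bootstrap_wer(toy_blocks, n_resamples=200, seed=1)
low, high = percentile_interval(replicates)
```

Resampling at the block level keeps correlated utterances together, which is the point of the blockwise variant; resampling individual utterances would understate the variance when utterances within a block are dependent.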

arXiv:2201.11867 [pdf, other]

Neural-FST Class Language Model for End-to-End Speech Recognition

Authors: Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu, Bo Wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

Abstract: We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework. Our method utilizes a background NNLM which models generic background text together with a collection of domain-specific entities modeled as individual FSTs. Each output token is generated by a mixture of these components; the mixture weights are estimated with a separately trained neural decider. We show that NFCLM significantly outperforms NNLM by 15.8% relative in terms of Word Error Rate. NFCLM achieves similar performance as traditional NNLM and FST shallow fusion while being less prone to overbiasing and 12 times more compact, making it more suitable for on-device usage.

Submitted 31 January, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Accepted for publication at ICASSP 2022

arXiv:2201.01686 [pdf, ps, other]

doi 10.3390/e24070961

Optimal Update for Energy Harvesting Sensor with Reliable Backup Energy

Authors: Lixin Wang, Fuzhou Peng, Xiang Chen, Shidong Zhou

Abstract: In this paper, we consider an information update system where a wireless sensor sends timely updates to the destination over an erasure channel with the supply of harvested energy and reliable backup energy. The metric Age of Information (AoI) is adopted to measure the timeliness of the received updates at the destination. We aim to find the optimal information updating policy that minimizes the time-average weighted sum of the AoI and the reliable backup energy cost by formulating an infinite state Markov decision process (MDP). The optimal information updating policy is proved to have a threshold structure. Based on this special structure, an algorithm for efficiently computing the optimal policy is proposed. Numerical results show that the optimal updating policy proposed outperforms baseline policies.

Submitted 5 January, 2022; originally announced January 2022.

Comments: 9 pages, 4 figures. arXiv admin note: substantial text overlap with arXiv:2110.07233

arXiv:2112.14834 [pdf, other]

Training Quantized Deep Neural Networks via Cooperative Coevolution

Authors: Fu Peng, Shengcai Liu, Ning Lu, Ke Tang

Abstract: This work considers a challenging Deep Neural Network (DNN) quantization task that seeks to train quantized DNNs without involving any full-precision operations. Most previous quantization approaches are not applicable to this task since they rely on full-precision gradients to update network weights. To fill this gap, in this work we advocate using Evolutionary Algorithms (EAs) to search for the optimal low-bits weights of DNNs. To efficiently solve the induced large-scale discrete problem, we propose a novel EA based on cooperative coevolution that repeatedly groups the network weights based on the confidence in their values and focuses on optimizing the ones with the least confidence. To the best of our knowledge, this is the first work that applies EAs to train quantized DNNs. Experiments show that our approach surpasses previous quantization approaches and can train a 4-bit ResNet-20 on the Cifar-10 dataset with the same test accuracy as its full-precision counterpart.

Submitted 23 May, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: 13 pages, 4 figures, accepted for publication of ICSI

arXiv:2110.10026 [pdf, other]

Private Language Model Adaptation for Speech Recognition

Authors: Zhe Liu, Ke Li, Shreyan Bakshi, Fuchun Peng

Abstract: Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on local devices of users. With the use of federated learning (FL), we introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition (ASR). To address the potential speech transcription errors in the on-device training corpus, we perform empirical studies on comparing various strategies of leveraging token-level confidence scores to improve the NNLM quality in the FL settings. Experiments show that compared with no model adaptation, the proposed method achieves relative 2.6% and 10.8% word error rate (WER) reductions on two speech evaluation datasets, respectively. We also provide analysis in evaluating privacy guarantees of our presented procedure.

Submitted 15 June, 2022; v1 submitted 27 September, 2021; originally announced October 2021.

arXiv:2110.07233 [pdf, ps, other]

Optimal Update in Energy Harvesting Aided Terahertz Communications with Random Blocking

Authors: Lixin Wang, Fuzhou Peng, Xiang Chen, Shidong Zhou

Abstract: In this paper, we consider an information update system where a wireless sensor sends timely updates to the destination over a random blocking terahertz channel with the supply of harvested energy and reliable energy backup. The paper aims to find the optimal information updating policy that minimizes the time-average weighted sum of the Age of Information (AoI) and the reliable energy costs by formulating an infinite state Markov decision process (MDP). With the derivation of the monotonicity of value function on each component, the optimal information updating policy is proved to have a threshold structure. Based on this special structure, an algorithm for efficiently computing the optimal policy is proposed. Numerical results show that the optimal updating policy proposed outperforms baseline policies.

Submitted 15 October, 2021; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: 9 pages, 4 Postscript figures

arXiv:2109.09061 [pdf, ps, other]

Model-Based Approach for Measuring the Fairness in ASR

Authors: Zhe Liu, Irina-Elena Veliche, Fuchun Peng

Abstract: The issue of fairness arises when the automatic speech recognition (ASR) systems do not perform equally well for all subgroups of the population. In any fairness measurement studies for ASR, the open questions of how to control the nuisance factors, how to handle unobserved heterogeneity across speakers, and how to trace the source of any word error rate (WER) gap among different subgroups are especially important - if not appropriately accounted for, incorrect conclusions will be drawn. In this paper, we introduce mixed-effects Poisson regression to better measure and interpret any WER difference among subgroups of interest. Particularly, the presented method can effectively address the three problems raised above and is very flexible to use in practical disparity analyses. We demonstrate the validity of proposed model-based approach on both synthetic and real-world speech data.

Submitted 19 September, 2021; originally announced September 2021.

arXiv:2107.04154 [pdf, other]

On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

Authors: Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer

Abstract: Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). In this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, for wordpiece/mono-char/bi-char/chenone units, with both HMM/CTC topologies. From this framework, we propose three novel training schemes: chenone(ch)/wordpiece(wp)-CTC-bMMI, and wordpiece(wp)-HMM-bMMI with different advantages in training performance, decoding efficiency and decoding time-stamp accuracy. The advantages of different training schemes are evaluated comprehensively on Librispeech, and wp-CTC-bMMI and ch-CTC-bMMI are evaluated on two real world ASR tasks to show their effectiveness. Besides, we also show bi-char(bc) HMM-MMI models can serve as better alignment models than traditional non-neural GMM-HMMs.

Submitted 26 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

Comments: accepted by ASRU 2021

arXiv:2105.12849 [pdf, ps, other]

CARLS: Cross-platform Asynchronous Representation Learning System

Authors: Chun-Ta Lu, Yun Zeng, Da-Cheng Juan, Yicheng Fan, Zhe Li, Jan Dlabal, Yi-Ting Chen, Arjun Gopalan, Allan Heydon, Chun-Sung Ferng, Reah Miyara, Ariel Fuxman, Futang Peng, Zhen Li, Tom Duerig, Andrew Tomkins

Abstract: In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components -- model trainers, knowledge makers and knowledge banks -- to concertedly work together in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning paradigms where model training benefits from additional knowledge inferred or discovered during training, such as node embeddings for graph neural networks or reliable pseudo labels from model predictions. We also describe three learning paradigms -- semi-supervised learning, curriculum learning and multimodal learning -- as examples that can be scaled up efficiently by CARLS. One version of CARLS has been open-sourced and available for download at: https://github.com/tensorflow/neural-structured-learning/tree/master/research/carls

Submitted 26 May, 2021; originally announced May 2021.

arXiv:2104.12369 [pdf, other]

PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Authors: Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, et al. (13 additional authors not shown)

Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$α$, with up to 200 billion parameters. PanGu-$α$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$α$, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$α$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$α$ in performing various tasks under few-shot or zero-shot settings.

Submitted 26 April, 2021; originally announced April 2021.

Comments: The technique report for PanGu-$α$

arXiv:2101.01304 [pdf, ps, other]

Algebraic Geometric Secret Sharing Schemes over Large Fields Are Asymptotically Threshold

Authors: Fan Peng, Hao Chen, Chang-An Zhao

Abstract: In Chen-Cramer Crypto 2006 paper \cite{cc} algebraic geometric secret sharing schemes were proposed such that the "Fundamental Theorem in Information-Theoretically Secure Multiparty Computation" by Ben-Or, Goldwasser and Wigderson \cite{BGW88} and Chaum, Crépeau and Damgård \cite{CCD88} can be established over constant-size base finite fields. These algebraic geometric secret sharing schemes defined by a curve of genus $g$ over a constant size finite field ${\bf F}_q$ are quasi-threshold in the following sense: any subset of $u \leq T-1$ players (non qualified) has no information of the secret and any subset of $u \geq T+2g$ players (qualified) can reconstruct the secret. It is natural to ask how far from the threshold these quasi-threshold secret sharing schemes are. How many subsets of $u \in [T, T+2g-1]$ players can recover the secret or have no information of the secret? In this paper it is proved that almost all subsets of $u \in [T,T+g-1]$ players have no information of the secret and almost all subsets of $u \in [T+g,T+2g-1]$ players can reconstruct the secret when the size $q$ goes to infinity and the genus satisfies $\lim \frac{g}{\sqrt{q}}=0$. Then algebraic geometric secret sharing schemes over large finite fields are asymptotically threshold in this case. We also analyze the case when the size $q$ of the base field is fixed and the genus goes to infinity.

Submitted 4 January, 2021; originally announced January 2021.

arXiv:2012.00898 [pdf, ps, other]

Federated Marginal Personalization for ASR Rescoring

Authors: Zhe Liu, Fuchun Peng

Abstract: We introduce federated marginal personalization (FMP), a novel method for continuously updating personalized neural network language models (NNLMs) on private devices using federated learning (FL). Instead of fine-tuning the parameters of NNLMs on personal data, FMP regularly estimates global and personalized marginal distributions of words, and adjusts the probabilities from NNLMs by an adaptation factor that is specific to each word. Our presented approach can overcome the limitations of federated fine-tuning and efficiently learn personalized NNLMs on devices. We study the application of FMP on second-pass ASR rescoring tasks. Experiments on two speech evaluation datasets show modest word error rate (WER) reductions. We also demonstrate that FMP could offer reasonable privacy with only a negligible cost in speech recognition accuracy.

Submitted 1 December, 2020; originally announced December 2020.

arXiv:2011.04785 [pdf, ps, other]

Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

Authors: Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

Abstract: In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos of 7 languages with training data 3K-14K hours, we conduct large-scale controlled experimentation across each criterion using identical datasets and encoder model architecture. We find that RNN-T has consistent wins in ASR accuracy, while CTC models excel at inference efficiency. Moreover, we selectively examine various modeling strategies for different training criteria, including modeling units, encoder architectures, pre-training, etc. Given such large-scale real-world streaming ASR application, to our best knowledge, we present the first comprehensive benchmark on these three widely used training criteria across a great many languages.

Submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

arXiv:2003.03701 [pdf, other]

Unifying Specialist Image Embedding into Universal Image Embedding

Authors: Yang Feng, Futang Peng, Xu Zhang, Wei Zhu, Shanfeng Zhang, Howard Zhou, Zhen Li, Tom Duerig, Shih-Fu Chang, Jiebo Luo

Abstract: Deep image embedding provides a way to measure the semantic similarity of two images. It plays a central role in many applications such as image search, face verification, and zero-shot learning. It is desirable to have a universal deep embedding model applicable to various domains of images. However, existing methods mainly rely on training specialist embedding models each of which is applicable to images from a single domain. In this paper, we study an important but unexplored task: how to train a single universal image embedding model to match the performance of several specialists on each specialist's domain. Simply fusing the training data from multiple domains cannot solve this problem because some domains become overfitted sooner when trained together using existing methods. Therefore, we propose to distill the knowledge in multiple specialists into a universal embedding to solve this problem. In contrast to existing embedding distillation methods that distill the absolute distances between images, we transform the absolute distances between images into a probabilistic distribution and minimize the KL-divergence between the distributions of the specialists and the universal embedding. Using several public datasets, we validate that our proposed method accomplishes the goal of universal image embedding.

Submitted 7 March, 2020; originally announced March 2020.
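
The distance-to-distribution step described in the abstract of "Unifying Specialist Image Embedding into Universal Image Embedding" can be illustrated with a minimal sketch. The abstract does not say how distances become probabilities, so the softmax over negative distances, the `temperature` parameter, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

def distances_to_probs(distances, temperature=1.0):
    """Softmax over negative distances (an assumed mapping): closer images get higher probability."""
    logits = [-d / temperature for d in distances]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(specialist_dists, universal_dists, temperature=1.0):
    """KL between the specialist's and the universal model's neighbor distributions."""
    p = distances_to_probs(specialist_dists, temperature)
    q = distances_to_probs(universal_dists, temperature)
    return kl_divergence(p, q)
```

Here each list holds the distances from one anchor image to the same set of other images, once under a specialist embedding and once under the universal embedding. The loss is zero when both embeddings rank and space the neighbors identically, and it grows as the two distributions diverge.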

arXiv:2002.10242 [pdf, other]

Age of Information Optimized MAC in V2X Sidelink via Piggyback-Based Collaboration

Authors: Fei Peng, Zhiyuan Jiang, Shunqing Zhang, Shugong Xu

Abstract: Real-time status update in future vehicular networks is vital to enable control-level cooperative autonomous driving. Cellular Vehicle-to-Everything (C-V2X), as one of the most promising vehicular wireless technologies, adopts a Semi-Persistent Scheduling (SPS) based Medium-Access-Control (MAC) layer protocol for its sidelink communications. Despite the recent and ongoing efforts to optimize SPS, very little work has considered the status update performance of SPS. In this paper, Age of Information (AoI) is first leveraged to evaluate the MAC layer performance of C-V2X sidelink. Critical issues of SPS, i.e., persistent packet collisions and Half-Duplex (HD) effects, are identified to hinder its AoI performance. Therefore, a piggyback-based collaboration method is proposed accordingly, whereby vehicles collaborate to inform each other of potential collisions and collectively afford HD errors, while entailing only a small signaling overhead. Closed-form AoI performance is derived for the proposed scheme, optimal configurations for key parameters are hence calculated, and the convergence property is proved for decentralized implementation. Simulation results show that compared with the standardized SPS and its state-of-the-art enhancement schemes, the proposed scheme shows significantly better performance, not only in terms of AoI, but also of conventional metrics such as transmission reliability.

Submitted 13 April, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: Submitted to IEEE TWC for possible publication

arXiv:2002.01255 [pdf, other]

Revealing Much While Saying Less: Predictive Wireless for Status Update

Authors: Zhiyuan Jiang, Zixu Cao, Siyu Fu, Fei Peng, Shan Cao, Shunqing Zhang, Shugong Xu

Abstract: Wireless communications for status update are becoming increasingly important, especially for machine-type control applications. Existing work has been mainly focused on Age of Information (AoI) optimizations. In this paper, a status-aware predictive wireless interface design, networking and implementation are presented which aim to minimize the status recovery error of a wireless networked system by leveraging online status model predictions. Two critical issues of predictive status update are addressed: practicality and usefulness. Link-level experiments on a Software-Defined-Radio (SDR) testbed are conducted and test results show that the proposed design can significantly reduce the number of wireless transmissions while maintaining a low status recovery error. A Status-aware Multi-Agent Reinforcement learning neTworking solution (SMART) is proposed to dynamically and autonomously control the transmit decisions of devices in an ad hoc network based on their individual statuses. System-level simulations of a multi dense platooning scenario are carried out on a road traffic simulator. Results show that the proposed schemes can greatly improve the platooning control performance in terms of the minimum safe distance between successive vehicles, in comparison with the AoI-optimized status-unaware and communication latency-optimized schemes. This demonstrates the usefulness of our proposed status update schemes in a real-world application.

Submitted 4 February, 2020; originally announced February 2020.

Comments: To appear in IEEE INFOCOM 2020

arXiv:1912.09508 [pdf, other]

Statistical Testing on ASR Performance via Blockwise Bootstrap

Authors: Zhe Liu, Fuchun Peng

Abstract: A common question being raised in automatic speech recognition (ASR) evaluations is how reliable is an observed word error rate (WER) improvement comparing two ASR systems, where statistical hypothesis testing and confidence interval (CI) can be utilized to tell whether this improvement is real or only due to random chance. The bootstrap resampling method has been popular for such significance analysis which is intuitive and easy to use. However, this method fails in dealing with dependent data, which is prevalent in speech world - for example, ASR performance on utterances from the same speaker could be correlated. In this paper we present blockwise bootstrap approach - by dividing evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of original data. We show that the resulting variance estimator of absolute WER difference between two ASR systems is consistent under mild conditions. We also demonstrate the validity of blockwise bootstrap method on both synthetic and real-world speech data.

Submitted 20 May, 2020; v1 submitted 19 December, 2019; originally announced December 2019.

Comments: 6 pages, 2 figures
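
The blockwise resampling idea in the abstract of "Statistical Testing on ASR Performance via Blockwise Bootstrap" can be sketched as follows. This is an illustrative sketch only: the data layout (each utterance as a tuple of errors for system A, errors for system B, and reference word count), the choice of one block per speaker, the percentile interval, and the function names are assumptions not stated in the abstract.

```python
import random

def wer_diff(blocks):
    """Absolute WER difference (system B minus system A) over a list of blocks.

    Each block is a list of utterances; each utterance is a tuple
    (errors_a, errors_b, n_reference_words). This layout is hypothetical.
    """
    errs_a = sum(u[0] for b in blocks for u in b)
    errs_b = sum(u[1] for b in blocks for u in b)
    words = sum(u[2] for b in blocks for u in b)
    return (errs_b - errs_a) / words

def blockwise_bootstrap_ci(blocks, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the WER difference.

    Whole blocks (for example, all utterances of one speaker) are resampled
    with replacement, so within-block dependence is preserved.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(blocks) for _ in blocks]
        stats.append(wer_diff(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the returned interval excludes zero, the observed WER difference is unlikely to be explained by resampling noise alone at the chosen level. Resampling individual utterances instead of blocks would ignore the within-speaker correlation the abstract warns about.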

arXiv:1911.10235 [pdf, other]

Improving N-gram Language Models with Pre-trained Deep Transformer

Authors: Yiren Wang, Hongzhao Huang, Zhe Liu, Yutong Pang, Yongqiang Wang, ChengXiang Zhai, Fuchun Peng

Abstract: Although n-gram language models (LMs) have been outperformed by the state-of-the-art neural LMs, they are still widely used in speech recognition due to their high efficiency in inference. In this paper, we demonstrate that n-gram LM can be improved by neural LMs through a text generation based data augmentation method. In contrast to previous approaches, we employ a large-scale general domain pre-training followed by in-domain fine-tuning strategy to construct deep Transformer based neural LMs. A large amount of in-domain text data is generated with the well trained deep Transformer to construct new n-gram LMs, which are then interpolated with baseline n-gram systems. Empirical studies on different speech recognition tasks show that the proposed approach can effectively improve recognition accuracy. In particular, our proposed approach brings significant relative word error rate reduction up to 6.0% for domains with limited in-domain data.

Submitted 22 November, 2019; originally announced November 2019.
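
The interpolation step mentioned in the abstract of "Improving N-gram Language Models with Pre-trained Deep Transformer" can be illustrated with a small sketch. The abstract does not specify the interpolation scheme, so linear interpolation with a single `weight`, the dictionary representation of next-word distributions for one fixed history, and the function name are all assumptions.

```python
def interpolate(p_baseline, p_generated, weight):
    """Linearly interpolate two next-word distributions for the same history.

    p_baseline and p_generated map words to probabilities; `weight` is the
    share given to the LM built from generated text (assumed, tunable).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vocab = set(p_baseline) | set(p_generated)
    return {
        w: (1.0 - weight) * p_baseline.get(w, 0.0) + weight * p_generated.get(w, 0.0)
        for w in vocab
    }
```

Because both inputs sum to one, the result also sums to one for any weight in [0, 1]. In practice the weight would typically be tuned on held-out in-domain text.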

arXiv:1911.07874 [pdf, other]

RWNE: A Scalable Random-Walk-Based Network Embedding Framework with Personalized Higher-Order Proximity Preserved

Authors: Jianxin Li, Cheng Ji, Hao Peng, Yu He, Yangqiu Song, Xinmiao Zhang, Fanzhang Peng

Abstract: Higher-order proximity preserved network embedding has attracted increasing attention. In particular, due to the superior scalability, random-walk-based network embedding has also been well developed, which could efficiently explore higher-order neighborhoods via multi-hop random walks. However, despite the success of current random-walk-based methods, most of them are usually not expressive enough to preserve the personalized higher-order proximity and lack a straightforward objective to theoretically articulate what and how network proximity is preserved. In this paper, to address the above issues, we present a general scalable random-walk-based network embedding framework, in which random walk is explicitly incorporated into a sound objective designed theoretically to preserve arbitrary higher-order proximity. Further, we introduce the random walk with restart process into the framework to naturally and effectively achieve personalized-weighted preservation of proximities of different orders. We conduct extensive experiments on several real-world networks and demonstrate that our proposed method consistently and substantially outperforms the state-of-the-art network embedding methods.

Submitted 7 April, 2021; v1 submitted 18 November, 2019; originally announced November 2019.
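
The random walk with restart process named in the RWNE abstract can be sketched in a few lines. This shows only the generic sampling process on an adjacency-list graph; how RWNE feeds such walks into its embedding objective is not described in the abstract and is not shown here. The graph representation and function name are assumptions.

```python
import random

def random_walk_with_restart(graph, start, length, restart_prob, rng):
    """Sample one walk of `length` nodes that jumps back to `start` with
    probability `restart_prob` at every step.

    `graph` maps each node to a list of neighbors (hypothetical layout).
    A node without neighbors also triggers a restart.
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = graph.get(current, [])
        if rng.random() < restart_prob or not neighbors:
            current = start
        else:
            current = rng.choice(neighbors)
        walk.append(current)
    return walk
```

A higher `restart_prob` keeps the walk close to the start node, so nearby (lower-order) neighbors are visited more often; a lower value lets the walk reach more distant, higher-order neighborhoods.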

arXiv:1910.12367 [pdf, other]

Training ASR models by Generation of Contextual Information

Authors: Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

Abstract: Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000 hours supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder acoustic representations as well as the decoder language generation abilities.

Submitted 14 February, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

arXiv:1910.11450 [pdf, ps, other]

An Empirical Study of Efficient ASR Rescoring with Transformers

Authors: Hongzhao Huang, Fuchun Peng

Abstract: Neural language models (LMs) have been proved to significantly outperform classical n-gram LMs for language modeling due to their superior abilities to model long-range dependencies in text and handle data sparsity problems. And recently, well configured deep Transformers have exhibited superior performance over shallow stack of recurrent neural network layers for language modeling. However, these state-of-the-art deep Transformer models were mostly engineered to be deep with high model capacity, which makes it computationally inefficient and challenging to be deployed into large-scale real-world applications. Therefore, it is important to develop Transformer LMs that have relatively small model sizes, while still retaining good performance of those much larger models. In this paper, we aim to conduct empirical study on training Transformers with small parameter sizes in the context of ASR rescoring. By combining techniques including subword units, adaptive softmax, large-scale model pre-training, and knowledge distillation, we show that we are able to successfully train small Transformer LMs with significant relative word error rate reductions (WERR) through n-best rescoring. In particular, our experiments on a video speech recognition dataset show that we are able to achieve WERRs ranging from 6.46% to 7.17% while only with 5.5% to 11.9% parameter sizes of the well-known large GPT model [1], whose WERR with rescoring on the same dataset is 7.58%.

Submitted 24 October, 2019; originally announced October 2019.

Comments: 5 pages, 5 tables
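
The n-best rescoring setting referred to in the abstract of "An Empirical Study of Efficient ASR Rescoring with Transformers" can be illustrated generically. The sketch below assumes log-domain scores where higher is better, a single interpolation weight `lm_weight`, and a caller-supplied `lm_score` function standing in for the Transformer LM; none of these details come from the abstract.

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=0.5):
    """Return the hypothesis text with the best combined score.

    hypotheses: list of (text, first_pass_score) pairs, log-domain scores.
    lm_score:   function mapping text to a log-domain LM score (hypothetical
                stand-in for a neural LM).
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_score(h[0]))
    return best[0]
```

With `lm_weight=0` the first-pass ranking is returned unchanged; increasing the weight lets the second-pass LM overrule acoustically plausible but linguistically unlikely hypotheses.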

arXiv:1910.10670 [pdf, other]

Efficient Dynamic WFST Decoding for Personalized Language Models

Authors: Jun Liu, Jiedan Zhu, Vishal Kathuria, Fuchun Peng

Abstract: We propose a two-layer cache mechanism to speed up dynamic WFST decoding with personalized language models. The first layer is a public cache that stores most of the static part of the graph. This is shared globally among all users. A second layer is a private cache that caches the graph that represents the personalized language model, which is only shared by the utterances from a particular user. We also propose two simple yet effective pre-initialization methods, one based on breadth-first search, and another based on a data-driven exploration of decoder states using previous utterances. Experiments with a calling speech recognition task using a personalized contact list demonstrate that the proposed public cache reduces decoding time by a factor of three compared to decoding without pre-initialization. Using the private cache provides additional efficiency gains, reducing the decoding time by a factor of five.

Submitted 23 October, 2019; originally announced October 2019.

Comments: 5 pages, 4 figures
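
The two-layer cache described in the abstract of "Efficient Dynamic WFST Decoding for Personalized Language Models" can be sketched as a data structure. This is a simplified illustration, not the paper's decoder: the class name, the `expand` callback that stands in for on-demand graph expansion, and the explicit `personalized` flag are assumptions.

```python
class TwoLayerCache:
    """Public layer shared by all users plus one private layer per user."""

    def __init__(self):
        self.public = {}   # static part of the graph, shared globally
        self.private = {}  # user_id -> cache for that user's personalized part

    def get(self, user_id, key, personalized, expand):
        """Return the cached expansion of `key`, computing it at most once per layer."""
        if personalized:
            layer = self.private.setdefault(user_id, {})
        else:
            layer = self.public
        if key not in layer:
            layer[key] = expand(key)  # hypothetical, potentially expensive expansion
        return layer[key]

    def end_session(self, user_id):
        """Drop one user's private layer; the public layer is kept."""
        self.private.pop(user_id, None)
```

Static states are expanded once and then reused across all users, while personalized states are reused only across utterances of the same user. Pre-initialization, as mentioned in the abstract, would correspond to filling these dictionaries before decoding starts.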

arXiv:1910.07117 [pdf, other]

Analyzing the Forgetting Problem in the Pretrain-Finetuning of Dialogue Response Models

Authors: Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Liu, James Glass, Fuchun Peng

Abstract: In this work, we study how the finetuning stage in the pretrain-finetune framework changes the behavior of a pretrained neural language generator. We focus on the transformer encoder-decoder model for the open-domain dialogue response generation task. Our major finding is that after standard finetuning, the model forgets some of the important language generation skills acquired during large-scale pretraining. We demonstrate the forgetting phenomenon through a set of detailed behavior analysis from the perspectives of knowledge transfer, context sensitivity, and function space projection. As a preliminary attempt to alleviate the forgetting problem, we propose an intuitive finetuning strategy named "mix-review". We find that mix-review effectively regularizes the finetuning process, and the forgetting problem is alleviated to some extent. Finally, we discuss interesting behavior of the resulting dialogue model and its implications.

Submitted 16 January, 2021; v1 submitted 15 October, 2019; originally announced October 2019.

Journal ref: EACL 2021

arXiv:1907.08489 [pdf, other]

Empowering A* Search Algorithms with Neural Networks for Personalized Route Recommendation

Authors: Jingyuan Wang, Ning Wu, Wayne Xin Zhao, Fanzhang Peng, Xin Lin

Abstract: Personalized Route Recommendation (PRR) aims to generate user-specific route suggestions in response to users' route queries. Early studies cast the PRR task as a pathfinding problem on graphs, and adopt adapted search algorithms by integrating heuristic strategies. Although these methods are effective to some extent, they require setting the cost functions with heuristics. In addition, it is difficult to utilize useful context information in the search procedure. To address these issues, we propose using neural networks to automatically learn the cost functions of a classic heuristic algorithm, namely A* algorithm, for the PRR task. Our model consists of two components. First, we employ attention-based Recurrent Neural Networks (RNN) to model the cost from the source to the candidate location by incorporating useful context information. Instead of learning a single cost value, the RNN component is able to learn a time-varying vectorized representation for the moving state of a user. Second, we propose to use a value network for estimating the cost from a candidate location to the destination. For capturing structural characteristics, the value network is built on top of improved graph attention networks by incorporating the moving state of a user and other context information. The two components are integrated in a principled way for deriving a more accurate cost of a candidate location. Extensive experiment results on three real-world datasets have shown the effectiveness and robustness of the proposed model.

Submitted 19 July, 2019; originally announced July 2019.

Comments: 9 pages, 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

arXiv:1902.10814 [pdf, other]

Graph-RISE: Graph-Regularized Image Semantic Embedding

Authors: Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, Sujith Ravi

Abstract: Learning image representations to capture fine-grained semantics has been a challenging and important task enabling many applications such as image search and clustering. In this paper, we present Graph-Regularized Image Semantic Embedding (Graph-RISE), a large-scale neural graph learning framework that allows us to train embeddings to discriminate an unprecedented O(40M) ultra-fine-grained semantic labels. Graph-RISE outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking. We provide case studies to demonstrate that, qualitatively, image retrieval based on Graph-RISE effectively captures semantics and, compared to the state-of-the-art, differentiates nuances at levels that are closer to human-perception.

Submitted 13 February, 2019; originally announced February 2019.

Comments: 9 pages, 7 figures

arXiv:1811.07665 [pdf]

FD-GAN: Face-demorphing generative adversarial network for restoring accomplice's facial image

Authors: Fei Peng, Le-bing Zhang, Min Long

Abstract: Face morphing attacks have proved to be a serious threat to existing face recognition systems. Although a few face morphing detection methods have been put forward, restoring the facial image of the face morphing accomplice remains a challenging problem. In this paper, a face de-morphing generative adversarial network (FD-GAN) is proposed to restore the accomplice's facial image. It utilizes a symmetric dual network architecture and two levels of restoration losses to separate the identity feature of the morphing accomplice. By exploiting the captured facial image (containing the criminal's identity) from the face recognition system and the morphed image stored in the e-passport system (containing both the criminal's and the accomplice's identities), the FD-GAN can effectively restore the accomplice's facial image. Experimental results and analysis demonstrate the effectiveness of the proposed scheme. It has great potential to be implemented for detecting the face morphing accomplice in a real identity verification scenario.

Submitted 21 March, 2019; v1 submitted 19 November, 2018; originally announced November 2018.

Comments: 9 pages, 7 figures

arXiv:1811.00253 [pdf, other]

Hybrid Self-Attention Network for Machine Translation

Authors: Kaitao Song, Xu Tan, Furong Peng, Jianfeng Lu

Abstract: The encoder-decoder is the typical framework for Neural Machine Translation (NMT), and different structures have been developed for improving the translation performance. Transformer is one of the most promising structures, which can leverage the self-attention mechanism to capture the semantic dependency from a global view. However, it cannot distinguish the relative position of different tokens very well, such as the tokens located at the left or right of the current token, and cannot focus on the local information around the current token either. To alleviate these problems, we propose a novel attention mechanism named Hybrid Self-Attention Network (HySAN), which applies specifically designed masks to the self-attention network to extract various kinds of semantics, such as global/local information and left/right context. Finally, a squeeze gate is introduced to combine different kinds of SANs for fusion. Experimental results on three machine translation tasks show that our proposed framework outperforms the Transformer baseline significantly and achieves superior results over state-of-the-art NMT systems.

Submitted 10 December, 2018; v1 submitted 1 November, 2018; originally announced November 2018.

arXiv:1801.02691 [pdf, other]

doi 10.1145/3173574.3173827

A Trip to the Moon: Personalized Animated Movies for Self-reflection

Authors: Fengjiao Peng, Veronica LaBelle, Emily Yue, Rosalind Picard

Abstract: Self-tracking physiological and psychological data poses the challenge of presentation and interpretation. Insightful narratives for self-tracking data can motivate the user towards constructive self-reflection. One powerful form of narrative that engages audiences across various cultures and age groups is animated movies. We collected a week of self-reported mood and behavior data from each user and created in Unity a personalized animation based on their data. We evaluated the impact of their video in a randomized control trial with a non-personalized animated video as control. We found that personalized videos tend to be more emotionally engaging, encouraging greater and lengthier writing that indicated self-reflection about moods and behaviors, compared to non-personalized control videos.

Submitted 8 January, 2018; originally announced January 2018.

ACM Class: H.5.1

Journal ref: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018

arXiv:1708.04670 [pdf, other]

DeepFaceLIFT: Interpretable Personalized Models for Automatic Estimation of Self-Reported Pain

Authors: Dianbo Liu, Fengjiao Peng, Andrew Shea, Ognjen Rudovic, Rosalind Picard

Abstract: Previous research on automatic pain estimation from facial expressions has focused primarily on "one-size-fits-all" metrics (such as PSPI). In this work, we focus on directly estimating each individual's self-reported visual-analog scale (VAS) pain metric, as this is considered the gold standard for pain measurement. The VAS pain score is highly subjective and context-dependent, and its range can vary significantly among different persons. To tackle these issues, we propose a novel two-stage personalized model, named DeepFaceLIFT, for automatic estimation of VAS. This model is based on (1) Neural Network and (2) Gaussian process regression models, and is used to personalize the estimation of self-reported pain via a set of hand-crafted personal features and multi-task learning. We show on the benchmark dataset for pain analysis (The UNBC-McMaster Shoulder Pain Expression Archive) that the proposed personalized model largely outperforms the traditional, unpersonalized models: the intra-class correlation improves from a baseline performance of 19% to a personalized performance of 35% while also providing confidence in the model's estimates, in contrast to existing models for the target task. Additionally, DeepFaceLIFT automatically discovers the pain-relevant facial regions for each person, allowing for an easy interpretation of the pain-related facial cues.

Submitted 9 August, 2017; originally announced August 2017.

arXiv:1612.05793 [pdf, ps, other]

Autonomous Localization and Mapping Using a Single Mobile Device

Authors: Tiexing Wang, Fangrong Peng, Biao Chen

Abstract: This paper considers the problem of simultaneous 2-D room shape reconstruction and self-localization without the requirement of any pre-established infrastructure. A mobile device equipped with co-located microphone and loudspeaker as well as internal motion sensors is used to emit acoustic pulses and collect echoes reflected by the walls. Using only first order echoes, room shape recovery and self-localization are feasible when auxiliary information is obtained using motion sensors. In particular, it is established that using echoes collected at three measurement locations and the two distances between consecutive measurement points, unique localization and mapping can be achieved provided that the three measurement points are not collinear. Practical algorithms for room shape reconstruction and self-localization in the presence of noise and higher order echoes are proposed along with experimental results to demonstrate the effectiveness of the proposed approach.

Submitted 17 December, 2016; originally announced December 2016.

Comments: Submitted to the IEEE Transactions on Audio, Speech and Language Processing

arXiv:1611.06026 [pdf, other]

Cross Domain Knowledge Transfer for Person Re-identification

Authors: Qiqi Xiao, Kelei Cao, Haonan Chen, Fangyue Peng, Chi Zhang

Abstract: Person Re-Identification (re-id) is a challenging task in computer vision, especially when there are limited training data from multiple camera views. In this paper, we propose a deep learning based person re-identification method by transferring knowledge of mid-level attribute features and high-level classification features. Building on the idea that identity classification, attribute recognition and re-identification share the same mid-level semantic representations, they can be trained sequentially by fine-tuning one based on another. In our framework, we train identity classification and attribute recognition tasks from deep Convolutional Neural Network (dCNN) to learn person information. The information can be transferred to the person re-id task and improves its accuracy by a large margin. Furthermore, a Long Short Term Memory (LSTM) based Recurrent Neural Network (RNN) component is extended by a spatial gate. This component is used in the re-id model to pay attention to certain spatial parts in each recurrent unit. Experimental results show that our method achieves 78.3% rank-1 recognition accuracy on the CUHK03 benchmark.

Submitted 18 November, 2016; originally announced November 2016.

Comments: 8 pages

arXiv:1611.03264 [pdf, other]

A Memristor Crossbar-Based Computation Scheme with High Precision

Authors: Junyi Li, Fulin Peng, Fan Yang, Xuan Zeng

Abstract: The memristor is a promising candidate for the basic cell of next-generation computation systems. Compared to the traditional MOSFET device, the memristor is efficient in energy and area. But one of the biggest challenges facing researchers is how to program a memristor's resistance precisely. Recently, an algorithm designed to save 8 valid bits in each memristor was proposed, but this is still not sufficient for precise computation. In this paper, we propose a crossbar-based memristor computation scheme supporting precise computations whose operands have 32 valid bits. As a brief introduction, in a multiplication with two operands, one operand is programmed as the input signal, and the other operand is saved into a so-called crossbar structure, which contains a group of memristors, and each memristor saves several valid bits, usually one or two bits only. The computation results, i.e. the multiplication of the two operands, are contained in the outputs of the crossbar structure together with noise. Analog-to-Digital Converters (ADCs) are then used to extract the valid bits, which are the most significant bits of the outputs. These valid bits can be combined together with Digital-to-Analog Converters (DACs) to get the final results. Moreover, the precision of this computation scheme can be adjusted by the user, up to 32 valid bits, so it is suitable for different application contexts.

Submitted 19 November, 2016; v1 submitted 10 November, 2016; originally announced November 2016.

Comments: 6 pages, 5 figures, conference

arXiv:1504.00150 [pdf, other]

Discovering Restricted Regular Expressions with Interleaving

Authors: Feifei Peng, Haiming Chen

Abstract: Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.

Submitted 1 April, 2015; originally announced April 2015.

Comments: 12 pages

arXiv:1212.2514 [pdf]

Boltzmann Machine Learning with the Latent Maximum Entropy Principle

Authors: Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao

Abstract: We present a new statistical learning paradigm for Boltzmann machines based on a new inference principle we have proposed: the latent maximum entropy principle (LME). LME is different both from Jaynes' maximum entropy principle and from standard maximum likelihood estimation. We demonstrate the LME principle by deriving new algorithms for Boltzmann machine parameter estimation, and show how a robust and fast new variant of the EM algorithm can be developed. Our experiments show that estimation based on LME generally yields better results than maximum likelihood estimation, particularly when inferring hidden units from small amounts of data.

Submitted 19 October, 2012; originally announced December 2012.

Comments: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

Report number: UAI-P-2003-PG-567-574

arXiv:1207.4157 [pdf]

An Integrated, Conditional Model of Information Extraction and Coreference with Applications to Citation Matching

Authors: Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay

Abstract: Although information extraction and coreference resolution appear together in many applications, most current systems perform them as independent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-593-601

Showing 1–46 of 46 results for author: Peng, F