-
Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems
Authors:
Mingyu Cui,
Jiawen Kang,
Jiajun Deng,
Xi Yin,
Yutao Xie,
Xie Chen,
Xunying Liu
Abstract:
Current ASR systems are mainly trained and evaluated at the utterance level. Long range cross utterance context can be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. In contrast to previous researches based on either LSTM-RNN encoded histories that attenuate the information from longer range contexts, or frame level concatenation of transformer context embeddings, in this paper compact low-dimensional cross utterance contextual features are learned in the Conformer-Transducer Encoder using specially designed attention pooling layers that are applied over efficiently cached preceding utterances history vectors. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance internal context only with statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data.
Submitted 25 June, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Chiral and nonreciprocal single-photon scattering in a chiral-giant-molecule waveguide-QED system
Authors:
Juan Zhou,
Xian-Li Yin,
Jie-Qiao Liao
Abstract:
We study chiral and nonreciprocal single-photon scattering in a chiral-giant-molecule waveguide-QED system. Here, the giant molecule consists of two coupled giant atoms, which interact with two linear waveguides, forming a four-port quantum device. We obtain the exact analytical expressions of the four scattering amplitudes using a real-space method. Under the Markovian limit, we find that the single-photon scattering behavior is determined by the coupling strength between the giant atoms and the waveguides, the coupling strength between the two giant atoms, and the nondipole effect caused by the phase accumulation of photons travelling between the coupling points. It is also found that chiral and nonreciprocal single-photon scattering can be realized by introducing the chiral coupling to break the symmetry in the coupling configuration between the giant molecule and the waveguides. In addition, an ideal chiral emitter-waveguide coupling enables a directional single-photon routing. In the non-Markovian regime, the scattering spectra are characterized by more abundant structures with multiple peaks and dips. In particular, we demonstrate that the non-Markovian retarded effect can induce the nonreciprocal single-photon scattering. Our results have potential applications in the design of optical quantum devices involving giant atoms, which can provide an efficient platform for studying chiral quantum optics.
Submitted 19 June, 2023;
originally announced June 2023.
-
Ultra-low-loss optical interconnect enabled by topological unidirectional guided resonance
Authors:
Haoran Wang,
Yi Zuo,
Xuefan Yin,
Zihao Chen,
Zixuan Zhang,
Feifan Wang,
Yuefeng Hu,
Xiaoyu Zhang,
Chao Peng
Abstract:
Grating couplers that interconnect photonic chips to off-chip components are of essential importance for various optoelectronics applications. Despite numerous efforts in past decades, existing grating couplers still suffer from poor energy efficiency and thus hinder photonic integration toward a larger scale. Here, we theoretically propose and experimentally demonstrate a method to achieve ultra-low-loss grating coupler by employing topological unidirectional guided resonances (UGRs). Leveraging the unidirectional emitting nature of UGRs, the useless downward radiation is greatly suppressed with no mirror placed on the bottom. By engineering the dispersion and apodizing the geometry of grating, we realize a grating coupler on 340 nm silicon-on-insulator platform with a record-low-loss of 0.34 dB and bandwidth exceeding 30 nm at the telecom wavelength of 1550 nm. We further show a pair of grating couplers works as optic via that interconnects two stacked photonic chips with a loss of only 0.94 dB. Our work sheds light on the feasibility of energy-efficient optical interconnect for silicon photonics, and paving the way to large-scale photonic integration for applications from optical communication to photonic computing.
Submitted 15 June, 2023;
originally announced June 2023.
-
Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects
Authors:
Xinghua Qu,
Hongyang Liu,
Zhu Sun,
Xiang Yin,
Yew Soon Ong,
Lu Lu,
Zejun Ma
Abstract:
Conversational recommender systems (CRSs) have become crucial emerging research topics in the field of RSs, thanks to their natural advantages of explicitly acquiring user preferences via interactive conversations and revealing the reasons behind recommendations. However, the majority of current CRSs are text-based, which is less user-friendly and may pose challenges for certain users, such as those with visual impairments or limited writing and reading abilities. Therefore, for the first time, this paper investigates the potential of voice-based CRS (VCRSs) to revolutionize the way users interact with RSs in a natural, intuitive, convenient, and accessible fashion. To support such studies, we create two VCRSs benchmark datasets in the e-commerce and movie domains, after realizing the lack of such datasets through an exhaustive literature review. Specifically, we first empirically verify the benefits and necessity of creating such datasets. Thereafter, we convert the user-item interactions to text-based conversations through the ChatGPT-driven prompts for generating diverse and natural templates, and then synthesize the corresponding audios via the text-to-speech model. Meanwhile, a number of strategies are delicately designed to ensure the naturalness and high quality of voice conversations. On this basis, we further explore the potential solutions and point out possible directions to build end-to-end VCRSs by seamlessly extracting and integrating voice-based inputs, thus delivering performance-enhanced, self-explainable, and user-friendly VCRSs. Our study aims to establish the foundation and motivate further pioneering research in the emerging field of VCRSs. This aligns with the principles of explainable AI and AI for social good, viz., utilizing technology's potential to create a fair, sustainable, and just world.
Submitted 13 June, 2023;
originally announced June 2023.
-
Computational and Storage Efficient Quadratic Neurons for Deep Neural Networks
Authors:
Chuangtao Chen,
Grace Li Zhang,
Xunzhao Yin,
Cheng Zhuo,
Ulf Schlichtmann,
Bing Li
Abstract:
Deep neural networks (DNNs) have been widely deployed across diverse domains such as computer vision and natural language processing. However, the impressive accomplishments of DNNs have been realized alongside extensive computational demands, thereby impeding their applicability on resource-constrained devices. To address this challenge, many researchers have been focusing on basic neuron structures, the fundamental building blocks of neural networks, to alleviate the computational and storage cost. In this work, an efficient quadratic neuron architecture distinguished by its enhanced utilization of second-order computational information is introduced. By virtue of their better expressivity, DNNs employing the proposed quadratic neurons can attain similar accuracy with fewer neurons and computational cost. Experimental results have demonstrated that the proposed quadratic neuron structure exhibits superior computational and storage efficiency across various tasks when compared with both linear and non-linear neurons in prior work.
Submitted 27 November, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
Extended Neighboring Extremal Optimal Control with State and Preview Perturbations
Authors:
Amin Vahidi-Moghaddam,
Kaixiang Zhang,
Zhaojian Li,
Xunyuan Yin,
Ziyou Song,
Yan Wang
Abstract:
Optimal control schemes have achieved remarkable performance in numerous engineering applications. However, they typically require high computational cost, which has limited their use in real-world engineering systems with fast dynamics and/or limited computation power. To address this challenge, Neighboring Extremal (NE) has been developed as an efficient optimal adaption strategy to adapt a pre-computed nominal control solution to perturbations from the nominal trajectory. The resulting control law is a time-varying feedback gain that can be pre-computed along with the original optimal control problem, and it takes negligible online computation. However, existing NE frameworks only deal with state perturbations while in modern applications, optimal controllers (e.g., predictive controllers) frequently incorporate preview information. Therefore, a new NE framework is needed to adapt to such preview perturbations. In this work, an extended NE (ENE) framework is developed to systematically adapt the nominal control to both state and preview perturbations. We show that the derived ENE law is two time-varying feedback gains on the state perturbation and the preview perturbation. We also develop schemes to handle nominal non-optimal solutions and large perturbations to retain optimal performance and constraint satisfaction. Case study on nonlinear model predictive control is presented due to its popularity but it can be easily extended to other optimal control schemes. Promising simulation results on the cart inverted pendulum problem demonstrate the efficacy of the ENE algorithm.
Submitted 7 June, 2023;
originally announced June 2023.
-
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Authors:
Ziyue Jiang,
Yi Ren,
Zhenhui Ye,
Jinglin Liu,
Chen Zhang,
Qian Yang,
Shengpeng Ji,
Rongjie Huang,
Chunfeng Wang,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latent using audio codec and use autoregressive language models or diffusion models to generate it, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and each of them should be modeled using a module with appropriate inductive biases. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained with large-scale wild data and models different attributes in different ways: 1) Instead of using latent encoded by audio codec as the intermediate feature, we still choose spectrogram as it separates the phase and other attributes very well. Phase can be appropriately constructed by the GAN-based vocoder and does not need to be modeled by the language model. 2) We model the timbre using global vectors since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
Submitted 6 June, 2023;
originally announced June 2023.
-
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
Authors:
Zhenhui Ye,
Ziyue Jiang,
Yi Ren,
Jinglin Liu,
Chen Zhang,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking person video with the audio track as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet due to two challenges: (1) It is challenging to mimic the timbre from out-of-domain audio for a traditional multi-speaker Text-to-Speech system. (2) It is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that well disentangles the text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the aforementioned two challenges and achieves to generate identity-preserving speech and realistic talking person video. Experiments demonstrate that our method could synthesize realistic, identity-preserving, and audio-visual synchronized talking avatar videos.
Submitted 2 August, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Detector Guidance for Multi-Object Text-to-Image Generation
Authors:
Luping Liu,
Zijian Zhang,
Yi Ren,
Rongjie Huang,
Xiang Yin,
Zhou Zhao
Abstract:
Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their capability to generate images with text containing multiple objects is still restricted. Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment. We find that mixing problems also occur on the image side and in the cross-attention blocks. The noisy images can cause different objects to appear similar, and the cross-attention blocks inject information at a pixel level, leading to leakage of global object understanding and resulting in object mixing. In this paper, we introduce Detector Guidance (DG), which integrates a latent object detection model to separate different objects during the generation process. DG first performs latent object detection on cross-attention maps (CAMs) to obtain object information. Based on this information, DG then masks conflicting prompts and enhances related prompts by manipulating the following CAMs. We evaluate the effectiveness of DG using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark, MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts and ensuring that each object possesses its unique region without any human involvement and additional iterations. Our implementation is available at https://github.com/luping-liu/Detector-Guidance.
Submitted 3 June, 2023;
originally announced June 2023.
-
Continuous-Variable Quantum Key Distribution at 10 GBaud using an Integrated Photonic-Electronic Receiver
Authors:
Adnan A. E. Hajomer,
Cedric Bruynsteen,
Ivan Derkach,
Nitin Jain,
Axl Bomhals,
Sarah Bastiaens,
Ulrik L. Andersen,
Xin Yin,
Tobias Gehring
Abstract:
Quantum key distribution (QKD) is a well-known application of quantum information theory that guarantees information-theoretically secure key exchange. As QKD becomes more and more commercially viable, challenges such as scalability, network integration, and high production costs need to be addressed. Photonic and electronic integrated circuits that can be produced in large volumes at low cost hold the key to large-scale deployment of next-generation QKD systems. Here, we present a continuous-variable (CV) QKD system using an integrated photonic-electronic receiver that combines a silicon photonic integrated circuit implementing a phase-diverse receiver with custom-designed GaAs pHEMT transimpedance amplifiers. The QKD system operates at a classical telecom symbol rate of 10 GBaud, generating high secret key rates exceeding 0.7 Gb/s over a distance of 5 km and 0.3 Gb/s over a distance of 10 km. The secret keys are secure against collective attacks with finite-size effects taken into account. Well-designed digital signal processing enabled the high-speed operation. Our experiment sets a new record for secure quantum communication and paves the way for the next generation of CV-QKD systems.
Submitted 31 May, 2023;
originally announced May 2023.
-
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
Authors:
Jiawei Huang,
Yi Ren,
Rongjie Huang,
Dongchao Yang,
Zhenhui Ye,
Chen Zhang,
Jinglin Liu,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since they do not adequately prioritize temporal information. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency: Firstly, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable length generation and enhance the temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
Submitted 29 May, 2023;
originally announced May 2023.
-
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation
Authors:
Kun Song,
Yi Ren,
Yi Lei,
Chunfeng Wang,
Kun Wei,
Lei Xie,
Xiang Yin,
Zejun Ma
Abstract:
Direct speech-to-speech translation (S2ST) has gradually become popular as it has many advantages compared with cascade S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the speech style transfer from a source language to a target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in more practical zero-shot scenarios. To solve this problem, we first build a parallel corpus using a multi-lingual multi-speaker text-to-speech synthesis (TTS) system and then propose the StyleS2ST model with cross-lingual speech style transfer ability based on a style adaptor on a direct S2ST system framework. Enabling continuous style space modeling of an acoustic model through parallel corpus training and non-parallel TTS data augmentation, StyleS2ST captures cross-lingual acoustic feature mapping from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.
Submitted 25 July, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition
Authors:
Zhi-Hao Lai,
Tian-Hao Zhang,
Qi Liu,
Xinyuan Qian,
Li-Fang Wei,
Song-Lu Chen,
Feng Chen,
Xu-Cheng Yin
Abstract:
The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention to the interaction of local and global features, and their series architectures are rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. Besides, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of our proposed InterFormer and its superior performance over the other Transformer and Conformer models.
Submitted 29 May, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Control invariant set enhanced safe reinforcement learning: improved sampling efficiency, guaranteed stability and robustness
Authors:
Song Bo,
Bernard T. Agyeman,
Xunyuan Yin,
Jinfeng Liu
Abstract:
Reinforcement learning (RL) is an area of significant research interest, and safe RL in particular is attracting attention due to its ability to handle safety-driven constraints that are crucial for real-world applications. This work proposes a novel approach to RL training, called control invariant set (CIS) enhanced RL, which leverages the advantages of utilizing the explicit form of CIS to improve stability guarantees and sampling efficiency. Furthermore, the robustness of the proposed approach is investigated in the presence of uncertainty. The approach consists of two learning stages: offline and online. In the offline stage, CIS is incorporated into the reward design, initial state sampling, and state reset procedures. This incorporation of CIS facilitates improved sampling efficiency during the offline training process. In the online stage, RL is retrained whenever the predicted next step state is outside of the CIS, which serves as a stability criterion, by introducing a Safety Supervisor to examine the safety of the action and make necessary corrections. The stability analysis is conducted for both cases, with and without uncertainty. To evaluate the proposed approach, we apply it to a simulated chemical reactor. The results show a significant improvement in sampling efficiency during offline training and closed-loop stability guarantee in the online implementation, with and without uncertainty.
Submitted 24 May, 2023;
originally announced May 2023.
-
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Authors:
Rongjie Huang,
Huadai Liu,
Xize Cheng,
Yi Ren,
Linjun Li,
Zhenhui Ye,
Jinzheng He,
Lichao Zhang,
Jinglin Liu,
Xiang Yin,
Zhou Zhao
Abstract:
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io
Submitted 24 May, 2023;
originally announced May 2023.
-
Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding
Authors:
Tian-Hao Zhang,
Hai-Bo Qin,
Zhi-Hao Lai,
Song-Lu Chen,
Qi Liu,
Feng Chen,
Xinyuan Qian,
Xu-Cheng Yin
Abstract:
Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant Semi-ASCD is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.
Submitted 23 May, 2023;
originally announced May 2023.
-
Unsupervised Multi-view Pedestrian Detection
Authors:
Mengyin Liu,
Chao Zhu,
Shiqi Ren,
Xu-Cheng Yin
Abstract:
With the prosperity of video surveillance, multiple cameras have been applied to accurately locate pedestrians in a specific area. However, previous methods rely on human-labeled annotations in every video frame and camera view, leading to a heavier burden than the necessary camera calibration and synchronization. Therefore, we propose in this paper an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to eliminate the need for annotations to learn a multi-view pedestrian detector via 2D-3D mapping. 1) Firstly, Semantic-aware Iterative Segmentation (SIS) is proposed to extract unsupervised representations of multi-view images, which are converted into 2D pedestrian masks as pseudo labels, via our proposed iterative PCA and zero-shot semantic classes from vision-language models. 2) Secondly, we propose Geometry-aware Volume-based Detector (GVD) to end-to-end encode multi-view 2D images into a 3D volume to predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D rendering losses with SIS pseudo labels. 3) Thirdly, for better detection results, i.e., the 3D density projected on Birds-Eye-View from GVD, we propose Vertical-aware BEV Regularization (VBR) to constrain them to be vertical like the natural pedestrian poses. Extensive experiments on popular multi-view pedestrian detection benchmarks Wildtrack, Terrace, and MultiviewX show that our proposed UMPD approach, as the first fully-unsupervised method to the best of our knowledge, performs competitively with the previous state-of-the-art supervised techniques. Code will be available.
Submitted 19 November, 2023; v1 submitted 21 May, 2023;
originally announced May 2023.
-
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training
Authors:
Zhenhui Ye,
Rongjie Huang,
Yi Ren,
Ziyue Jiang,
Jinglin Liu,
Jinzheng He,
Xiang Yin,
Zhou Zhao
Abstract:
Improving text representation has attracted much attention to achieve expressive text-to-speech (TTS). However, existing works only implicitly learn the prosody with masked token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the same text token under different contexts. Specifically, 1) We encourage the model to connect the text context with its corresponding prosody pattern in the joint multi-modal space with the elaborate design of the encoder inputs and contrastive loss; 2) We introduce a multi-scale pre-training pipeline to capture prosody patterns in multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech could improve the prosody prediction for existing TTS methods, but also demonstrate its generalization ability to adapt to multiple languages and multi-speaker TTS. We also deeply analyze the principle behind the performance of CLAPSpeech. Ablation studies demonstrate the necessity of each component in our method. Source code and audio samples are available at https://clapspeech.github.io.
Submitted 18 May, 2023;
originally announced May 2023.
-
On the Hidden Mystery of OCR in Large Multimodal Models
Authors:
Yuliang Liu,
Zhang Li,
Biao Yang,
Chunyuan Li,
Xucheng Yin,
Cheng-lin Liu,
Lianwen Jin,
Xiang Bai
Abstract:
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
Submitted 17 January, 2024; v1 submitted 13 May, 2023;
originally announced May 2023.
-
Medical supervised masked autoencoders: Crafting a better masking strategy and efficient fine-tuning schedule for medical image classification
Authors:
Jiawei Mao,
Shujian Guo,
Yuanqi Chang,
Xuesong Yin,
Binling Nie
Abstract:
Masked autoencoders (MAEs) have displayed significant potential in the classification and semantic segmentation of medical images in the last year. Due to the high similarity of human tissues, even slight changes in medical images may represent diseased tissues, necessitating fine-grained inspection to pinpoint diseased tissues. The random masking strategy of MAEs is likely to result in areas of lesions being overlooked by the model. At the same time, inconsistencies between the pre-training and fine-tuning phases impede the performance and efficiency of MAE in medical image classification. To address these issues, we propose a medical supervised masked autoencoder (MSMAE) in this paper. In the pre-training phase, MSMAE precisely masks medical images via the attention maps obtained from supervised training, contributing to the representation learning of human tissue in the lesion area. During the fine-tuning phase, MSMAE is also driven by attention to the accurate masking of medical images. This improves the computational efficiency of the MSMAE while increasing the difficulty of fine-tuning, which indirectly improves the quality of MSMAE medical diagnosis. Extensive experiments demonstrate that MSMAE achieves state-of-the-art performance on three official medical datasets for various diseases. Meanwhile, transfer learning for MSMAE also demonstrates the great potential of our approach for medical semantic segmentation tasks. Moreover, the MSMAE accelerates the inference time in the fine-tuning phase by 11.2% and reduces the number of floating-point operations (FLOPs) by 74.08% compared to a traditional MAE.
Submitted 9 May, 2023;
originally announced May 2023.
-
Distributed economic predictive control of integrated energy systems for enhanced synergy and grid response: A decomposition and cooperation strategy
Authors:
Long Wu,
Xunyuan Yin,
Lei Pan,
Jinfeng Liu
Abstract:
The close integration of increasing operating units into an integrated energy system (IES) results in complex interconnections between these units. The strong dynamic interactions create barriers to designing a successful distributed coordinated controller to achieve synergy between all the units and unlock the potential for grid response. To address these challenges, we introduce a directed graph representation of IESs using an augmented Jacobian matrix to depict their underlying dynamics topology. By utilizing this representation, a generic subsystem decomposition method is proposed to partition the entire IES vertically based on the dynamic time scale and horizontally based on the closeness of interconnections between the operating units. Exploiting the decomposed subsystems, we develop a cooperative distributed economic model predictive control (DEMPC) with multiple global objectives that regulate the generated power at the grid's requests and satisfy the customers' cooling and system economic requirements. In the DEMPC, multiple local decision-making agents cooperate sequentially and iteratively to leverage the potential across all the units for system-wide dynamic synergy. Furthermore, we discuss how subsystem decomposition impacts the design of distributed cooperation schemes for IESs and provide a control-oriented basic guideline on the optimal decomposition of complex energy systems. Extensive simulations demonstrate that the control strategies with different levels of decomposition and collaboration will lead to marked differences in the overall performance of IES. The standard control scheme based on the proposed subsystem configuration outperforms the empirical decomposition-based control benchmark by about 20%. The DEMPC architecture further improves the overall performance of the IES by about 55% compared to the benchmark.
Submitted 9 May, 2023;
originally announced May 2023.
-
Self-passivated freestanding superconducting oxide film for flexible electronics
Authors:
Zhuoyue Jia,
Chi Sin Tang,
Jing Wu,
Changjian Li,
Wanting Xu,
Kairong Wu,
Difan Zhou,
Ping Yang,
Shengwei Zeng,
Zhigang Zeng,
Dengsong Zhang,
Ariando Ariando,
Mark B. H. Breese,
Chuanbing Cai,
Xinmao Yin
Abstract:
The integration of high-temperature superconducting YBa2Cu3O6+x (YBCO) into flexible electronic devices has the potential to revolutionize the technology industry. The effective preparation of high-quality flexible YBCO films therefore plays a key role in this development. We present a novel approach for transferring water-sensitive YBCO films onto flexible substrates without any buffer layer. Freestanding YBCO film on a polydimethylsiloxane substrate is extracted by etching the Sr3Al2O6 sacrificial layer from the LaAlO3 substrate. In addition to the obtained freestanding YBCO thin film having a Tc of 89.1 K, the freestanding YBCO thin films under inward and outward bending conditions have Tc of 89.6 K and 88.9 K, respectively. A comprehensive characterization involving multiple experimental techniques including high-resolution transmission electron microscopy, scanning electron microscopy, Raman and X-ray Absorption Spectroscopy is conducted to investigate the morphology, structural and electronic properties of the YBCO film before and after the extraction process where it shows the preservation of the structural and superconductive properties of the freestanding YBCO virtually in its pristine state. Further investigation reveals the formation of a YBCO passivated layer serves as a protective layer which effectively preserves the inner section of the freestanding YBCO during the etching process. This work plays a key role in actualizing the fabrication of flexible oxide thin films and opens up new possibilities for a diverse range of device applications involving thin-films and low-dimensional materials.
Submitted 6 July, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation
Authors:
Zhenhui Ye,
Jinzheng He,
Ziyue Jiang,
Rongjie Huang,
Jiawei Huang,
Jinglin Liu,
Yi Ren,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
Generating talking person portraits with arbitrary speech audio is a crucial problem in the field of digital human and metaverse. A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video. However, there still exist several challenges for NeRF-based methods: 1) as for the lip synchronization, it is hard to generate a long facial motion sequence of high temporal consistency and audio-lip accuracy; 2) as for the video quality, due to the limited data used to train the renderer, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3) as for the system efficiency, the slow training and inference speed of the vanilla NeRF severely obstruct its usage in real-world applications. In this paper, we propose GeneFace++ to handle these challenges by 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method to regulate the outliers in the predicted motion sequence to avoid robustness issues; 3) designing a computationally efficient NeRF-based motion-to-video renderer to achieve fast training and real-time inference. With these settings, GeneFace++ becomes the first NeRF-based method that achieves stable and real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in terms of subjective and objective evaluation. Video samples are available at https://genefaceplusplus.github.io.
Submitted 1 May, 2023;
originally announced May 2023.
-
Large and moderate deviations for empirical density fields of stochastic SEIR epidemics with vertex-dependent transition rates
Authors:
Xiaofeng Xue,
Xueting Yin
Abstract:
In this paper, we are concerned with stochastic susceptible-exposed-infected-removed epidemics on complete graphs with vertex-dependent transition rates. Large and moderate deviations of empirical density fields of our models are given. Proofs of our main results utilize exponential martingale strategies. Mathematical difficulties are mainly in checks of exponential tightness of fluctuation density fields of our processes. As an application of our main results, moderate deviations of a family of hitting times of our processes are also given.
Submitted 29 April, 2023;
originally announced May 2023.
-
Causal State Estimation and the Heisenberg Uncertainty Principle
Authors:
Junxin Chen,
Benjamin B. Lane,
Su Direkci,
Dhruva Ganapathy,
Xinghui Yin,
Nergis Mavalvala,
Yanbei Chen,
Vivishek Sudhir
Abstract:
The observables of a noisy quantum system can be estimated by appropriately filtering the records of their continuous measurement. Such filtering is relevant for state estimation and measurement-based quantum feedback control. It is therefore imperative that the observables estimated through a causal filter satisfy the Heisenberg uncertainty principle. In the Markovian setting, prior work implicitly guarantees this requirement. We show that any causal estimate of linear observables of a linear, but not necessarily Markovian, system will satisfy the uncertainty principle. In particular, this is true irrespective of any feedback control of the system and of where in the feedback loop -- inside or outside -- the measurement record is accessed. Indeed, causal estimators using the in-loop measurement record can be as precise as those using the out-of-loop record. These results clarify the role of causal estimators for a large class of quantum systems, restore the equanimity of in-loop and out-of-loop measurements in their estimation and control, and simplify future experiments on measurement-based quantum feedback control.
Submitted 17 October, 2023; v1 submitted 27 April, 2023;
originally announced April 2023.
-
Quantile Extreme Gradient Boosting for Uncertainty Quantification
Authors:
Xiaozhe Yin,
Masoud Fallah-Shorshani,
Rob McConnell,
Scott Fruin,
Yao-Yi Chiang,
Meredith Franklin
Abstract:
As the availability, size and complexity of data have increased in recent years, machine learning (ML) techniques have become popular for modeling. Predictions resulting from applying ML models are often used for inference, decision-making, and downstream applications. A crucial yet often overlooked aspect of ML is uncertainty quantification, which can significantly impact how predictions from models are used and interpreted.
Extreme Gradient Boosting (XGBoost) is one of the most popular ML methods given its simple implementation, fast computation, and sequential learning, which make its predictions highly accurate compared to other methods. However, techniques for uncertainty determination in ML models such as XGBoost have not yet been universally agreed among its varying applications. We propose enhancements to XGBoost whereby a modified quantile regression is used as the objective function to estimate uncertainty (QXGBoost). Specifically, we included the Huber norm in the quantile regression model to construct a differentiable approximation to the quantile regression error function. This key step allows XGBoost, which uses a gradient-based optimization algorithm, to make probabilistic predictions efficiently.
QXGBoost was applied to create 90% prediction intervals for one simulated dataset and one real-world environmental dataset of measured traffic noise. Our proposed method had comparable or better performance than the uncertainty estimates generated for regular and quantile light gradient boosting. For both the simulated and traffic noise datasets, the overall performance of the prediction intervals from QXGBoost was better than other models based on the coverage width-based criterion.
Submitted 23 April, 2023;
originally announced April 2023.
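The Huber-smoothed quantile objective mentioned in the abstract above can be illustrated with a minimal sketch. This is one common way to "Huberize" the pinball loss, not necessarily the exact formulation used in the paper; the function name `huber_quantile_loss` and the parameter `delta` are assumptions introduced here for illustration.

```python
import numpy as np

def huber_quantile_loss(residual, tau, delta=1.0):
    """Huber-smoothed pinball loss and its derivative w.r.t. the residual.

    residual = y_true - y_pred; tau is the target quantile in (0, 1).
    delta (hypothetical name) sets the width of the quadratic region
    that replaces the kink of the plain pinball loss at zero.
    """
    r = np.asarray(residual, dtype=float)
    abs_r = np.abs(r)
    # Standard Huber function: quadratic near zero, linear in the tails.
    huber = np.where(abs_r <= delta, r ** 2 / (2 * delta), abs_r - delta / 2)
    huber_grad = np.where(abs_r <= delta, r / delta, np.sign(r))
    # Asymmetric quantile weights: tau for under-prediction, 1 - tau otherwise.
    weight = np.where(r >= 0, tau, 1 - tau)
    return weight * huber, weight * huber_grad

loss, grad = huber_quantile_loss([2.0, -2.0, 0.5, 0.0], tau=0.9, delta=1.0)
```

Because the derivative is continuous at zero, a gradient-based booster can use it directly; the derivative with respect to the prediction is the negative of the value returned here.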
-
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Authors:
Jie An,
Songyang Zhang,
Harry Yang,
Sonal Gupta,
Jia-Bin Huang,
Jiebo Luo,
Xi Yin
Abstract:
We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation.
Submitted 17 April, 2023; v1 submitted 17 April, 2023;
originally announced April 2023.
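The parameter-free temporal shift described in the abstract above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the `(T, C, H, W)` layout, the `fold_div` parameter, and the zero padding at the sequence boundaries are assumptions, not details confirmed by the abstract.

```python
import numpy as np

def temporal_shift(x, fold_div=4):
    """Shift two channel groups along the time axis; leave the rest unchanged.

    x has shape (T, C, H, W). The first C // fold_div channels are shifted
    forward in time (frame t receives features from frame t - 1), the next
    C // fold_div channels are shifted backward (frame t receives features
    from frame t + 1). Boundary frames are zero-padded (an assumption).
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                  # from the previous frame
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # from the subsequent frame
    out[:, 2 * fold:] = x[:, 2 * fold:]             # untouched channels
    return out

frames = np.arange(3 * 4, dtype=float).reshape(3, 4, 1, 1)
shifted = temporal_shift(frames, fold_div=4)
```

The operation has no learnable weights, so a spatial layer applied afterwards sees features from neighboring frames without any added parameters.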
-
Essential role of liquid phase on melt-processed GdBCO single-grain superconductors
Authors:
Xiongfang Liu,
Xuechun Wang,
Jinyu He,
Yixue Fu,
Xinmao Yin,
Chuanbing Cai,
Yibing Zhang,
Difan Zhou
Abstract:
RE-Ba-Cu-O (RE denotes rare earth elements) single-grain superconductors have garnered considerable attention owing to their ability to trap strong magnetic fields and self-stability for maglev. Here, we employed a modified melt-growth method by adding liquid source (LS) to provide a liquid rich environment during crystal growth. It further enables a significantly low maximum processing temperature (Tmax) even approaching peritectic decomposition temperature. This method is referred to as the liquid source rich low Tmax (LS+LTmax) growth method which combines the advantage of Top Seeded Infiltration Growth (TSIG) into Top Seeded Melt-texture Growth (TSMG). The LS+LTmax method synergistically regulates the perfect appearance and high superconducting performance in REBCO single grains. The complementary role of liquid source and low Tmax on the crystallization has been carefully investigated. Microstructure analysis demonstrates that the LS+LTmax processed GdBCO single grains show clear advantages of uniform distribution of RE3+ ions as well as RE211 particles. The inhibition of Gd211 coarsening leads to improved pinning properties. GdBCO single-grain superconductors with diameter of 18 mm and 25 mm show maximum trapped magnetic field of 0.746 T and 1.140 T at 77 K. These trapped fields are significantly higher than those of conventional TSMG samples. Particularly, at grain boundaries with reduced RE211 density superior flux pinning performance has been observed. It indicates the existence of multiple pinning mechanisms at these areas. The presented strategy provides essential LS+LTmax technology for processing high performance single-grain superconductors with improved reliability which is considered important for engineering applications.
Submitted 13 April, 2023;
originally announced April 2023.
-
State estimation of a carbon capture process through POD model reduction and neural network approximation
Authors:
Siyu Liu,
Xunyuan Yin,
Jinfeng Liu
Abstract:
This paper presents an efficient approach for state estimation of post-combustion CO2 capture plants (PCCPs) by using reduced-order neural network models. The method involves extracting lower-dimensional feature vectors from high-dimensional operational data of the PCCP and constructing a reduced-order process model using proper orthogonal decomposition (POD). Multi-layer perceptron (MLP) neural networks capture the dominant dynamics of the process and train the network parameters with low-dimensional data obtained from open-loop simulations. The proposed POD-MLP model can be used as the basis for estimating the states of PCCPs at a significantly decreased computational cost. For state estimation, a reduced-order extended Kalman filtering (EKF) scheme based on the POD-MLP model is developed. Our simulations demonstrate that the proposed POD-MLP modeling approach reduces computational complexity compared to the POD-only model for nonlinear systems. Additionally, the POD-MLP-EKF algorithm can accurately reconstruct the full state information of PCCPs while significantly improving computational efficiency compared to the EKF based on the original PCCP model.
Submitted 11 April, 2023;
originally announced April 2023.
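The proper orthogonal decomposition (POD) step mentioned in the abstract above can be sketched generically with a truncated SVD of a snapshot matrix. This is a textbook-style sketch, not the authors' implementation; the function names `pod_basis`, `project`, and `reconstruct`, the mean-centering choice, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def pod_basis(snapshots, rank):
    """Return the leading POD modes and the snapshot mean.

    snapshots has shape (n_states, n_samples): each column is one
    full-order state vector. Mean-centering is an assumption here.
    """
    mean = snapshots.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(snapshots - mean, full_matrices=False)
    return u[:, :rank], mean

def project(x, basis, mean):
    """Map full-order states to low-dimensional POD coordinates."""
    return basis.T @ (x - mean)

def reconstruct(z, basis, mean):
    """Map POD coordinates back to the full-order state space."""
    return basis @ z + mean

# Synthetic data with two dominant modes (illustrative only).
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 200)) + 3.0
basis, mean = pod_basis(data, rank=2)
reduced = project(data, basis, mean)
```

In a reduced-order estimator, a learned model (such as the MLP the abstract describes) would be trained on the low-dimensional coordinates, and `reconstruct` would recover the full state estimate.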
-
Control invariant set enhanced reinforcement learning for process control: improved sampling efficiency and guaranteed stability
Authors:
Song Bo,
Xunyuan Yin,
Jinfeng Liu
Abstract:
Reinforcement learning (RL) is an area of significant research interest, and safe RL in particular is attracting attention due to its ability to handle safety-driven constraints that are crucial for real-world applications of RL algorithms. This work proposes a novel approach to RL training, called control invariant set (CIS) enhanced RL, which leverages the benefits of CIS to improve stability guarantees and sampling efficiency. The approach consists of two learning stages: offline and online. In the offline stage, CIS is incorporated into the reward design, initial state sampling, and state reset procedures. In the online stage, RL is retrained whenever the state is outside of CIS, which serves as a stability criterion. A backup table that utilizes the explicit form of CIS is obtained to ensure the online stability. To evaluate the proposed approach, we apply it to a simulated chemical reactor. The results show a significant improvement in sampling efficiency during offline training and closed-loop stability in the online implementation.
Submitted 11 April, 2023;
originally announced April 2023.
-
HDR Video Reconstruction with a Large Dynamic Dataset in Raw and sRGB Domains
Authors:
Huanjing Yue,
Yubo Peng,
Biting Yu,
Xuanwu Yin,
Zhenyu Zhou,
Jingyu Yang
Abstract:
High dynamic range (HDR) video reconstruction is attracting more and more attention due to the superior visual quality compared with those of low dynamic range (LDR) videos. The availability of LDR-HDR training pairs is essential for the HDR reconstruction quality. However, there are still no real LDR-HDR pairs for dynamic scenes due to the difficulty in capturing LDR-HDR frames simultaneously. In this work, we propose to utilize a staggered sensor to capture two alternate exposure images simultaneously, which are then fused into an HDR frame in both raw and sRGB domains. In this way, we build a large scale LDR-HDR video dataset with 85 scenes and each scene contains 60 frames. Based on this dataset, we further propose a Raw-HDRNet, which utilizes the raw LDR frames as inputs. We propose a pyramid flow-guided deformation convolution to align neighboring frames. Experimental results demonstrate that 1) the proposed dataset can improve the HDR reconstruction performance on real scenes for three benchmark networks; 2) Compared with sRGB inputs, utilizing raw inputs can further improve the reconstruction quality and our proposed Raw-HDRNet is a strong baseline for raw HDR reconstruction. Our dataset and code will be released after the acceptance of this paper.
Submitted 12 April, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
Authors:
Mengyin Liu,
Jie Jiang,
Chao Zhu,
Xu-Cheng Yin
Abstract:
Detecting pedestrians accurately in urban scenes is significant for realistic applications like autonomous driving or video surveillance. However, confusing human-like objects often lead to wrong detections, and small scale or heavily occluded pedestrians are easily missed due to their unusual appearances. To address these challenges, only object regions are inadequate, thus how to fully utilize more explicit and semantic contexts becomes a key problem. Meanwhile, previous context-aware pedestrian detectors either only learn latent contexts with visual clues, or need laborious annotations to obtain explicit and semantic contexts. Therefore, we propose in this paper a novel approach via Vision-Language semantic self-supervision for context-aware Pedestrian Detection (VLPD) to model explicitly semantic contexts without any extra annotations. Firstly, we propose a self-supervised Vision-Language Semantic (VLS) segmentation method, which learns both fully-supervised pedestrian detection and contextual segmentation via self-generated explicit labels of semantic classes by vision-language models. Furthermore, a self-supervised Prototypical Semantic Contrastive (PSC) learning method is proposed to better discriminate pedestrians and other classes, based on more explicit and semantic contexts obtained from VLS. Extensive experiments on popular benchmarks show that our proposed VLPD achieves superior performances over the previous state-of-the-arts, particularly under challenging circumstances like small scale and heavy occlusion. Code is available at https://github.com/lmy98129/VLPD.
Submitted 6 April, 2023;
originally announced April 2023.
-
Human-like Summarization Evaluation with ChatGPT
Authors:
Mingqi Gao,
Jie Ruan,
Renliang Sun,
Xunjian Yin,
Shi** Yang,
Xiaojun Wan
Abstract:
Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.
Submitted 5 April, 2023;
originally announced April 2023.
-
Data-Driven Safe Controller Synthesis for Deterministic Systems: A Posteriori Method With Validation Tests
Authors:
Yu Chen,
Chao Shang,
Xiaolin Huang,
Xiang Yin
Abstract:
In this work, we investigate the data-driven safe control synthesis problem for unknown dynamic systems. We first formulate the safety synthesis problem as a robust convex program (RCP) based on the notion of control barrier functions. To resolve the issue of unknown system dynamics, we follow the existing approach by converting the RCP to a scenario convex program (SCP) by randomly collecting finite samples of system trajectories. However, to improve the sample efficiency to achieve a desired confidence bound, we provide a new a posteriori method with validation tests. Specifically, after collecting a set of data for the SCP, we further collect another set of independent validation data as posterior information to test the obtained solution. We derive a new overall confidence bound for the safety of the controller that connects the original sample data, the support constraints, and the validation data. The efficiency of the proposed approach is illustrated by a case study of room temperature control. We show that, compared with existing methods, the proposed approach can significantly reduce the required number of sample data to achieve a desired confidence bound.
Submitted 3 April, 2023;
originally announced April 2023.
-
Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization
Authors:
Mingze Yuan,
Yingda Xia,
Hexin Dong,
Zifan Chen,
Jiawen Yao,
Mingyan Qiu,
Ke Yan,
Xiaoli Yin,
Yu Shi,
Xin Chen,
Zaiyi Liu,
Bin Dong,
**gren Zhou,
Le Lu,
Ling Zhang,
Li Zhang
Abstract:
Real-world medical image segmentation has tremendous long-tailed complexity of objects, among which tail conditions correlate with relatively rare diseases and are clinically significant. A trustworthy medical AI algorithm should demonstrate its effectiveness on tail conditions to avoid clinically dangerous damage in these out-of-distribution (OOD) cases. In this paper, we adopt the concept of object queries in Mask Transformers to formulate semantic segmentation as a soft cluster assignment. The queries fit the feature-level cluster centers of inliers during training. Therefore, when performing inference on a medical image in real-world scenarios, the similarity between pixels and the queries detects and localizes OOD regions. We term this OOD localization as MaxQuery. Furthermore, the foregrounds of real-world medical images, whether OOD objects or inliers, are lesions. The difference between them is less than that between the foreground and background, possibly misleading the object queries to focus redundantly on the background. Thus, we propose a query-distribution (QD) loss to enforce clear boundaries between segmentation targets and other regions at the query level, improving the inlier segmentation and OOD indication. Our proposed framework is tested on two real-world segmentation tasks, i.e., segmentation of pancreatic and liver tumors, outperforming previous state-of-the-art algorithms by an average of 7.39% on AUROC, 14.69% on AUPR, and 13.79% on FPR95 for OOD localization. On the other hand, our framework improves the performance of inlier segmentation by an average of 5.27% DSC when compared with the leading baseline nnUNet.
Submitted 31 March, 2023;
originally announced April 2023.
-
MaLP: Manipulation Localization Using a Proactive Scheme
Authors:
Vishal Asnani,
Xi Yin,
Tal Hassner,
Xiaoming Liu
Abstract:
Advancements in the generation quality of various Generative Models (GMs) have made it necessary to not only perform binary manipulation detection but also localize the modified pixels in an image. However, prior works termed as passive for manipulation localization exhibit poor generalization performance over unseen GMs and attribute modifications. To combat this issue, we propose a proactive scheme for manipulation localization, termed MaLP. We encrypt the real images by adding a learned template. If the image is manipulated by any GM, this added protection from the template not only aids binary detection but also helps in identifying the pixels modified by the GM. The template is learned by leveraging local and global-level features estimated by a two-branch architecture. We show that MaLP performs better than prior passive works. We also show the generalizability of MaLP by testing on 22 different GMs, providing a benchmark for future research on manipulation localization. Finally, we show that MaLP can be used as a discriminator for improving the generation quality of GMs. Our models/codes are available at www.github.com/vishal3477/pro_loc.
Submitted 4 April, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Potential quantum advantage for simulation of fluid dynamics
Authors:
Xiangyu Li,
Xiaolong Yin,
Nathan Wiebe,
Jaehun Chun,
Gregory K. Schenter,
Margaret S. Cheung,
Johannes Mülmenstädt
Abstract:
Numerical simulation of turbulent fluid dynamics needs to either parameterize turbulence-which introduces large uncertainties-or explicitly resolve the smallest scales-which is prohibitively expensive. Here we provide evidence through analytic bounds and numerical studies that a potential quantum exponential speedup can be achieved to simulate the Navier-Stokes equations governing turbulence using quantum computing. Specifically, we provide a formulation of the lattice Boltzmann equation for which we give evidence that low-order Carleman linearization is much more accurate than previously believed for these systems and that for computationally interesting examples. This is achieved via a combination of reformulating the nonlinearity and accurately linearizing the dynamical equations, effectively trading nonlinearity for additional degrees of freedom that add negligible expense in the quantum solver. Based on this we apply a quantum algorithm for simulating the Carleman-linearized lattice Boltzmann equation and provide evidence that its cost scales logarithmically with system size, compared to polynomial scaling in the best known classical algorithms. This work suggests that an exponential quantum advantage may exist for simulating fluid dynamics, paving the way for simulating nonlinear multiscale transport phenomena in a wide range of disciplines using quantum computing.
Submitted 28 March, 2024; v1 submitted 29 March, 2023;
originally announced March 2023.
-
STCF Conceptual Design Report: Volume 1 -- Physics & Detector
Authors:
M. Achasov,
X. C. Ai,
R. Aliberti,
L. P. An,
Q. An,
X. Z. Bai,
Y. Bai,
O. Bakina,
A. Barnyakov,
V. Blinov,
V. Bobrovnikov,
D. Bodrov,
A. Bogomyagkov,
A. Bondar,
I. Boyko,
Z. H. Bu,
F. M. Cai,
H. Cai,
J. J. Cao,
Q. H. Cao,
Z. Cao,
Q. Chang,
K. T. Chao,
D. Y. Chen,
H. Chen
, et al. (413 additional authors not shown)
Abstract:
The Super $\tau$-Charm facility (STCF) is an electron-positron collider proposed by the Chinese particle physics community. It is designed to operate in a center-of-mass energy range from 2 to 7 GeV with a peak luminosity of $0.5\times 10^{35}{\rm cm}^{-2}{\rm s}^{-1}$ or higher. The STCF will produce a data sample about a factor of 100 larger than that by the present $\tau$-Charm factory -- the BEPCII, providing a unique platform for exploring the asymmetry of matter-antimatter (charge-parity violation), in-depth studies of the internal structure of hadrons and the nature of non-perturbative strong interactions, as well as searching for exotic hadrons and physics beyond the Standard Model. The STCF project in China is under development with an extensive R\&D program. This document presents the physics opportunities at the STCF, describes conceptual designs of the STCF detector system, and discusses future plans for detector R\&D and physics case studies.
Submitted 5 October, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
The $Z$ resonance, inelastic dark matter, and new physics anomalies in the Simple Extension of the Standard Model (SESM) with general scalar potential
Authors:
Wenxing Zhang,
Tianjun Li,
Xiangwei Yin
Abstract:
We consider the generic scalar potential with CP-violation, and study the $Z$ resonance and inelastic dark matter in the Simple Extension of the Standard Model (SESM), which can explain the dark matter as well as new physics anomalies such as the B physics anomalies and muon anomalous magnetic moment, etc. With the new scalar potential terms, we obtain the mass splittings for the real and imaginary parts of scalar fields. And thus we can have the DM co-annihilation process mediated by $Z$ boson, which couples exclusively to the CP-even and CP-odd parts of scalar fields. This is a brand new feature compared to the previous study. For the CP conserving case, we present the viable parameter space for the Higgs and $Z$ resonances, which can explain the B physics anomalies, muon anomalous magnetic moment, and dark matter relic density, as well as evade the constraint from the XENON1T direct detection simultaneously. For the CP-violating case, we consider the inelastic dark matter, and study four concrete scenarios for the inelastic DM-nucleon scatterings mediated by the Higgs and $Z$ bosons in detail. Also, we present the benchmark points which satisfy the aforementioned constraints. Furthermore, we investigate the constraints from the dark matter-electron inelastic scattering processes mediated by the Higgs and $Z$ bosons in light of the XENONnT data. We show that the constraint on the $Z$ mediated process is weak, while the Higgs mediated process excludes the dark matter with mass around several MeV.
Submitted 5 October, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Giant-atom entanglement in waveguide-QED systems including non-Markovian effect
Authors:
Xian-Li Yin,
Jie-Qiao Liao
Abstract:
We study the generation of quantum entanglement between two giant atoms coupled to a common one-dimensional waveguide. Here each giant atom interacts with the waveguide at two separate coupling points. Within the Wigner-Weisskopf framework for single coupling points, we obtain the time-delayed quantum master equations governing the evolution of the two giant atoms for three different coupling configurations: separated, braided, and nested couplings. For each coupling configuration, we consider both the Markovian and non-Markovian entanglement dynamics of the giant atoms, which are initially in two different separable states: single- and double-excitation states. Our results show that the generated entanglement depends on the phase shift, time delay, atomic initial state, and the coupling configuration. For the single-excitation initial state, there exists the steady-state entanglement for each coupling in both the Markovian and non-Markovian regimes due to the appearance of the dark state. For the double-excitation initial state, we observe entanglement sudden birth via adjusting the phase shift in both regimes. In particular, the maximally achievable entanglement for the nested coupling is about one order of magnitude larger than those of separate and braided couplings. We also find that the maximal entanglement for these three coupling configurations can be enhanced in the case of small time delays. This work can be utilized for the generation and control of entanglement in quantum networks based on giant-atom waveguide-QED systems, which have wide potential applications in quantum information processing.
Submitted 8 June, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Star-Net: Improving Single Image Desnowing Model With More Efficient Connection and Diverse Feature Interaction
Authors:
Jiawei Mao,
Yuanqi Chang,
Xuesong Yin,
Binling Nie
Abstract:
Compared to other severe weather image restoration tasks, single image desnowing is a more challenging task. This is mainly due to the diversity and irregularity of snow shape, which makes it extremely difficult to restore images in snowy scenes. Moreover, snow particles also have a veiling effect similar to haze or mist. Although current works can effectively remove snow particles with various shapes, they also bring distortion to the restored image. To address these issues, we propose a novel single image desnowing network called Star-Net. First, we design a Star type Skip Connection (SSC) to establish information channels for all different scale features, which can deal with the complex shape of snow particles. Second, we present a Multi-Stage Interactive Transformer (MIT) as the base module of Star-Net, which is designed to better understand snow particle shapes and to address image distortion by explicitly modeling a variety of important image recovery features. Finally, we propose a Degenerate Filter Module (DFM) to filter the snow particle and snow fog residual in the SSC on the spatial and channel domains. Extensive experiments show that our Star-Net achieves state-of-the-art snow removal performances on three standard snow removal datasets and retains the original sharpness of the images.
Submitted 17 March, 2023;
originally announced March 2023.
-
An in-depth exploration of LAMOST Unknown spectra based on density clustering
Authors:
Haifeng Yang,
Xiaona Yin,
Jianghui Cai,
Yuqing Yang,
Ali Luo,
Zhongrui Bai,
Lichan Zhou,
Xujun Zhao,
Yaling Xun
Abstract:
LAMOST (Large Sky Area Multi-Object Fiber Spectroscopic Telescope) has completed the observation of nearly 20 million celestial objects, including a class of spectra labeled `Unknown'. Besides low signal-to-noise ratio, these spectra often show some anomalous features that do not work well with current templates. In this paper, a total of 638,000 `Unknown' spectra from LAMOST DR5 are selected, and an unsupervised-based analytical framework of `Unknown' spectra named SA-Frame (Spectra Analysis-Frame) is provided to explore their origins from different perspectives. The SA-Frame is composed of three parts: NAPC-Spec clustering, characterization and origin analysis. First, NAPC-Spec (Nonparametric density clustering algorithm for spectra) characterizes different features in the `Unknown' spectra by adjusting the influence space and divergence distance to minimize the effects of noise and high dimensionality, resulting in 13 types. Second, characteristic extraction and representation of clustering results are carried out based on spectral lines and continuum, where these 13 types are characterized as regular spectra with low S/Ns, splicing problems, suspected galactic emission signals, contamination from city light and un-gregarious type respectively. Third, a preliminary analysis of their origins is made from the characteristics of the observational targets, contamination from the sky, and the working status of the instruments. These results would be valuable for improving the overall data quality of large-scale spectral surveys.
Submitted 17 March, 2023;
originally announced March 2023.
-
Sensor network design for post-combustion CO2 capture plants: economy, complexity and robustness
Authors:
Siyu Liu,
Xunyuan Yin,
**feng Liu
Abstract:
State estimation is crucial for the monitoring and control of post-combustion CO2 capture plants (PCCPs). The performance of state estimation is highly reliant on the configuration of sensors. In this work, we consider the problem of sensor selection for PCCPs and propose a computationally efficient method to determine an appropriate number of sensors and the corresponding placement of the sensors. The objective is to find the (near-)optimal set of sensors that provides the maximum degree of observability for state estimation while satisfying the budget constraint. Specifically, we resort to the information contained in the sensitivity matrix calculated around the operating region of a PCCP to quantify the degree of observability of the entire system corresponding to the placed sensors. The sensor selection problem is converted to an optimization problem, and is efficiently solved by a one-by-one removal approach through sensitivity analysis. Next, we extend our approach to study fault tolerance (resilience) of the selected sensors to sensor malfunction. The resilient sensor selection problem is to find a sensor network that gives good estimation performance even when some of the sensors fail, thereby improving the overall system robustness. The resilient sensor selection problem is formulated as a max-min optimization problem. We show how the proposed approach can be adapted to solve the sensor selection max-min optimization problem. By implementing the proposed approaches, the sensor network is configured for the PCCP efficiently.
Submitted 14 March, 2023;
originally announced March 2023.
-
LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion
Authors:
Chunfeng Wang,
Peisong Huang,
Yuxiang Zou,
Haoyu Zhang,
Shichao Liu,
Xiang Yin,
Zejun Ma
Abstract:
As a key component of automated speech recognition (ASR) and the front-end in text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations. Existing methods are either slow or poor in performance, and are limited in application scenarios, particularly in the process of on-device inference. In this paper, we integrate the advantages of both expert knowledge and connectionist temporal classification (CTC) based neural network and propose a novel method named LiteG2P which is fast, light and theoretically parallel. With the carefully leading design, LiteG2P can be applied both on cloud and on device. Experimental results on the CMU dataset show that the performance of the proposed method is superior to the state-of-the-art CTC based method with 10 times fewer parameters, and even comparable to the state-of-the-art Transformer-based sequence-to-sequence model with fewer parameters and 33 times less computation.
Submitted 2 March, 2023;
originally announced March 2023.
-
Boosting indirect detection of a secluded dark matter sector
Authors:
**mian Li,
Takaaki Nomura,
Junle Pei,
Xiangwei Yin,
Cong Zhang
Abstract:
Dark Matter (DM) residing in a secluded sector with suppressed portal interaction could evade direct detections and collider searches. The indirect detections provide the most robust probe to this scenario. Depending on the structure of the dark sector, novel DM annihilation spectra are possible. The dark shower is a common phenomenon for particles in the dark sector which take part in strong interactions and are boosted. In terms of simplified two-component DM models with vector portal interaction and pseudoscalar portal interaction, we study the dark showering effects for DM indirect detection. In those models, the heavier DM component which dominates the relic density annihilates into boosted lighter species. Together with the large coupling through which the lighter DM annihilates away in the early universe, the showered spectra serve as the smoking gun for the DM existence. Considering bounds obtained by the AMS-02 positron data and Fermi-LAT measurement of gamma-ray from the dwarf galaxies, we find the dark shower could open a new region of sensitivity that could not be probed before.
Submitted 11 August, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Spin State Disproportionation in Insulating Ferromagnetic LaCoO3 Epitaxial Thin Films
Authors:
Shanquan Chen,
Jhong-Yi Chang,
Qinghua Zhang,
Qiuyue Li,
Ting Lin,
Fanqi Meng,
Haoliang Huang,
Shengwei Zeng,
Xinmao Yin,
My Ngoc Duong,
Yalin Lu,
Lang Chen,
Er-Jia Guo,
Hanghui Chen,
Chun-Fu Chang,
Chang-Yang Kuo,
Zuhuang Chen
Abstract:
The origin of insulating ferromagnetism in epitaxial LaCoO3 films under tensile strain remains elusive despite extensive research efforts. Surprisingly, the spin state of its Co ions, the main parameter of its ferromagnetism, is still to be determined. Here, we have systematically investigated the spin state in epitaxial LaCoO3 thin films to clarify the mechanism of strain induced ferromagnetism using element-specific x-ray absorption spectroscopy and dichroism. Combining with the configuration interaction cluster calculations, we unambiguously demonstrate that Co3+ in LaCoO3 films under compressive strain (on LaAlO3 substrate) are practically a low spin state, whereas Co3+ in LaCoO3 films under tensile strain (on SrTiO3 substrate) have mixed high spin and low spin states with a ratio close to 1:3. From the identification of this spin state ratio, we infer that the dark strips observed by high-resolution scanning transmission electron microscopy indicate the position of Co3+ high spin state, i.e., an observation of a spin state disproportionation in tensile-strained LaCoO3 films. This consequently explains the nature of ferromagnetism in LaCoO3 films.
Submitted 12 February, 2023;
originally announced February 2023.
-
Research on data integration of overseas discrete archives from the perspective of digital humanities
Authors:
Rina Su,
Yumeng Li,
Xin Yang,
Xin Yin,
Tao Chen
Abstract:
The digitization of displaced archives is of great historical and cultural significance. Through the construction of digital humanities platforms, represented by the MISS platform, and the comprehensive application of IIIF technology, knowledge graph technology, ontology technology, and other popular information technologies, we find that the digital framework of displaced archives built through the MISS platform can promote the establishment of a standardized cooperation and dialogue mechanism between the archives authorities and other government departments. At the same time, it can embed the work of archives in the construction of digital government and the economy, promote the exploration of the integration of archives management, data management, and information resource management, and ultimately promote the construction of a digital society. By fostering a new partnership between archives departments and enterprises, think tanks, research institutes, and industry associations, the role of multiple social subjects in the modernization process of the archives governance system and governance capacity will be brought into play. The National Archives Administration has launched a special operation to recover scattered archives overseas, drawing up a list and a recovery action plan for archives lost to overseas institutions and individuals due to war and other reasons. Through the National Archives Administration, the State Administration of Cultural Heritage, the Ministry of Foreign Affairs, the Supreme People's Court, the Supreme People's Procuratorate, and the Ministry of Justice, specific recovery work is carried out by studying and working on international laws.
Submitted 9 February, 2023;
originally announced February 2023.
-
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Authors:
Rongjie Huang,
Jiawei Huang,
Dongchao Yang,
Yi Ren,
Lu** Liu,
Mingze Li,
Zhenhui Ye,
**glin Liu,
Xiang Yin,
Zhou Zhao
Abstract:
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity with orders of magnitude concept compositions by using language-free audios; and 2) leveraging a spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io
Submitted 29 January, 2023;
originally announced January 2023.
-
CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans
Authors:
Jieneng Chen,
Yingda Xia,
Jiawen Yao,
Ke Yan,
Jianpeng Zhang,
Le Lu,
Fakai Wang,
Bo Zhou,
Mingyan Qiu,
Qihang Yu,
Mingze Yuan,
Wei Fang,
Yuxing Tang,
Minfeng Xu,
Jian Zhou,
Yuqian Zhao,
Qifeng Wang,
Xianghua Ye,
Xiaoli Yin,
Yu Shi,
Xin Chen,
Jingren Zhou,
Alan Yuille,
Zaiyi Liu,
Ling Zhang
Abstract:
Human readers or radiologists routinely perform full-body multi-organ multi-disease detection and diagnosis in clinical practice, while most medical AI systems are built to focus on single organs with a narrow list of a few diseases. This might severely limit AI's clinical adoption. A certain number of AI models need to be assembled non-trivially to match the diagnostic process of a human reading a CT scan. In this paper, we construct a Unified Tumor Transformer (CancerUniT) model to jointly detect tumor existence and location and diagnose tumor characteristics for eight major cancers in CT scans. CancerUniT is a query-based Mask Transformer model with the output of multi-tumor prediction. We decouple the object queries into organ queries, tumor detection queries and tumor diagnosis queries, and further establish hierarchical relationships among the three groups. This clinically-inspired architecture effectively assists inter- and intra-organ representation learning of tumors and facilitates the resolution of these complex, anatomically related multi-organ cancer image reading tasks. CancerUniT is trained end-to-end using a curated large-scale collection of CT images from 10,042 patients, including eight major types of cancers and occurring non-cancer tumors (all are pathology-confirmed with 3D tumor masks annotated by radiologists). On the test set of 631 patients, CancerUniT has demonstrated strong performance under a set of clinically relevant evaluation metrics, substantially outperforming both multi-disease methods and an assembly of eight single-organ expert models in tumor detection, segmentation, and diagnosis. This moves one step closer towards a universal high performance cancer screening tool.
Submitted 6 October, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.
-
A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits
Authors:
Ying Wu,
Chuangtao Chen,
Weihua Xiao,
Xuan Wang,
Chenyi Wen,
Jie Han,
Xunzhao Yin,
Weikang Qian,
Cheng Zhuo
Abstract:
Given the stringent requirements of energy efficiency for Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially in error-resilient applications. The computation error and energy efficiency largely depend on how and where the approximation is introduced into a design. Thus, this article aims to provide a comprehensive review of the approximation techniques in multiplier designs ranging from algorithms and architectures to circuits. We have implemented representative approximate multiplier designs in each category to understand the impact of the design techniques on accuracy and efficiency. The designs can then be effectively deployed in high-level applications, such as machine learning, to gain energy efficiency at the cost of slight accuracy loss.
Submitted 29 June, 2023; v1 submitted 28 January, 2023;
originally announced January 2023.