doi 10.1109/ASRU57964.2023.10389743

Two-pass Endpoint Detection for Speech Recognition

Authors: Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran, Roland Maas, Ariya Rastrow

Abstract: Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the cases of users being cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected from a first-pass endpointer is verified by a 2nd-pass model termed EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.

Submitted 16 January, 2024; originally announced January 2024.

Comments: ASRU 2023

arXiv:2303.15132 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096820

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Authors: Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran

Abstract: We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.

Submitted 27 March, 2023; originally announced March 2023.

Comments: To appear in IEEE ICASSP 2023

Journal ref: Proc. IEEE ICASSP, June 2023

arXiv:2303.00692 [pdf, other]

Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Authors: Feng-Ju Chang, Anastasios Alexandridis, Rupak Vignesh Swaminathan, Martin Radfar, Harish Mallidi, Maurizio Omologo, Athanasios Mouchtaris, Brian King, Roland Maas

Abstract: To achieve robust far-field automatic speech recognition (ASR), existing techniques typically employ an acoustic front end (AFE) cascaded with a neural transducer (NT) ASR model. The AFE output, however, could be unreliable, as the beamforming output in AFE is steered to a wrong direction. A promising way to address this issue is to exploit the microphone signals before the beamforming stage and after the acoustic echo cancellation (post-AEC) in AFE. We argue that both, post-AEC and AFE outputs, are complementary and it is possible to leverage the redundancy between these signals to compensate for potential AFE processing errors. We present two fusion networks to explore this redundancy and aggregate these multi-channel (MC) signals: (1) Frequency-LSTM based, and (2) Convolutional Neural Network based fusion networks. We augment the MC fusion networks to a conformer transducer model and train it in an end-to-end fashion. Our experimental results on commercial virtual assistant tasks demonstrate that using the AFE output and two post-AEC signals with fusion networks offers up to 25.9% word error rate (WER) relative improvement over the model using the AFE output only, at the cost of <= 2% parameter increase.

Submitted 1 March, 2023; originally announced March 2023.

arXiv:2210.12335 [pdf, other]

doi 10.1109/SLT54892.2023.10022676

Guided contrastive self-supervised pre-training for automatic speech recognition

Authors: Aparna Khare, Minhua Wu, Saurabhchand Bhati, Jasha Droppo, Roland Maas

Abstract: Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model. It can be used to effectively initialize the encoder of an Automatic Speech Recognition (ASR) model. We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC). Our proposed method maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge injection during pre-training. We validate our method on 3 ASR tasks: German, French and English. Our method outperforms CPC pre-training on all three datasets, reducing the Word Error Rate (WER) by 4.44%, 6.55% and 15.43% relative on the German, French and English (Librispeech) tasks respectively, compared to training from scratch, while CPC pre-training only brings 2.96%, 1.01% and 14.39% relative WER reduction respectively.

Submitted 21 October, 2022; originally announced October 2022.

Comments: To appear in SLT 2022
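
Several abstracts in this listing report gains as a relative WER reduction. As a minimal sketch of how such a figure is usually computed, the helper below takes a baseline WER and a new WER and returns the percentage reduction. The function name and the numeric values are hypothetical illustrations, not taken from any of the papers.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    baseline = 10.0    # WER (%) of a model trained from scratch
    pretrained = 9.0   # WER (%) after some form of pre-training
    print(f"{relative_wer_reduction(baseline, pretrained):.2f}% relative")
```

A relative reduction is measured against the baseline error rate, so the same absolute gain looks larger when the baseline WER is already low.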

arXiv:2207.07850 [pdf, other]

doi 10.21437/Interspeech.2022-11063

Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

Authors: Viet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas

Abstract: We present an approach to reduce the performance disparity between geographic regions without degrading performance on the overall user population for ASR. A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER). However, when the ASR model is adapted to get better performance on these high-WER regions, its parameters wander from the previous optimal values, which can lead to worse performance in other regions. In our proposed method, we utilize the elastic weight consolidation (EWC) regularization loss to identify directions in parameter space along which the ASR weights can vary to improve for high-error regions, while still maintaining performance on the speaker population overall. Our results demonstrate that EWC can reduce the WER in the region with highest WER by 3.2% relative while reducing the overall WER by 1.3% relative. We also evaluate the role of language and acoustic models in ASR fairness and propose a clustering algorithm to identify WER disparities based on geographic region.

Submitted 16 July, 2022; originally announced July 2022.

Comments: Accepted for publication at Interspeech 2022

Journal ref: Proc. Interspeech, Sept. 2022, pp. 1298-1302
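
The EWC regularizer mentioned in the abstract above is commonly written as a quadratic penalty that anchors each parameter to its previous value, weighted by a diagonal Fisher information estimate. The sketch below shows that generic textbook form only; the function names, the flat-list parameter layout, and the `lam` strength are assumptions, and the paper's exact loss may differ.

```python
from typing import Sequence


def ewc_penalty(theta: Sequence[float],
                theta_star: Sequence[float],
                fisher: Sequence[float],
                lam: float) -> float:
    """Generic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current parameters (flattened; an assumption of this sketch)
    theta_star -- parameters of the previously trained model
    fisher     -- diagonal Fisher information estimate, one value per parameter
    lam        -- regularization strength (hypothetical hyperparameter name)
    """
    if not (len(theta) == len(theta_star) == len(fisher)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )


def ewc_loss(task_loss: float,
             theta: Sequence[float],
             theta_star: Sequence[float],
             fisher: Sequence[float],
             lam: float) -> float:
    """Fine-tuning loss on the new data plus the EWC penalty."""
    return task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

Parameters with a large Fisher value are expensive to move, so fine-tuning is steered toward directions that matter less for the original population.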

arXiv:2204.01013 [pdf, other]

doi 10.1063/5.0075911

Zero absolute vorticity plane Couette flow as an hydrodynamic representation of quantum energy states under perpendicular magnetic field

Authors: Eyal Heifetz, Leo R. M. Maas, Julian Mak

Abstract: Here we extend the Madelung transformation of the Schrödinger equation into a fluid-like form to include the influence of an external electromagnetic field on a charged particle. The vorticity of the Madelung fluid is then in the opposite direction to the imposed magnetic field and equal in magnitude to the cyclotron angular frequency. When the particle motion is confined to a plane, perpendicular to an imposed magnetic field, the equivalent flow dynamics is that of zero absolute vorticity obtained in a quasi 2D rotating frame, where the cyclotron frequency plays a role equivalent to that of the Coriolis frequency in a rotating frame. We show how the Landau levels and the extended modes in the integer quantum Hall effect are all mapped into such zero absolute vorticity-like plane Couette flows, where the latter exhibit a geostrophic-like balance between the magnetic force and the gradients of the quantum (Bohm) potential and the electric force.

Submitted 3 April, 2022; originally announced April 2022.

Comments: 17 pages, 3 figures; preprint version here; published in Physics of Fluids

Journal ref: Physics of Fluids 33, 127120 (2021)
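
The vorticity statement in the abstract above can be followed with the standard Madelung substitution under minimal coupling. The lines below are a short worked sketch in generic textbook notation (mass $m$, charge $q$, vector potential $\mathbf{A}$, phase $S$); the symbols and the sign convention for a positive charge are assumptions and need not match the paper's.

```latex
\begin{align}
  \psi &= \sqrt{\rho}\, e^{iS/\hbar}, &
  \mathbf{u} &= \frac{1}{m}\left(\nabla S - q\mathbf{A}\right), \\
  \nabla \times \mathbf{u} &= -\frac{q}{m}\,\nabla \times \mathbf{A}
      = -\frac{q}{m}\,\mathbf{B}, &
  \omega_c &= \frac{qB}{m}.
\end{align}
% For motion in a plane perpendicular to B, the vertical vorticity is
% \zeta = -\omega_c, so the "absolute vorticity" \zeta + \omega_c vanishes,
% with \omega_c playing the role of the Coriolis frequency.
```

Because the curl of a gradient is zero, only the vector potential contributes, which is why the vorticity is fixed by the magnetic field alone.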

arXiv:2202.10593 [pdf, other]

VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

Authors: Jinhan Wang, Xiaosu Tong, Jinxi Guo, Di He, Roland Maas

Abstract: While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, (partial) overlapping inference, are shown to be effective on long-form decoding. For both methods, word error rate (WER) decreases monotonically when overlapping percentage decreases. Setting aside computational cost, the setup with 50% overlapping during inference can achieve the best performance. However, a lower overlapping percentage has an advantage of fast inference speed. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost. Results show that the proposed method can achieve a 20% relative computation cost reduction on Librispeech and Microsoft Speech Language Translation long-form corpus while maintaining the WER performance when compared to the best performing overlapping inference algorithm. We also propose Soft-Match to compensate for the problem of mis-aligned similar words.

Submitted 21 February, 2022; originally announced February 2022.
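
The compute trade-off described in the abstract above comes from how a long recording is cut into overlapping windows before decoding. The sketch below illustrates only that generic fixed-overlap segmentation, not the VAD-based method the paper proposes; the function name, the sample-index units, and the parameter names are hypothetical.

```python
from typing import List, Tuple


def overlapping_segments(total_len: int, window: int,
                         overlap: float) -> List[Tuple[int, int]]:
    """Split [0, total_len) into windows of `window` samples.

    Consecutive windows share roughly `overlap` (a fraction in [0, 1)) of
    their length. Returns a list of (start, end) index pairs.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if total_len <= 0:
        return []
    # Hop size between window starts; at least 1 so the loop always advances.
    step = max(1, int(round(window * (1.0 - overlap))))
    segments = []
    start = 0
    while True:
        end = min(start + window, total_len)
        segments.append((start, end))
        if end >= total_len:
            break
        start += step
    return segments


if __name__ == "__main__":
    # More overlap means more windows to decode, hence more computation.
    print(len(overlapping_segments(10, 4, 0.5)))
    print(len(overlapping_segments(10, 4, 0.0)))
```

With the same window length, a 50% overlap roughly doubles the number of windows relative to no overlap, which is the cost that a lower overlapping percentage avoids.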

arXiv:2108.06213 [pdf, other]

doi 10.1088/2399-6528/ac3eec

On a formal equivalence between electro-magnetic waves in cold plasma and shallow water inertio-gravity waves

Authors: Eyal Heifetz, Leo R. M. Maas, Julian Mak, Ishay Pomerantz

Abstract: The fundamental dispersion relation of transverse electro-magnetic waves in a cold collisionless plasma is formally equivalent to the two dimensional dispersion relation of inertio-gravity waves in a rotating shallow water system, where the Coriolis frequency can be identified with the plasma frequency, and the shallow water gravity wave phase speed plays the role of the speed of light. Here we examine this formal equivalence in the governing linearised equations, and compare between the propagation wave mechanisms in these seemingly unrelated physical systems.

Submitted 13 August, 2021; originally announced August 2021.

Comments: 13 pages, 5 figures; submitted to Journal of Plasma Physics
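
The formal equivalence stated in the abstract above is easiest to see with the two textbook dispersion relations side by side. The notation below is generic and assumed here rather than taken from the paper: $\omega_p$ is the plasma frequency, $c$ the speed of light, $f$ the Coriolis frequency, $g$ gravity, $H$ the mean layer depth, and $\mathbf{k}$ the wavevector.

```latex
\begin{align}
  \text{cold plasma (transverse EM waves):}\quad
    \omega^2 &= \omega_p^2 + c^2 |\mathbf{k}|^2, \\
  \text{rotating shallow water (inertio-gravity waves):}\quad
    \omega^2 &= f^2 + gH\, |\mathbf{k}|^2 .
\end{align}
% Identifying f with \omega_p and \sqrt{gH} with c maps one relation onto the other.
```

In both systems the frequency has a lower cutoff ($\omega_p$ or $f$) below which waves of this type do not propagate.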

arXiv:2106.07803 [pdf, other]

SynthASR: Unlocking Synthetic Data for Speech Recognition

Authors: Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo

Abstract: End-to-end (E2E) automatic speech recognition (ASR) models have recently demonstrated superior performance over the traditional hybrid ASR models. Training an E2E ASR model requires a large amount of data which is not only expensive but may also raise dependency on production data. At the same time, synthetic speech generated by the state-of-the-art text-to-speech (TTS) engines has advanced to near-human naturalness. In this work, we propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training. In addition, we apply continual learning with a novel multi-stage training strategy to address catastrophic forgetting, achieved by a mix of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization. In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio via the proposed multi-stage training improved the recognition performance on new application by more than 65% relative, without degradation on existing general applications. Our observations show that SynthASR holds great promise in training the state-of-the-art large-scale E2E ASR models for new applications while reducing the costs and dependency on production data.

Submitted 14 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2106.02750 [pdf, other]

Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Authors: Gokce Keskin, Minhua Wu, Brian King, Harish Mallidi, Yang Gao, Jasha Droppo, Ariya Rastrow, Roland Maas

Abstract: Automatic speech recognition (ASR) models are typically designed to operate on a single input data type, e.g. a single or multi-channel audio streamed from a device. This design decision assumes the primary input data source does not change and if an additional (auxiliary) data source is occasionally available, it cannot be used. An ASR model that operates on both primary and auxiliary data can achieve better accuracy compared to a primary-only solution; and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable. In this work, we propose a unified ASR model that can serve both modes. We demonstrate its efficacy in a realistic scenario where a set of devices typically stream a single primary audio channel, and two additional auxiliary channels only when upload bandwidth allows it. The architecture enables a unique methodology that uses both types of input audio during training time. Our proposed approach achieves up to 12.5% relative word-error-rate reduction (WERR) compared to a PO baseline, and up to 16.0% relative WERR in low-SNR conditions. The unique training methodology achieves up to 2.5% relative WERR compared to a PPA baseline.

Submitted 28 June, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

arXiv:2105.05920 [pdf, ps, other]

Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition

Authors: Bhargav Pulugundla, Yang Gao, Brian King, Gokce Keskin, Harish Mallidi, Minhua Wu, Jasha Droppo, Roland Maas

Abstract: Attention-based beamformers have recently been shown to be effective for multi-channel speech recognition. However, they are less capable at capturing local information. In this work, we propose a 2D Conv-Attention module which combines convolution neural networks with attention for beamforming. We apply self- and cross-attention to explicitly model the correlations within and between the input channels. The end-to-end 2D Conv-Attention model is compared with a multi-head self-attention and superdirective-based neural beamformers. We train and evaluate on an in-house multi-channel dataset. The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.

Submitted 14 May, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

arXiv:2103.08393 [pdf, other]

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Authors: Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

Abstract: Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction over baseline and a higher codebook utilization in comparison to wav2vec 2.0.

Submitted 23 June, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: To appear in Interspeech 2021

arXiv:2012.07353 [pdf, other]

REDAT: Accent-Invariant Representation for End-to-End ASR by Domain Adversarial Training with Relabeling

Authors: Hu Hu, Xuesong Yang, Zeynab Raeesy, Jinxi Guo, Gokce Keskin, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Roland Maas

Abstract: Accents mismatching is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by the proof of equivalence, we introduce reDAT, a novel technique based on DAT, which relabels data using either unsupervised clustering or soft labels. Experiments on 23K hours of multi-accent data show that DAT achieves competitive results over accent-specific baselines on both native and non-native English accents but up to 13% relative WER reduction on unseen accents; our reDAT yields further improvements over DAT by 3% and 8% relatively on non-native accents of American and British English.

Submitted 12 February, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Comments: accepted in ICASSP 2021; final camera-ready version
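
The abstract above relates gradient reversal in DAT to minimizing the Jensen-Shannon divergence between domain output distributions. As a minimal sketch of the quantity itself (not of the paper's proof or training procedure), the helper below computes the JS divergence between two discrete distributions using natural logarithms; the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Identical distributions give zero; disjoint ones give the maximum, ln 2.
    print(js_divergence([0.5, 0.5], [0.5, 0.5]))
    print(js_divergence([1.0, 0.0], [0.0, 1.0]), math.log(2))
```

Unlike the KL divergence, this measure is symmetric and bounded, so it stays finite even when the two distributions do not overlap.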

arXiv:2009.06928 [pdf, other]

doi 10.1017/jfm.2021.703

Vortex cluster arising from an axisymmetric inertial wave attractor

Authors: Samuel Boury, Ilias Sibgatullin, Evgeny Ermanyuk, Natalia Shmakova, Philippe Odier, Sylvain Joubaud, Leo R. M. Maas, Thierry Dauxois

Abstract: We present an experimental study of the saturated non-linear dynamics of an inertial wave attractor in an axisymmetric geometrical setting. The experiments are carried out in a rotating ring-shaped fluid domain delimited by two vertical coaxial cylinders, a conical bottom, and a horizontal deformable upper lid as wave generator: the meridional cross-section of the fluid volume is a trapezium, while the horizontal cross-section is a ring. First, the fluid is set into a rigid-body rotation. Thereafter, forcing is introduced into the system via axisymmetric low-amplitude volume-conserving oscillatory motion of the upper lid. After a short transient of about 10 forcing periods, a quasi-linear regime is established, with an axisymmetric inertial wave attractor. The attractor is prone to instability: at long time-scale (order 100 forcing periods) a saturated fully non-linear regime develops as a consequence of an energy cascade draining energy towards a slow two-dimensional manifold represented by a regular polygonal system of axially-oriented cyclonic vortices that are slowly precessing around the inner cylinder. We show that this slow two-dimensional manifold manifests a persistent slow prograde motion and a strong cyclonic-anticyclonic asymmetry quantified by the time-evolution of the probability density function of the vertical vorticity.

Submitted 15 September, 2020; originally announced September 2020.

arXiv:2008.08409 [pdf, other]

Early RTL Analysis for SCA Vulnerability in Fuzzy Extractors of Memory-Based PUF Enabled Devices

Authors: Xinhui Lai, Maksim Jenihhin, Georgios Selimis, Sven Goossens, Roel Maes, Kolin Paul

Abstract: Physical Unclonable Functions (PUFs) are gaining attention in the cryptography community because of the ability to efficiently harness the intrinsic variability in the manufacturing process. However, this means that they are noisy devices and require error correction mechanisms, e.g., by employing Fuzzy Extractors (FEs). Recent works demonstrated that applying FEs for error correction may enable new opportunities to break the PUFs if no countermeasures are taken. In this paper, we address an attack model on FEs hardware implementations and provide a solution for early identification of the timing Side-Channel Attack (SCA) vulnerabilities which can be exploited by physical fault injection. The significance of this work stems from the fact that FEs are an essential building block in the implementations of PUF-enabled devices. The information leaked through the timing side-channel during the error correction process can reveal the FE input data and thereby can endanger revealing secrets. Therefore, it is very important to identify the potential leakages early in the process during RTL design. Experimental results based on RTL analysis of several Bose-Chaudhuri-Hocquenghem (BCH) and Reed-Solomon decoders for PUF-enabled devices with FEs demonstrate the feasibility of the proposed methodology.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 6 pages, 5 figures, 2 tables

arXiv:2007.15909 [pdf, other]

Long-term continuous assessment of SRAM PUF and source of random numbers

Authors: Rui Wang, Georgios Selimis, Roel Maes, Sven Goossens

Abstract: The qualities of Physical Unclonable Functions (PUFs) suffer from several noticeable degradations due to silicon aging. In this paper, we investigate the long-term effects of silicon aging on PUFs derived from the start-up behavior of Static Random Access Memories (SRAM). Previous research on SRAM aging is based on transistor-level simulation or accelerated aging test at high temperature and voltage to observe aging effects within a short period of time. In contrast, we have run a long-term continuous power-up test on 16 Arduino Leonardo boards under nominal conditions for two years. In total, we collected around 175 million measurements for reliability, uniqueness and randomness evaluations. Analysis shows that the number of bits that flip with respect to the reference increased by 19.3% while min-entropy of SRAM PUF noise improves by 19.3% on average after two years of aging. The impact of aging on reliability is smaller under nominal conditions than was previously assessed by the accelerated aging test. The test we conduct in this work more closely resembles the conditions of a device in the field, and therefore we more accurately evaluate how silicon aging affects SRAM PUFs.

Submitted 31 July, 2020; originally announced July 2020.

arXiv:2007.15366 [pdf]

Influencia del Buffer del Router en la Distribucíon de Video P2P-TV

Authors: Idelkys Quintana, Jose Saldana, Jose Ruiz Mas, Julian Fernández Navajas, Luis A. Casadesus Pazos, Luis Sequeira

Abstract: This work presents a study of the behaviour of the router buffer when managing the traffic of P2P-TV applications, where a number of peers exchange video content. First, a summary of the characteristics of SOPCast is presented. Then, the results obtained in simulation tests using different buffer policies are presented. Real traces of the application, obtained from a research project, have been used for the tests, sharing the Internet access with different amounts of background traffic. The results show a similar buffer behaviour for all the access technologies. In addition, the large number of small packets generated may impair the video traffic, thus avoiding the retransmission of the contents by the application.

Submitted 30 July, 2020; originally announced July 2020.

Comments: in Spanish

Journal ref: Actas del XXVII Simposium Nacional de la Union Científica Internacional de Radio (URSI 2012). Elche (Spain). Sept. 2012. ISBN: 978-84-695-4327-6

arXiv:2007.15096 [pdf]

Comparison of Multiplexing Policies for FPS Games in terms of Subjective Quality

Authors: Jose Saldana, Julian Fernandez Navajas, Jose Ruiz Mas, Luis Sequeira, Luis Casadesus

Abstract: This paper compares two policies which can be used for multiplexing the traffic of a number of players of a First Person Shooter game. A network scenario in which a number of players share an access network has been simulated, in order to compare the policies in terms of a subjective quality estimator. The first policy, namely timeout, achieves higher bandwidth savings, while the second one, period, introduces less delay and jitter. The results show that the difference in terms of QoE is only significant when the number of players is small. Thus, in order to make the correct decision, the concrete network scenario and the characteristics of the router would have to be considered in each case, taking into account the estimation of the subjective quality that can be expected.

Submitted 29 July, 2020; originally announced July 2020.

Journal ref: Proc. II Workshop on Multimedia Data Coding and Transmission 2012, Jornadas Sarteco. Elche (Spain). Sept. 2012. ISBN: 978-84-695-4472-3

arXiv:2007.13802 [pdf, other]

Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

Authors: Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

Abstract: In this work, we propose a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, in our proposed method, we re-calculate and sum scores of all the possible alignments for each hypothesis in N-best lists. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows us to decouple the decoding and training processes, and thus we can perform offline parallel-decoding and MWER training for each subset iteratively. Experimental results show that this proposed semi-on-the-fly method can speed up the on-the-fly method by 6 times and result in a similar WER improvement (3.6%) over a baseline RNN-T model. The proposed MWER training can also effectively reduce high-deletion errors (9.2% WER-reduction) introduced by RNN-T models when EOS is added for endpointer. Further improvement can be achieved if we use a proposed RNN-T rescoring method to re-rank hypotheses and use external RNN-LM to perform additional rescoring. The best system achieves a 5% relative improvement on an English test-set of real far-field recordings and an 11.6% WER reduction on music-domain utterances.

Submitted 27 July, 2020; originally announced July 2020.

Comments: Accepted to Interspeech 2020

arXiv:2007.09245 [pdf, other]

Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

Authors: Xiaosu Tong, Che-Wei Huang, Sri Harish Mallidi, Shaun Joseph, Sonal Pareek, Chander Chandak, Ariya Rastrow, Roland Maas

Abstract: In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. The streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer to form the final utterance-level prediction up to the current frame. In order to avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non-device-directed task show that the proposed model yields an equal error rate reduction of 41% compared to our previous best model on this task. Furthermore, we show that the proposed model is able to predict accurately earlier in time than the attention-based models.

Submitted 17 July, 2020; originally announced July 2020.
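
Note: the causal mean aggregation mentioned in the abstract above can be illustrated with a running average that, at frame t, uses only frames 0..t. The following is a minimal sketch under stated assumptions, not the authors' layer: the function name `causal_mean` and the scalar per-frame scores are hypothetical, and the real model presumably aggregates LSTM output vectors rather than single numbers.

```python
from itertools import accumulate


def causal_mean(frame_scores):
    """Return the running mean of frame_scores, one value per frame.

    The value at index t depends only on frames 0..t, so it can be
    emitted while audio is still streaming in (assumed scalar scores).
    """
    running_sums = accumulate(frame_scores)
    return [total / (t + 1) for t, total in enumerate(running_sums)]


# Hypothetical per-frame scores; the last entry is the utterance-level mean.
print(causal_mean([1.0, 3.0, 5.0]))
```

Because each output needs only the previous running sum and a frame counter, a streaming implementation would keep constant state per utterance rather than the full history.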

arXiv:2007.03900 [pdf, other]

Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

Authors: Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann

Abstract: Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2007.00131 [pdf, other]

Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition

Authors: Maarten Van Segbroeck, Harish Mallidih, Brian King, I-Fan Chen, Gurpreet Chadha, Roland Maas

Abstract: Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature to the time LSTM layers by modeling time-frequency correlations in the acoustic input signals. A drawback of FLSTM-based architectures, however, is that they operate at a predefined, and tuned, window size and stride, referred to as 'view' in this paper. We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views into a dimensionality-reduced feature representation. The proposed multi-view FLSTM architecture can model a wider range of time-frequency correlations than an FLSTM model with a single view. When trained on 50K hours of English far-field speech data with CTC loss followed by sMBR sequence training, we show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios over an optimized single FLSTM model, while retaining a similar computational footprint.

Submitted 30 June, 2020; originally announced July 2020.

arXiv:2006.00703 [pdf, other]

Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses

Authors: Chander Chandak, Zeynab Raeesy, Ariya Rastrow, Yuzong Liu, Xiangyang Huang, Siyu Wang, Dong Kwon Joo, Roland Maas

Abstract: This paper presents our modeling and architecture approaches for building a highly accurate low-latency language identification system to support multilingual spoken queries for voice assistants. A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel and rely on a language identification (LID) component that detects the input language. Conventionally, LID relies on acoustic-only information to detect the input language. We propose an approach that learns and combines acoustic-level representations with embeddings estimated on ASR hypotheses, resulting in up to 50% relative reduction of identification error rate compared to a model that uses acoustic-only features. Furthermore, to reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early when the system reaches a predetermined confidence level, alleviating the need to run multiple ASR systems until the end of the input query. The combined acoustic and text LID, coupled with our proposed streaming runtime architecture, results in an average of 1500ms early identification for more than 50% of utterances, with almost no degradation in accuracy. We also show improved results by adopting a semi-supervised learning (SSL) technique using the newly proposed model architecture as a teacher model.

Submitted 1 June, 2020; originally announced June 2020.

Comments: 5 pages, 2 figures

arXiv:1909.13447 [pdf]

DiPCo -- Dinner Party Corpus

Authors: Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas

Abstract: We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.

Submitted 30 September, 2019; originally announced September 2019.

arXiv:1904.07217 [pdf, other]

Resurgence, a problem of missing exponential corrections in asymptotic expansions

Authors: Ramon Miravitllas Mas

Abstract: It is well known that perturbative expansions of path integrals are divergent. These expansions are to be understood as asymptotic expansions, which encode the limiting behaviour of the path integral for positive small coupling. Conventionally, the method of Borel summation assigns a finite answer to the divergent expansion. Still, the Borel sum might not encode the full information of a function, because it misses exponentially small corrections. In the present work, we consider a slight variation of the conventional Borel summation, in which a generalised Borel transform (an inverse Laplace transform) is followed by a directional Laplace transform. These new tools will allow us to give perhaps better answers to typical problems in Borel summation: missing exponential corrections and ambiguities in the Borel summation. In addition, we will define resurgence as a connection between the discontinuity of a function and the coefficients of its asymptotic expansion. From this definition, we will be able to reduce resurgence to the problem of missing exponential corrections in asymptotic expansions and understand, within a unified framework, different approaches to resurgence found in the literature.

Submitted 15 April, 2019; originally announced April 2019.
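
Note: for orientation, the conventional Borel summation that the abstract above takes as its starting point can be written as follows. This is the standard textbook form, given here as a sketch; the symbols $a_n$, $g$ and $\zeta$ are notation assumed for this note, and the paper's generalised transform may use different conventions.

```latex
% Formal (divergent) series in the coupling g, with coefficients a_n:
\varphi(g) \sim \sum_{n=0}^{\infty} a_n \, g^{n}

% Borel transform: divide out the factorial growth of the coefficients
\mathcal{B}\varphi(\zeta) = \sum_{n=0}^{\infty} \frac{a_n}{n!} \, \zeta^{n}

% Borel sum: a Laplace-type integral along the positive real axis
\mathcal{S}\varphi(g) = \int_{0}^{\infty} e^{-\zeta} \, \mathcal{B}\varphi(g\zeta) \, d\zeta
```

Term by term, $\int_{0}^{\infty} e^{-\zeta} (g\zeta)^{n}/n! \, d\zeta = g^{n}$, so the Borel sum reproduces the original series order by order whenever the integral converges.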

arXiv:1901.02348 [pdf, other]

Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

Authors: Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, Björn Hoffmeister

Abstract: For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher and to reduce bandwidth needed for transferring data. We incorporate up to 8000 hours of untranscribed data for training and present our results on sequence-trained models in addition to cross-entropy-trained ones. The best sequence-trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets, respectively, compared to a sequence-trained teacher.

Submitted 15 March, 2019; v1 submitted 5 January, 2019; originally announced January 2019.

Comments: To Appear in ICASSP 2019
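
Note: the logits selection step described in the abstract above (preserve only the k highest teacher values) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the name `keep_top_k` and the sparse dictionary output are hypothetical, and how the discarded logits are treated during student training is not stated in the abstract.

```python
import heapq


def keep_top_k(logits, k):
    """Keep the k largest teacher logits as {class_index: value}.

    Storing only k (index, value) pairs per frame is what reduces the
    amount of data that has to be transferred to the student.
    """
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    return dict(top)


# Hypothetical teacher logits for one frame over five output classes.
teacher_logits = [0.1, 2.5, -1.0, 3.2, 0.7]
print(keep_top_k(teacher_logits, 2))
```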

arXiv:1810.02714 [pdf, ps, other]

doi 10.1017/jfm.2019.251

Particle transport induced by internal wave beam streaming in lateral boundary layers

Authors: E. Horne, F. Beckebanze, D. Micard, P. Odier, L. R. M. Maas, S. Joubaud

Abstract: Quantifying the physical mechanisms responsible for the transport of sediments, nutrients and pollutants in the abyssal sea is a long-standing problem, with internal waves regularly invoked as the relevant mechanism for particle advection near the sea bottom. This study focuses on internal-wave induced particle transport in the vicinity of (almost) vertical walls. We report a series of laboratory experiments revealing that particles sinking slowly through a monochromatic internal wave beam experience significant horizontal advection. Extending the theoretical analysis by Beckebanze et al. (2018a), we attribute the observed particle advection to a peculiar and previously unrecognized streaming mechanism originating at the lateral walls. This vertical boundary-layer streaming mechanism is most efficient for strongly inclined wave beams, when vertical and horizontal velocity components are of comparable magnitude. We find good agreement between our theoretical prediction and experimental results.

Submitted 5 October, 2018; originally announced October 2018.

Comments: 19 pages, 8 figures

arXiv:1809.07832 [pdf, other]

LSTM-based Whisper Detection

Authors: Zeynab Raeesy, Kellen Gillespie, Zhenpei Yang, Chengyuan Ma, Thomas Drugman, Jiacheng Gu, Roland Maas, Ariya Rastrow, Björn Hoffmeister

Abstract: This article presents a whisper speech detector in the far-field domain. The proposed system consists of a long short-term memory (LSTM) neural network trained on log-filterbank energy (LFBE) acoustic features. This model is trained and evaluated on recordings of human interactions with voice-controlled, far-field devices in whisper and normal phonation modes. We compare multiple inference approaches for utterance-level classification by examining trajectories of the LSTM posteriors. In addition, we engineer a set of features based on the signal characteristics inherent to whisper speech, and evaluate their effectiveness in further separating whisper from normal speech. A benchmarking of these features using multilayer perceptrons (MLP) and LSTMs suggests that the proposed features, in combination with LFBE features, can help us further improve our classifiers. We prove that, with enough data, the LSTM model is as capable of learning whisper characteristics from LFBE features alone as a simpler MLP model that uses both LFBE and features engineered for separating whisper and normal speech. In addition, we prove that the LSTM classifier's accuracy can be further improved with the incorporation of the proposed engineered features.

Submitted 5 April, 2020; v1 submitted 20 September, 2018; originally announced September 2018.

arXiv:1808.02504 [pdf, other]

Device-directed Utterance Detection

Authors: Sri Harish Mallidi, Roland Maas, Kyle Goehner, Ariya Rastrow, Spyros Matsoukas, Björn Hoffmeister

Abstract: In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false wake-ups or unintended interactions as well as enabling wake-word free follow-up queries. Consider the example interaction: $"Computer,~play~music", "Computer,~reduce~the~volume"$. In this interaction, the user needs to repeat the wake-word ($Computer$) for the second query. To allow for more natural interactions, the device could immediately re-enter listening state after the first query (without wake-word repetition) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks trained on acoustic features and automatic speech recognition (ASR) 1-best hypotheses, respectively. A feed-forward deep neural network (DNN) is then trained to combine the acoustic and 1-best embeddings, derived from the LSTMs, with features from the ASR decoder. Experimental results show that ASR decoder, acoustic embeddings, and 1-best embeddings yield an equal-error-rate (EER) of $9.3~\%$, $10.9~\%$ and $20.1~\%$, respectively. Combination of the features resulted in a $44~\%$ relative improvement and a final EER of $5.2~\%$.

Submitted 7 August, 2018; originally announced August 2018.

Comments: Interspeech 2018 (accepted)
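
Note: the relative improvement quoted in the abstract above can be checked with a one-line calculation. The sketch below assumes the baseline is the best single feature set (the ASR decoder features at 9.3% EER); the abstract does not name the baseline explicitly, and the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline - improved) / baseline


# EER values taken from the abstract: 9.3% (assumed baseline), 5.2% (combined).
print(f"{relative_reduction(9.3, 5.2):.0%}")  # prints 44%
```

Measured against the acoustic embeddings alone (10.9%), the same formula would give a larger figure, which is why the 9.3% baseline is the reading consistent with the reported 44%.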

arXiv:1805.12356 [pdf, other]

doi 10.1017/jfm.2019.22

Mean flow generation by three-dimensional non-linear internal wave beams

Authors: F. Beckebanze, K. Raja, L. R. M. Maas

Abstract: We study the generation of strong mean flow by weakly non-linear internal wave beams. With a perturbational expansion, we construct analytic solutions for 3D internal wave beams, exact up to first order accuracy in the viscosity parameter. We specifically focus on the subtleties of wave beam generation by oscillating boundaries, such as wave makers in laboratory set-ups. The exact solutions to the linearized equations allow us to derive an analytic expression for the mean vertical vorticity production term, which induces a horizontal mean flow. Whereas mean flow generation associated with viscous beam attenuation - known as streaming - has been described before, we are the first to also include a peculiar inviscid mean flow generation in the vicinity of the oscillating wall, resulting from line vortices at the lateral edges of the oscillating boundary. Our theoretical expression for the mean vertical vorticity production is in good agreement with earlier laboratory experiments, for which the previously unrecognized inviscid mean flow generation mechanism turns out to be significant.

Submitted 12 December, 2018; v1 submitted 31 May, 2018; originally announced May 2018.

arXiv:1805.04324 [pdf, other]

doi 10.1017/jfm.2018.236

Internal Wave Attractors in 3D Geometries: trapping by oblique reflection

Authors: G. Pillet, E. V. Ermanyuk, L. R. M. Maas, I. N. Sibgatullin, T. Dauxois

Abstract: We study experimentally the propagation of internal waves in two different three-dimensional (3D) geometries, with a special emphasis on the refractive focusing due to the 3D reflection of obliquely incident internal waves on a slope. Both studies are initiated by ray tracing calculations to determine the appropriate experimental parameters. First, we consider a 3D geometry, the classical set-up to get simple, 2D parallelogram-shaped attractors in which waves are forced in a direction perpendicular to a sloping bottom. Here, however, the forcing is of reduced extent in the along-slope, transverse direction. We show how the refractive focusing mechanism explains the formation of attractors over the whole width of the tank, even away from the forcing region. Direct numerical simulations confirm the dynamics, emphasize the role of boundary conditions and reveal the phase shifting in the transversal direction. Second, we consider a long and narrow tank having an inclined bottom, to simply reproduce a canal. In this case, the energy is injected in a direction parallel to the slope. Interestingly, the wave energy ends up forming 2D internal wave attractors in planes that are transverse to the initial propagation direction. This focusing mechanism prevents indefinite transmission of most of the internal wave energy along the canal.

Submitted 11 May, 2018; originally announced May 2018.

Journal ref: Journal of Fluid Mechanics 845, 203-225 (2018)

arXiv:1707.08009 [pdf, other]

doi 10.1017/jfm.2018.107

Damping of quasi-2D internal wave attractors by rigid-wall friction

Authors: F. Beckebanze, C. Brouzet, I. N. Sibgatullin, L. R. M. Maas

Abstract: The reflection of internal gravity waves at sloping boundaries leads to focusing or defocusing. In closed domains, focusing typically dominates and projects the wave energy onto 'wave attractors'. For small-amplitude internal waves, the projection of energy onto higher wave numbers by geometric focusing can be balanced by viscous dissipation at high wave numbers. Contrary to what was previously suggested, viscous dissipation in interior shear layers may not be sufficient to explain the experiments on wave attractors in the classical quasi-2D trapezoidal laboratory set-ups. Applying standard boundary layer theory, we provide an elaborate description of the viscous dissipation in the interior shear layer, as well as at the rigid boundaries. Our analysis shows that even if the thin lateral Stokes boundary layers account for no more than 1% of the wall-to-wall distance, dissipation by lateral walls dominates at intermediate wave numbers. Our extended model for the spectrum of 3D wave attractors in equilibrium closes the gap between observations and theory by Hazewinkel et al. (2008).

Submitted 25 July, 2017; originally announced July 2017.

arXiv:1701.07320 [pdf, other]

doi 10.1109/GLOCOM.2017.8254007

A Robust SRAM-PUF Key Generation Scheme Based on Polar Codes

Authors: Bin Chen, Tanya Ignatenko, Frans M. J. Willems, Roel Maes, Erik van der Sluis, Georgios Selimis

Abstract: Physical unclonable functions (PUFs) are relatively new security primitives used for device authentication and device-specific secret key generation. In this paper we focus on SRAM-PUFs. The SRAM-PUFs enjoy uniqueness and randomness properties stemming from the intrinsic randomness of SRAM memory cells, which is a result of manufacturing variations. This randomness can be translated into cryptographic keys, thus avoiding the need to store and manage the device cryptographic keys. Therefore these properties, combined with the fact that SRAM memory can often be found in today's IoT devices, make SRAM-PUFs a promising candidate for securing and authenticating resource-constrained IoT devices. PUF observations are always affected by noise and environmental changes. Therefore secret-generation schemes with helper data are used to guarantee reliable regeneration of the PUF-based secret keys. Error correction codes (ECCs) are an essential part of these schemes. In this work, we propose a practical error correction construction for PUF-based secret generation that is based on polar codes. The resulting scheme can generate $128$-bit keys using $1024$ SRAM-PUF bits and $896$ helper data bits and achieve a failure probability of $10^{-9}$ or lower for a practical SRAM-PUF setting with bit error probability of $15\%$. The method is based on successive cancellation combined with list decoding and hash-based checking that makes use of the hash that is already available at the decoder. In addition, an adaptive list decoder for polar codes is investigated. This decoder increases the list size only if needed.

Submitted 27 July, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

Comments: 7 pages, 5 figures, Globecom 2017

arXiv:1611.05306 [pdf, other]

doi 10.1016/j.nima.2017.05.003

Thermal and hydrodynamic studies for micro-channel cooling for large area silicon sensors in high energy physics experiments

Authors: Nils Flaschel, Dario Ariza, Sergio Diez, Marta Gerboles, Ingrid-Maria Gregor, Xavier Jorda, Roser Mas, David Quirion, Kerstin Tackmann, Miguel Ullan

Abstract: Micro-channel cooling, initially aimed at small-sized high-power integrated circuits, is being transferred to the field of high energy physics. Today's prospects of micro-fabricating silicon open a door to a more direct cooling of detector modules. The challenge in high energy physics is to save material in the detector construction and to cool large areas. In this paper, we are investigating micro-channel cooling as a candidate for a future cooling system for silicon detectors in a generic research and development approach. The work presented in this paper includes the production and the hydrodynamic and thermal testing of a micro-channel equipped prototype optimized to achieve a homogeneous flow distribution. Furthermore, the device was simulated using finite element methods.

Submitted 9 May, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

Comments: 10 pages, submitted to NIMA (accepted)

arXiv:1604.04198 [pdf, ps, other]

Estimating parameters of nonlinear systems using the elitist particle filter based on evolutionary strategies

Authors: Christian Huemmer, Christian Hofmann, Roland Maas, Walter Kellermann

Abstract: In this article, we present the elitist particle filter based on evolutionary strategies (EPFES) as an efficient approach for nonlinear system identification. The EPFES is derived from the frequently-employed state-space model, where the relevant information of the nonlinear system is captured by an unknown state vector. Similar to classical particle filtering, the EPFES consists of a set of particles and respective weights which represent different realizations of the latent state vector and their likelihood of being the solution of the optimization problem. As main innovation, the EPFES includes an evolutionary elitist-particle selection which combines long-term information with instantaneous sampling from an approximated continuous posterior distribution. In this article, we propose two advancements of the previously-published elitist-particle selection process. Further, the EPFES is shown to be a generalization of the widely-used Gaussian particle filter and thus evaluated with respect to the latter for two completely different scenarios: First, we consider the so-called univariate nonstationary growth model with time-variant latent state variable, where the evolutionary selection of elitist particles is evaluated for non-recursively calculated particle weights. Second, the problem of nonlinear acoustic echo cancellation is addressed in a simulated scenario with speech as input signal: By using long-term fitness measures, we highlight the efficacy of the well-generalizing EPFES in estimating the nonlinear system even for large search spaces. Finally, we illustrate similarities between the EPFES and evolutionary algorithms to outline future improvements by fusing the achievements of both fields of research.

Submitted 25 May, 2016; v1 submitted 14 April, 2016; originally announced April 2016.

Comments: 13 pages, 13 figures

arXiv:1604.02958 [pdf, ps, other]

Thermal tuners on a Silicon Nitride platform

Authors: Daniel Pérez, Juan Fernández, Rocío Baños, José David Doménech, Ana M. Sánchez, Josep M. Cirera, Roser Mas, Javier Sánchez, Sara Durán, Emilio Pardo, Carlos Domínguez, Daniel Pastor, José Capmany, Pascual Muñoz

Abstract: In this paper, the design trade-offs for the implementation of small footprint thermal tuners on silicon nitride are presented, and explored through measurements and supporting simulations of a photonic chip based on Mach-Zehnder Interferometers. Firstly, the electrical properties of the tuners are assessed, showing a compromise between compactness and deterioration. Secondly, the different variables involved in the thermal efficiency, switching power and heater dimensions, are analysed. Finally, with focus on exploring the limits of these compact tuners with regards to on chip component density, the thermal-cross talk is also investigated. Tuners with footprint of 270x5 μm² and switching power of 350 mW are reported, with thermal-cross talk, in terms of induced phase change in adjacent devices of less than one order of magnitude at distances over 20 μm. Paths for the improvement of thermal efficiency, power consumption and resilience of the devices are also outlined.

Submitted 11 April, 2016; originally announced April 2016.

arXiv:1411.4834 [pdf, ps, other]

doi 10.1109/LSP.2015.2439392

The NLMS algorithm with time-variant optimum stepsize derived from a Bayesian network perspective

Authors: Christian Huemmer, Roland Maas, Walter Kellermann

Abstract: In this article, we derive a new stepsize adaptation for the normalized least mean square algorithm (NLMS) by describing the task of linear acoustic echo cancellation from a Bayesian network perspective. Similar to the well-known Kalman filter equations, we model the acoustic wave propagation from the loudspeaker to the microphone by a latent state vector and define a linear observation equation (to model the relation between the state vector and the observation) as well as a linear process equation (to model the temporal progress of the state vector). Based on additional assumptions on the statistics of the random variables in observation and process equation, we apply the expectation-maximization (EM) algorithm to derive an NLMS-like filter adaptation. By exploiting the conditional independence rules for Bayesian networks, we reveal that the resulting EM-NLMS algorithm has a stepsize update equivalent to the optimal-stepsize calculation proposed by Yamamoto and Kitayama in 1982, which has been adopted in many textbooks. As main difference, the instantaneous stepsize value is estimated in the M step of the EM algorithm (instead of being approximated by artificially extending the acoustic echo path). The EM-NLMS algorithm is experimentally verified for synthesized scenarios with both, white noise and male speech as input signal.

Submitted 18 November, 2014; originally announced November 2014.

Comments: 4 pages, 1 page of references

arXiv:1410.2479 [pdf, other]

doi 10.1109/ICASSP.2015.7178798

Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Authors: Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann

Abstract: We propose a spatial diffuseness feature for deep neural network (DNN)-based automatic speech recognition to improve recognition accuracy in reverberant and noisy environments. The feature is computed in real-time from multiple microphone signals without requiring knowledge or estimation of the direction of arrival, and represents the relative amount of diffuse noise in each time and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model leads to a reduced word error rate for the REVERB challenge corpus, both compared to logmelspec features extracted from noisy signals, and features enhanced by spectral subtraction.

Submitted 16 February, 2015; v1 submitted 9 October, 2014; originally announced October 2014.

Comments: accepted for ICASSP2015

arXiv:1404.4707 [pdf]

Negative refractive index and higher-order harmonics in layered metallodielectric optical metamaterials

Authors: Ruben Maas, Ewold Verhagen, James Parsons, Albert Polman

Abstract: We study the propagation of light in a three-dimensional double-periodic Ag/TiO2 multilayer metamaterial composed of coupled plasmonic waveguides operating in the visible and UV spectral range. For these frequencies, light propagation in the plane of the waveguides is described by a negative phase velocity, while for the orthogonal direction light propagation is described by a Bloch wave composed of a large number of harmonics. As a result, the material cannot generally be described by a single phase index: decomposing the Bloch wave into different harmonics we show that for the wavelength range of interest the positive index m=1 harmonic dominates the propagation of light in the orthogonal direction. These results are corroborated by numerical simulations and optical refraction experiments on a double-periodic Ag/TiO2 multilayer metamaterial prism in the 380-600 nm spectral range, which show that positive refraction associated with right-handed harmonics dominates. Studying the isofrequency contours we find that despite the occurrence of multiple harmonics the double-periodic structure can act as a flat lens: for a slab consisting of an integer number of unit cells all harmonics are degenerate and constructively interfere at the image plane. This work identifies important considerations relevant to the design of many three dimensional periodic metamaterials.

Submitted 18 April, 2014; originally announced April 2014.

Comments: 11 pages, 7 figures

arXiv:1402.2421 [pdf, other]

doi 10.1017/jfm.2013.310

Meridional trapping and zonal propagation of inertial waves in a rotating fluid shell

Authors: Anna Rabitti, Leo R. M. Maas

Abstract: Inertial waves propagate in homogeneous rotating fluids, and constitute a challenging and simplified case study for the broader class of inertio-gravity waves, present in all geophysical and astrophysical media, and responsible for energetically costly processes as diapycnal and angular momentum mixing. However, a complete analytical description and understanding of internal waves in arbitrarily shaped enclosed domains, such as the ocean, or a planet liquid core, is still missing. In this work, the inviscid, linear inertial wave field is investigated by means of three dimensional ray tracing in spherical shell domains, having in mind possible oceanographic applications. Rays are here classically interpreted as representative of energy paths. But in contrast with previous studies, they are now launched with a non-zero initial zonal component allowing for a more realistic, localized forcing, and the development of azimuthal inhomogeneities. We find that meridional planes generally act in the shell geometry as attractors for ray trajectories. In addition, the existence of trajectories that are not subject to meridional trapping is here observed for the first time. Their dynamics was not captured by the previous purely meridional studies and unveils a new class of possible solutions for inertial motion in the spherical shell. Both observed behaviours shed some new light on possible mechanisms of energy localization, a key process that still deserves further investigation in our ocean, as well as in other stratified, rotating media.

Submitted 11 February, 2014; originally announced February 2014.

Journal ref: Journal of Fluid Mechanics 729 (July 24): 445-470, 2013

arXiv:1310.3099 [pdf, ps, other]

A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition

Authors: Roland Maas, Christian Huemmer, Armin Sehr, Walter Kellermann

Abstract: This article provides a unifying Bayesian network view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules leading to a unified view on known derivations as well as to new formulations for certain approaches. The generic Bayesian perspective provided in this contribution thus highlights structural differences and similarities between the analyzed approaches.

Submitted 22 September, 2014; v1 submitted 11 October, 2013; originally announced October 2013.

arXiv:1206.3326 [pdf, ps, other]

doi 10.1063/1.4731802

Inertial waves and modes excited by the libration of a rotating cube

Authors: J. Boisson, C. Lamriben, L. R. M. Maas, P. -P. Cortet, F. Moisy

Abstract: We report experimental measurements of the flow in a cubic container submitted to a longitudinal libration, i.e. a rotation modulated in time. Velocity fields in a vertical and a horizontal plane are measured in the librating frame using a corotating particle image velocimetry system. When the libration frequency $σ_0$ is smaller than twice the mean rotation rate $Ω_0$, inertial waves can propagate in the interior of the fluid. At arbitrary excitation frequencies $σ_0<2Ω_0$, the oscillating flow shows two contributions: (i) a basic flow induced by the libration motion, and (ii) inertial wave beams propagating obliquely upward and downward from the horizontal edges of the cube. In addition to these two contributions, inertial modes may also be excited at some specific resonant frequencies. We characterize in particular the resonance of the mode of lowest order compatible with the symmetries of the forcing, noted [2,1,+]. By comparing the measured flow fields to the expected inviscid inertial modes computed numerically [L.R.M. Maas, Fluid Dyn. Res. 33, 373 (2003)], we show that only a subset of inertial modes, matching the symmetries of the forcing, can be excited by the libration.

Submitted 14 June, 2012; originally announced June 2012.

Comments: Phys. Fluids (in press)

Journal ref: Physics of Fluids, 24, 076602 (2012)

arXiv:1010.1046 [pdf, other]

Attractive internal wave patterns

Authors: Jeroen Hazewinkel, Leo R. M. Maas, Stuart B. Dalziel

Abstract: This paper gives background information for the fluid dynamics video on internal wave motion in a trapezoidal tank.

Submitted 5 October, 2010; originally announced October 2010.

Comments: 2 pg, movie at two resolutions _low(Low-resolution) and _hr(High-resolution)

arXiv:1009.0629 [pdf, other]

doi 10.1063/1.3540660

Excitation of inertial modes in a closed grid turbulence experiment under rotation

Authors: Cyril Lamriben, Pierre-Philippe Cortet, Frédéric Moisy, Leo R. M. Maas

Abstract: We report an experimental study of the decay of grid-generated turbulence in a confined geometry submitted to a global rotation. Turbulence is generated by rapidly towing a grid in a parallelepipedic water tank. The velocity fields of a large number of independent decays are measured in a vertical plane parallel to the rotation axis using a corotating Particle Image Velocimetry system. We first show that, when a "simple" grid is used, a significant amount of the kinetic energy (typically 50%) is stored in a reproducible flow composed of resonant inertial modes. The spatial structure of those inertial modes, extracted by band-pass filtering, is found compatible with the numerical results of Maas [Fluid Dyn. Res. 33, 373 (2003)]. The possible coupling between these modes and turbulence suggests that turbulence cannot be considered as freely decaying in this configuration. Finally, we demonstrate that these inertial modes may be significantly reduced (down to 15% of the total energy) by adding a set of inner tanks attached to the grid. This suggests that it is possible to produce an effectively freely decaying rotating turbulence in a confined geometry.

Submitted 3 September, 2010; originally announced September 2010.

Journal ref: Physics of Fluids, 23, 015102 (2011)

arXiv:0806.3434 [pdf, ps, other]

doi 10.1086/591120

The Role of the Radial Orbit Instability in Dark Matter Halo Formation and Structure

Authors: Jillian M. Bellovary, Julianne J. Dalcanton, Arif Babul, Thomas R. Quinn, Ryan W. Maas, Crystal G. Austin, Liliya L. R. Williams, Eric I. Barnes

Abstract: For a decade, N-body simulations have revealed a nearly universal dark matter density profile, which appears to be robust to changes in the overall density of the universe and the underlying power spectrum. Despite its universality, the physical origin of this profile has not yet been well understood. Semi-analytic models by Barnes et al. (2005) have suggested that the density structure of dark matter halos is determined by the onset of the radial orbit instability (ROI). We have tested this hypothesis using N-body simulations of collapsing dark matter halos with a variety of initial conditions. For dynamically cold initial conditions, the resulting halo structures are triaxial in shape, due to the mild aspect of the instability. We examine how variations in initial velocity dispersion affect the onset of the instability, and find that an isotropic velocity dispersion can suppress the ROI entirely, while a purely radial dispersion does not. The quantity sigma^2/vc^2 is a criterion for instability, where regions with sigma^2/vc^2 <~1 become triaxial due to the ROI or other perturbations. We also find that the radial orbit instability sets a scale length at which the velocity dispersion changes rapidly from isotropic to radially anisotropic. This scale length is proportional to the radius at which the density profile changes shape, as is the case in the semi-analytic models; however, the coefficient of proportionality is different by a factor of ~2.5. We conclude that the radial orbit instability is likely to be a key physical mechanism responsible for the nearly universal profiles of simulated dark matter halos.

Submitted 20 June, 2008; originally announced June 2008.

Comments: 13 pages, 12 figures, accepted to ApJ

arXiv:0707.0737 [pdf, ps, other]

doi 10.1086/587977

The Causes of Halo Shape Changes Induced by Cooling Baryons: Disks Versus Substructures

Authors: Victor P. Debattista, Ben Moore, Thomas Quinn, Stelios Kazantzidis, Ryan Maas, Lucio Mayer, Justin Read, Joachim Stadel

Abstract: Cold dark matter cosmogony predicts triaxial dark matter halos, whereas observations find quite round halos. This is most likely due to the condensation of baryons leading to rounder halos. We examine the halo phase space distribution basis for such shape changes. Triaxial halos are supported by box orbits, which pass arbitrarily close to the density center. The decrease in triaxiality caused by baryons is thought to be due to the scattering of these orbits. We test this hypothesis with simulations of disks grown inside triaxial halos. After the disks are grown we check whether the phase space structure has changed by evaporating the disks and comparing the initial and final states. While the halos are substantially rounder when the disk is at full mass, their final shape after the disk is evaporated is not much different from the initial. Likewise, the halo becomes (more) radially anisotropic when the disk is grown, but the final anisotropy is consistent with the initial. Only if the baryons are unreasonably compact or massive does the halo change irreversibly. We show that the character of individual orbits is not generally changed by the growing mass. Thus the central condensation of baryons does not destroy enough box orbits to cause the shape change. Rather, box orbits merely become rounder along with the global potential. However, if angular momentum is transferred to the halo, either via satellites or via bars, a large irreversible change in the halo distribution occurs. The ability of satellites to alter the phase space distribution of the halo is of particular concern to galaxy formation simulations since halo triaxiality can profoundly influence the evolution of disks.

Submitted 12 June, 2008; v1 submitted 5 July, 2007; originally announced July 2007.

Comments: 35 pages, 13 figures (3 in color). Accepted to ApJ. This version is expanded, with new simulations included in response to referee. Conclusions remain unchanged
