Search | arXiv e-print repository

Control theory and splitting methods

Authors: Karine Beauchard, Adrien Laurent, Frédéric Marbach

Abstract: Our goal is to highlight some of the deep links between numerical splitting methods and control theory. We consider evolution equations of the form $\dot{x} = f_0(x) + f_1(x)$, where $f_0$ encodes a non-reversible dynamic, so that one is interested in schemes only involving forward flows of $f_0$. In this context, a splitting method can be interpreted as a trajectory of the control-affine system $\dot{x}(t)=f_0(x(t))+u(t)f_1(x(t))$, associated with a control $u$ which is a finite sum of Dirac masses. The general goal is then to find a control such that the flow of $f_0 + u(t) f_1$ is as close as possible to the flow of $f_0+f_1$. Using this interpretation and classical tools from control theory, we revisit well-known results concerning numerical splitting methods, and we prove a handful of new ones, with an emphasis on splittings with additional positivity conditions on the coefficients. First, we show that there exist numerical schemes of any arbitrary order involving only forward flows of $f_0$ if one allows complex coefficients for the flows of $f_1$. Equivalently, for complex-valued controls, we prove that the Lie algebra rank condition is equivalent to the small-time local controllability of a system. Second, for real-valued coefficients, we show that the well-known order restrictions are linked with so-called "bad" Lie brackets from control theory, which are known to yield obstructions to small-time local controllability. We use our recent basis of the free Lie algebra to precisely identify the conditions under which high-order methods exist.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 35 pages

arXiv:2406.10073 [pdf, other]

Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Authors: Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard

Abstract: Transition Relevance Places are defined as the end of an utterance where the interlocutor may take the floor without interrupting the current speaker, i.e., a place where the turn is terminal. Analyzing turn terminality is useful to study the dynamics of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compared audio, text, and fusions of both approaches on a French corpus of TV and Radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also questions the problem of performance variability by analyzing the differences in results for multiple training runs with random initialization. The measured accuracy would allow the use of these models for large-scale analysis of turn-taking.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Keywords: Spoken interaction, Media, TV, Radio, Transition-Relevance Places, Turn Taking, Interruption. Accepted to InterSpeech 2024, Kos Island, Greece

arXiv:2404.17552 [pdf, other]

A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification

Authors: Rémi Uro, David Doukhan, Albert Rilliard, Laëtitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, Antoine Laurent

Abstract: This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker's age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at the French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have been found so far). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal, and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. Evaluation of the quality of the automatic processing and of the final output is provided. It shows that the automatic processing is comparable to up-to-date processes, and that the output provides high quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.

Submitted 26 April, 2024; originally announced April 2024.

Comments: Keywords: semi-automatic processing, corpus creation, diarization, speaker identification, gender-balanced, age-balanced, speaker corpus, diachrony

Journal ref: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 3271-3280, Marseille, 20-25 June 2022. European Language Resources Association (ELRA)

arXiv:2309.07478 [pdf, other]

doi 10.1109/LSP.2023.3313513

Direct Text to Speech Translation System using Acoustic Units

Authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret

Abstract: This paper proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to extract the acoustic units using a speech encoder combined with a clustering algorithm. Once units are obtained, an encoder-decoder architecture is trained to predict them. Then a vocoder generates speech from units. Our approach for direct text to speech translation was tested on the new CVSS corpus with two different text mBART models employed as initialisation. The systems presented report competitive performance for most of the language pairs evaluated. Besides, results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.

Submitted 14 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures

arXiv:2307.13012 [pdf, other]

Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

Authors: Martin Lebourdais, Théo Mariotte, Marie Tahon, Anthony Larcher, Antoine Laurent, Silvio Montresor, Sylvain Meignier, Jean-Hugh Thomas

Abstract: Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, lacking information about the generalization capacities of the systems. This paper proposes a complete and new benchmark of different VAD and OSD models, on multiple audio setups (single/multi-channel) and speech domains (e.g. media, meeting...). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that the joint training of these two tasks offers similar performances in terms of F1-score to two dedicated VAD and OSD systems while reducing the training cost. This unique architecture can also be used for single and multichannel speech processing.

Submitted 24 July, 2023; originally announced July 2023.

arXiv:2307.07984 [pdf, ps, other]

doi 10.3934/jcd.2023011

The Lie derivative and Noether's theorem on the aromatic bicomplex for the study of volume-preserving numerical integrators

Authors: Adrien Laurent

Abstract: The aromatic bicomplex is an algebraic tool based on aromatic Butcher trees and used in particular for the explicit description of volume-preserving affine-equivariant numerical integrators. The present work defines new tools inspired from variational calculus such as the Lie derivative, different concepts of symmetries, and Noether's theory in the context of aromatic forests. The approach makes it possible to draw a correspondence between aromatic volume-preserving methods and symmetries on the Euler-Lagrange complex, to write Noether's theorem in the aromatic context, and to describe the aromatic B-series of volume-preserving methods explicitly with the Lie derivative.

Submitted 28 November, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

Comments: 14 pages

MSC Class: 58E30; 58J10; 05C05; 41A58; 37M15; 58A12

arXiv:2306.00789 [pdf, other]

Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

Authors: Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass

Abstract: Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely, CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve an average gain of 18.8 and 11.9 average BLEU points on unseen medium and low-resource languages. We make similar observations on the Europarl speech translation benchmark.

Submitted 25 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.10993 [pdf, ps, other]

The universal equivariance properties of exotic aromatic B-series

Authors: Adrien Laurent, Hans Z. Munthe-Kaas

Abstract: Exotic aromatic B-series were originally introduced for the calculation of order conditions for the high order numerical integration of ergodic stochastic differential equations in $\mathbb{R}^d$ and on manifolds. We prove in this paper that exotic aromatic B-series satisfy a universal geometric property, namely that they are characterised by locality and orthogonal-equivariance. This characterisation confirms that exotic aromatic B-series are a fundamental geometric object that naturally generalises aromatic B-series and B-series, as they share similar equivariance properties. In addition, we classify with stronger equivariance properties the main subsets of the exotic aromatic B-series, in particular the exotic B-series. Along the analysis, we present a generalised definition of exotic aromatic trees, dual vector fields, and we explore the impact of degeneracies on the classification.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 25 pages

MSC Class: 15A72; 37C81; 41A58; 60H35; 65C30

arXiv:2301.10998 [pdf, ps, other]

doi 10.1017/fms.2023.63

The aromatic bicomplex for the description of divergence-free aromatic forms and volume-preserving integrators

Authors: Adrien Laurent, Robert I. McLachlan, Hans Z. Munthe-Kaas, Olivier Verdier

Abstract: Aromatic B-series were introduced as an extension of standard Butcher-series for the study of volume-preserving integrators. It was proven with their help that the only volume-preserving B-series method is the exact flow of the differential equation. The question was raised whether there exists a volume-preserving integrator that can be expanded as an aromatic B-series. In this work, we introduce a new algebraic tool, called the aromatic bicomplex, similar to the variational bicomplex in variational calculus. We prove the exactness of this bicomplex and use it to describe explicitly the key object in the study of volume-preserving integrators: the aromatic forms of vanishing divergence. The analysis provides us with a handful of new tools to study aromatic B-series, gives insights on the process of integration by parts of trees, and allows us to describe explicitly the aromatic B-series of a volume-preserving integrator. In particular, we conclude that an aromatic Runge-Kutta method cannot preserve volume.

Submitted 26 January, 2023; originally announced January 2023.

Comments: 41 pages

MSC Class: 65L06; 41A58; 58J10; 58A12; 37M15; 05C05

Journal ref: Forum of Mathematics Sigma 11 (2023), E69

arXiv:2211.07940 [pdf, other]

doi 10.1016/j.swevo.2022.101205

A Metaheuristic Approach for Mining Gradual Patterns

Authors: Dickson Odhiambo Owuor, Thomas Runkler, Anne Laurent

Abstract: Swarm intelligence is a discipline that studies the collective behavior that is produced by local interactions of a group of individuals with each other and with their environment. In the Computer Science domain, numerous swarm intelligence techniques are applied to optimization problems that seek to efficiently find the best solutions within a search space. Gradual pattern mining is another Computer Science field that could benefit from the efficiency of swarm-based optimization techniques in the task of finding gradual patterns from a huge search space. A gradual pattern is a rule-based correlation that describes the gradual relationship among the attributes of a data set. For example, given attributes {G,H} of a data set, a gradual pattern may take the form: "the less G, the more H". In this paper, we propose a numeric encoding for gradual pattern candidates that we use to define an effective search space. In addition, we present a systematic study of several meta-heuristic optimization techniques as efficient solutions to the problem of finding gradual patterns using our search space.

Submitted 15 November, 2022; originally announced November 2022.

Comments: 42 pages

arXiv:2211.07795 [pdf, other]

On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration

Authors: Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

Abstract: Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data. The process is iterative, where the student becomes the teacher for the next DUST iteration. A crucial step that precedes the student model training in each DUST iteration is filtering out noisy PLs that could lead the student model astray. In DUST, we proposed a simple, effective, and theoretically sound PL filtering strategy based on the teacher model's uncertainty about its predictions on unlabeled speech utterances. We estimate the model's uncertainty by computing disagreement amongst multiple samples drawn from the teacher model during inference by injecting noise via dropout. In this work, we show that DUST's PL filtering, as initially used, may fail under severe source and target domain mismatch. We suggest several approaches to eliminate or alleviate this issue. Further, we bring insights from the research in neural network model calibration to DUST and show that a well-calibrated model correlates strongly with a positive outcome of the DUST PL filtering step.

Submitted 14 November, 2022; originally announced November 2022.

arXiv:2209.04167 [pdf, other]

Overlapped speech and gender detection with WavLM pre-trained features

Authors: Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier

Abstract: This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to speakers' gender, and to identify when at least two speakers speak at the same time. We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data, to build overlapped speech detection (OSD) and gender detection (GD) systems. In this study, we use two different corpora. The DIHARD III corpus is well adapted for the OSD task but lacks gender information. The ALLIES corpus fits the project application context. Our best OSD system is a Temporal Convolutional Network (TCN) with WavLM pre-trained features as input, which reaches a new state-of-the-art F1-score performance on DIHARD. A neural GD is trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data, and obtains an accuracy of 97.9%. This work opens new perspectives for human science researchers regarding the differences of representation between women and men in French media.

Submitted 9 September, 2022; originally announced September 2022.

Comments: Submitted and accepted to Interspeech 2022

arXiv:2208.14795 [pdf, other]

doi 10.1007/s13042-021-01390-w

Ant Colony Optimization for Mining Gradual Patterns

Authors: Dickson Odhiambo Owuor, Thomas Runkler, Anne Laurent, Joseph Orero, Edmond Menya

Abstract: Gradual pattern extraction is a field in Knowledge Discovery in Databases (KDD) that maps correlations between attributes of a data set as gradual dependencies. A gradual dependency may take the form "the more Attribute K, the less Attribute L". In this paper, we propose an ant colony optimization technique that uses a probabilistic approach to learn and extract frequent gradual patterns. Through computational experiments on real-world data sets, we compared the performance of our ant-based algorithm to an existing gradual item set extraction algorithm and found that our algorithm outperforms the latter, especially when dealing with large data sets.

Submitted 31 August, 2022; originally announced August 2022.

Comments: 35 pages, journal article

Journal ref: Int. J. Mach. Learn. & Cyber. 12, 2989--3009 (2021)

arXiv:2207.01893 [pdf, other]

ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks

Authors: Valentin Pelloin, Franck Dary, Nicolas Herve, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier

Abstract: We aim at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training an LM from scratch. New models (FlauBERT-Oral) are shared with the community and evaluated for 3 downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be used to build spoken language models.

Submitted 5 July, 2022; originally announced July 2022.

Comments: Interspeech 2022 (Camera Ready)

arXiv:2205.08180 [pdf, other]

doi 10.1109/JSTSP.2022.3192714

SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

Authors: Sameer Khurana, Antoine Laurent, James Glass

Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embedding at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embedding at the resolution of a sentence (5-10s) such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2205.01987 [pdf, ps, other]

ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks

Authors: Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, Yannick Estève

Abstract: This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR. Our results show that in our settings pipeline approaches are still very competitive, and that with the use of transfer learning, they can outperform end-to-end models for speech translation (ST). For the Tamasheq-French dataset (low-resource track), our primary submission leverages intermediate representations from a wav2vec 2.0 model trained on 234 hours of Tamasheq audio, while our contrastive model uses a French phonetic transcription of the Tamasheq audio as input in a Conformer speech translation architecture jointly trained on automatic speech recognition, ST and machine translation losses. Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning, compared to large off-the-shelf models. Results also illustrate that even approximate phonetic transcriptions can improve ST scores.

Submitted 4 May, 2022; originally announced May 2022.

Comments: IWSLT 2022 system paper

arXiv:2110.03560 [pdf, ps, other]

doi 10.1109/ICASSP43922.2022.9746276

Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0

Authors: Sameer Khurana, Antoine Laurent, James Glass

Abstract: We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a moderate-sized unlabeled speech dataset in the target language. A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves performance similar to that of the topline multilingual XLSR model, which is trained on fifty-three languages, on the target language ASR task.

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.03222 [pdf, ps, other]

doi 10.1137/21M1455188

A uniformly accurate scheme for the numerical integration of penalized Langevin dynamics

Authors: Adrien Laurent

Abstract: In molecular dynamics, penalized overdamped Langevin dynamics are used to model the motion of a set of particles that follow constraints up to a parameter $\varepsilon$. The most used schemes for simulating these dynamics are the Euler integrator in $\mathbb{R}^d$ and the constrained Euler integrator. Both have weak order one of accuracy, but work properly only in specific regimes depending on the size of the parameter $\varepsilon$. We propose in this paper a new consistent method with an accuracy independent of $\varepsilon$ for solving penalized dynamics on a manifold of any dimension. Moreover, this method converges to the constrained Euler scheme when $\varepsilon$ goes to zero. The numerical experiments confirm the theoretical findings, in the context of weak convergence and for the invariant measure, on a torus and on the orthogonal group in high dimension and high codimension.

Submitted 31 August, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: 27 pages

MSC Class: 60H35; 70H45; 37M25

Journal ref: SIAM J. Sci. Comput. 44 (2022), no. 5, A2895-C398

arXiv:2104.04045 [pdf, other]

End-to-end speaker segmentation for overlap-aware resegmentation

Authors: Hervé Bredin, Antoine Laurent

Abstract: Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. Usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection), we propose to train an end-to-end segmentation model that does it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16ms). Experiments on multiple speaker diarization datasets conclude that our model can be used with great success on both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse.

Submitted 10 June, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: Camera-ready version for Interspeech 2021 with significantly better voice activity detection, overlapped speech detection, and speaker diarization results. The code used for results reported in v1 contained a small bug that has now been fixed

arXiv:2102.01013 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413581

End2End Acoustic to Semantic Transduction

Authors: Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato De Mori, Antoine Caubrière, Yannick Estève, Sylvain Meignier

Abstract: In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, achieving an absolute 2.8 points reduction compared to the state-of-the-art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context.

Submitted 1 February, 2021; originally announced February 2021.

Comments: Accepted at IEEE ICASSP 2021

Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2006.09743 [pdf, ps, other]

doi 10.1007/s10208-021-09495-y

Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds

Authors: Adrien Laurent, Gilles Vilmart

Abstract: We derive a new methodology for the construction of high order integrators for sampling the invariant measure of ergodic stochastic differential equations with dynamics constrained on a manifold. We obtain the order conditions for sampling the invariant measure for a class of Runge-Kutta methods applied to the constrained overdamped Langevin equation. The analysis is valid for arbitrarily high order and relies on an extension of the exotic aromatic Butcher-series formalism. To illustrate the methodology, a method of order two is introduced, and numerical experiments on the sphere, the torus and the special linear group confirm the theoretical findings.

Submitted 26 January, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: 40 pages

MSC Class: 60H35; 70H45; 37M25; 65L06

Journal ref: Found. Comput. Math. 22, 649-695 (2022)

arXiv:2006.02814 [pdf, other]

CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

Authors: Sameer Khurana, Antoine Laurent, James Glass

Abstract: More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers that are bi-lingual and trained in a high resource language. It is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between the two modalities namely speech and its corresponding text translation. Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.

Submitted 5 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

arXiv:2006.02547 [pdf, other]

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Authors: Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Abstract: Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we found that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.

Submitted 8 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: Proceedings of Interspeech, 2020

arXiv:2005.08520 [pdf, other]

Robust Training of Vector Quantized Bottleneck Models

Authors: Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

Abstract: In this paper we demonstrate methods for reliable and efficient training of discrete representation using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they became viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line $k$-means clustering. We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs. We demonstrate that these can be successfully overcome by increasing the learning rate for the codebook and periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks, and significantly increase the usage of latent codewords even for large codebooks. This has practical benefit, for instance, in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.

Submitted 18 May, 2020; originally announced May 2020.

Comments: Published at IJCNN 2020

arXiv:2003.14188 [pdf, other]

doi 10.1364/OE.394011

Realization and simulation of high power holmium doped fiber laser for long-range transmission

Authors: Julien Le Gouët, François Gustave, Pierre Bourdon, Thierry Robin, Arnaud Laurent, Benoit Cadier

Abstract: We report on our realization of a high power holmium doped fiber laser, together with the validation of our numerical simulation of the laser. We first present the rare absolute measurements of the physical parameters that are mandatory to model accurately the laser-holmium interactions in our silica fiber. We then describe the realization of the clad-pumped laser, based on a triple-clad large mode area holmium (Ho) doped silica fiber. The output signal power is 90 W at 2120 nm, with an efficiency of about 50% with respect to the coupled pump power. This efficiency corresponds to the state of the art for clad-pumped Ho-doped fiber lasers in the 100 W power class. By comparing the experimental results to our simulation, we demonstrate its validity, and use it to show that the efficiency is limited, for our fiber, by the non-saturable absorption caused by pair induced quenching between adjacent holmium ions.

Submitted 31 March, 2020; originally announced March 2020.

Comments: 14 pages, 7 figures

arXiv:1909.13332 [pdf, other]

doi 10.1007/978-3-030-31372-2_4

Recent Advances in End-to-End Spoken Language Understanding

Authors: Natalia Tomashenko, Antoine Caubriere, Yannick Esteve, Antoine Laurent, Emmanuel Morin

Abstract: This work investigates spoken language understanding (SLU) systems in the scenario when the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. Two SLU tasks are considered: named entity recognition (NER) and semantic slot filling (SF). For these tasks, in order to improve the model performance, we explore various techniques including speaker adaptation, a modification of the connectionist temporal classification (CTC) training criterion, and sequential pretraining.

Submitted 29 September, 2019; originally announced September 2019.

Journal ref: Statistical Language and Speech Processing. SLSP 2019

arXiv:1906.07601 [pdf, other]

Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability

Authors: Antoine Caubrière, Natalia Tomashenko, Antoine Laurent, Emmanuel Morin, Nathalie Camelin, Yannick Estève

Abstract: We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-to-end SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR, POS tagging, lemmatizer, chunker... and other NLP tools that aim to enrich ASR outputs that feed an SLU text to concepts system. Last, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.

Submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted to the INTERSPEECH 2019 conference. Submitted on March 29, 2019 (Paper submission deadline)

arXiv:1902.01716 [pdf, ps, other]

doi 10.1137/19M1243075

Multirevolution integrators for differential equations with fast stochastic oscillations

Authors: Adrien Laurent, Gilles Vilmart

Abstract: We introduce a new methodology based on the multirevolution idea for constructing integrators for stochastic differential equations in the situation where the fast oscillations themselves are driven by a Stratonovich noise. Applications include in particular highly-oscillatory Kubo oscillators and spatial discretizations of the nonlinear Schrödinger equation with fast white noise dispersion. We construct a method of weak order two with computational cost and accuracy both independent of the stiffness of the oscillations. A geometric modification that conserves exactly quadratic invariants is also presented.

Submitted 5 October, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

Comments: 27 pages

MSC Class: 60H35; 35Q55; 34E13

Journal ref: SIAM J. Sci. Comput. 42 (2020), no. 1, A115-A139

arXiv:1805.12045 [pdf, other]

End-to-end named entity extraction from speech

Authors: Sahar Ghannay, Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin

Abstract: Named entity recognition (NER) is among the SLU tasks that usually extract semantic information from textual documents. Until now, NER from speech has been performed through a pipeline process that first runs automatic speech recognition (ASR) on the audio and then runs NER on the ASR outputs. Such an approach has some disadvantages (error propagation, a metric for tuning ASR systems that is sub-optimal with regard to the final task, reduced search space at the ASR output level...) and it is known that more integrated approaches outperform sequential ones, when they can be applied. In this paper, we present a first study of an end-to-end approach that directly extracts named entities from speech, through a single neural architecture. In this way, joint optimization is possible for both ASR and NER. Experiments are carried out on easily accessible French data, composed of data distributed in several evaluation campaigns. Experimental results show that this end-to-end approach provides better results (F-measure=0.69 on test data) than a classical pipeline approach to detect named entity categories (F-measure=0.65).

Submitted 30 May, 2018; originally announced May 2018.

Comments: Submitted to Interspeech 2018

ACM Class: I.2.7

arXiv:1707.02877 [pdf, ps, other]

doi 10.1090/mcom/3455

Exotic aromatic B-series for the study of long time integrators for a class of ergodic SDEs

Authors: Adrien Laurent, Gilles Vilmart

Abstract: We introduce a new algebraic framework based on a modification (called exotic) of aromatic Butcher-series for the systematic study of the accuracy of numerical integrators for the invariant measure of a class of ergodic stochastic differential equations (SDEs) with additive noise. The proposed analysis covers Runge-Kutta type schemes including the cases of partitioned methods and postprocessed methods. We also show that the introduced exotic aromatic B-series satisfy an isometric equivariance property.

Submitted 1 July, 2019; v1 submitted 10 July, 2017; originally announced July 2017.

Comments: 33 pages

MSC Class: 60H35; 37M25; 65L06; 41A58

Journal ref: Math. Comp. 89 (2020), 169-202

arXiv:1702.06154 [pdf, other]

Role model detection using low rank similarity matrix

Authors: Sibo Cheng, Adissa Laurent, Paul Van Dooren

Abstract: Computing meaningful clusters of nodes is crucial to analyse large networks. In this paper, we apply new clustering methods to improve the computational time. We use the properties of the adjacency matrix to obtain better role extraction. We also define a new non-recursive similarity measure and compare its results with the ones obtained with Browet's similarity measure. We will show the extraction of the different roles with a linear time complexity. Finally, we test our algorithm with real data structures and analyse the limit of our algorithm.

Submitted 28 January, 2017; originally announced February 2017.

arXiv:1701.03675 [pdf, other]

doi 10.18637/jss.v081.i03

Tutorial in Joint Modeling and Prediction: a Statistical Software for Correlated Longitudinal Outcomes, Recurrent Events and a Terminal Event

Authors: Agnieszka Król, Audrey Mauguen, Yassin Mazroui, Alexandre Laurent, Stefan Michiels, Virginie Rondeau

Abstract: Extensions in the field of joint modeling of correlated data and dynamic predictions improve the development of prognosis research. The R package frailtypack provides estimations of various joint models for longitudinal data and survival events. In particular, it fits models for recurrent events and a terminal event (frailtyPenal), models for two survival outcomes for clustered data (frailtyPenal), models for two types of recurrent events and a terminal event (multivPenal), models for a longitudinal biomarker and a terminal event (longiPenal) and models for a longitudinal biomarker, recurrent events and a terminal event (trivPenal). The estimators are obtained using a standard and penalized maximum likelihood approach, and each model function supports goodness-of-fit analyses and plots of baseline hazard functions. Finally, the package provides individual dynamic predictions of the terminal event and evaluation of predictive accuracy. This paper presents theoretical models with estimation techniques, applies the methods for predictions and illustrates the details of the frailtypack functions with examples.

Submitted 13 January, 2017; originally announced January 2017.

Comments: Journal of Statistical Software (conditionally accepted for publication)

arXiv:1502.02053 [pdf, other]

Negative refraction and tiling billiards

Authors: Diana Davis, Kelsey DiPietro, Jenny Rustad, Alexander St Laurent

Abstract: We introduce a new dynamical system that we call "tiling billiards," where trajectories refract through planar tilings. This system is motivated by a recent discovery of physical substances with negative indices of refraction. We investigate several special cases where the planar tiling is created by dividing the plane by lines, and we describe the results of computer experiments.

Submitted 20 September, 2017; v1 submitted 6 February, 2015; originally announced February 2015.

Comments: 28 pages, 25 figures

MSC Class: 37E99

Showing 1–33 of 33 results for author: Laurent, A