Showing 1–2 of 2 results for author: Herve, N

Search v0.5.6 released 2020-02-24

arXiv:2207.01893

cs.CL

ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks

Authors: Valentin Pelloin, Franck Dary, Nicolas Herve, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier

Abstract: We aim at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training an LM from scratch. New models (FlauBERT-Oral) are shared with the community and evaluated for 3 downstream tasks: spoken language understanding, classification of TV shows and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be used to build spoken language models.

Submitted 5 July, 2022; originally announced July 2022.

Comments: Interspeech 2022 (Camera Ready)
arXiv:2001.04139

cs.IR cs.SI

Représentations lexicales pour la détection non supervisée d'événements dans un flux de tweets : étude sur des corpus français et anglais

Authors: Béatrice Mazoyer, Nicolas Hervé, Céline Hudelot, Julia Cage

Abstract: In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising on many applications, are not very suitable for this task. We also experiment with different types of fine-tuning to improve these results on French data. Finally, we propose a detailed analysis of the results obtained, showing the superiority of tf-idf approaches for this task.

Submitted 13 January, 2020; originally announced January 2020.

Comments: in French. Extraction et Gestion des connaissances, EGC 2020, Jan 2020, Bruxelles, Belgium
