-
Controlling Utterance Length in NMT-based Word Segmentation with Attention
Authors:
Pierre Godard,
Laurent Besacier,
Francois Yvon
Abstract:
One of the basic tasks of computational language documentation (CLD) is to identify word boundaries in an unsegmented phonemic stream. While several unsupervised monolingual word segmentation algorithms exist in the literature, they are challenged in real-world CLD settings by the small amount of available data. A possible remedy is to take advantage of glosses or translations in a foreign, well-resourced language, which often exist for such data. In this paper, we explore and compare ways to exploit neural machine translation models to perform unsupervised boundary detection with bilingual information, notably introducing a new loss function for jointly learning alignment and segmentation. We experiment with an actual under-resourced language, Mboshi, and show that these techniques can effectively control the output segmentation length.
Submitted 18 October, 2019;
originally announced October 2019.
-
Unsupervised Word Segmentation from Speech with Attention
Authors:
Pierre Godard,
Marcely Zanon-Boito,
Lucas Ondel,
Alexandre Berard,
François Yvon,
Aline Villavicencio,
Laurent Besacier
Abstract:
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones that is segmented using neural soft-alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
Submitted 18 June, 2018;
originally announced June 2018.
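The abstract above describes segmenting a pseudo-phone sequence using soft-alignments from a translation model. The following is a minimal sketch of one common way to turn such alignments into word boundaries, not necessarily the authors' exact procedure: each phone is hard-aligned to the translation word with the largest attention weight, and a boundary is placed wherever two adjacent phones are aligned to different words. The function name `segment_from_attention` and the layout of the `attention` matrix (one row per phone, one column per translation word) are assumptions made for illustration.

```python
from typing import List, Sequence


def segment_from_attention(
    phones: Sequence[str],
    attention: Sequence[Sequence[float]],
) -> List[List[str]]:
    """Group phones into word-like segments from a soft-alignment matrix.

    Assumption (hypothetical layout): attention[i][j] is the weight between
    source phone i and translation word j.
    """
    if len(phones) != len(attention):
        raise ValueError("need exactly one attention row per phone")

    # Hard-align every phone to the translation word with the largest weight.
    aligned = [max(range(len(row)), key=lambda j: row[j]) for row in attention]

    segments: List[List[str]] = []
    previous = None
    for phone, word_index in zip(phones, aligned):
        if previous is None or word_index != previous:
            # The aligned word changed: start a new segment (a boundary).
            segments.append([phone])
        else:
            # Same aligned word as the previous phone: extend the segment.
            segments[-1].append(phone)
        previous = word_index
    return segments
```

With four phones and two translation words, for example, phones whose largest weight falls on the same word end up in the same segment, so the number of segments depends on how the attention mass is distributed across the translation.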
-
XNMT: The eXtensible Neural Machine Translation Toolkit
Authors:
Graham Neubig,
Matthias Sperber,
Xinyi Wang,
Matthieu Felix,
Austin Matthews,
Sarguna Padmanabhan,
Ye Qi,
Devendra Singh Sachan,
Philip Arthur,
Pierre Godard,
John Hewitt,
Rachid Riad,
Liming Wang
Abstract:
This paper describes XNMT, the eXtensible Neural Machine Translation toolkit. XNMT distinguishes itself from other open-source NMT toolkits by its focus on modular code design, with the purpose of enabling fast iteration in research and replicable, reliable results. In this paper we describe the design of XNMT and its experiment configuration system, and demonstrate its utility on the tasks of machine translation, speech recognition, and multi-tasked machine translation/parsing. XNMT is available open-source at https://github.com/neulab/xnmt
Submitted 28 February, 2018;
originally announced March 2018.
-
Bayesian Models for Unit Discovery on a Very Low Resource Language
Authors:
Lucas Ondel,
Pierre Godard,
Laurent Besacier,
Elin Larsen,
Mark Hasegawa-Johnson,
Odette Scharenborg,
Emmanuel Dupoux,
Lukas Burget,
François Yvon,
Sanjeev Khudanpur
Abstract:
Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show that Bayesian models can naturally integrate information from other resourceful languages by means of an informative prior, leading to more consistent discovered units. Finally, discovered acoustic units are used, either as the 1-best sequence or as a lattice, to perform word segmentation. Word segmentation results show that this Bayesian approach clearly outperforms a Segmental-DTW baseline on the same corpus.
Submitted 20 February, 2018; v1 submitted 16 February, 2018;
originally announced February 2018.
-
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
Authors:
Odette Scharenborg,
Laurent Besacier,
Alan Black,
Mark Hasegawa-Johnson,
Florian Metze,
Graham Neubig,
Sebastian Stueker,
Pierre Godard,
Markus Mueller,
Lucas Ondel,
Shruti Palaskar,
Philip Arthur,
Francesco Ciannella,
Mingxing Du,
Elin Larsen,
Danny Merkx,
Rachid Riad,
Liming Wang,
Emmanuel Dupoux
Abstract:
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
Submitted 14 February, 2018;
originally announced February 2018.
-
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Authors:
P. Godard,
G. Adda,
M. Adda-Decker,
J. Benjumea,
L. Besacier,
J. Cooper-Leavitt,
G-N. Kouarata,
L. Lamel,
H. Maynard,
M. Mueller,
A. Rialland,
S. Stueker,
F. Yvon,
M. Zanon-Boito
Abstract:
Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Submitted 15 February, 2018; v1 submitted 10 October, 2017;
originally announced October 2017.