Search | arXiv e-print repository

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Authors: Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

Abstract: "Transcription bottlenecks", created by a shortage of effective human transcribers are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which unlike Hidden Markov Model ASR systems, es… ▽ More "Transcription bottlenecks", created by a shortage of effective human transcribers are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which unlike Hidden Markov Model ASR systems, eschews linguistic resources but is instead more dependent on large-data settings. We open source a Yoloxóchitl Mixtec EL corpus. First, we review our method in building an end-to-end ASR system in a way that would be reproducible by the ASR community. We then propose a novice transcription correction task and demonstrate how ASR systems and novice transcribers can work together to improve EL documentation. We believe this combinatory methodology would mitigate the transcription bottleneck and transcriber shortage that hinders EL documentation. △ Less

Submitted 5 March, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

Comments: Accepted by EACL2021

arXiv:1806.04291 [pdf, ps, other]

Challenges of language technologies for the indigenous languages of the Americas

Authors: Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, Ivan Meza

Abstract: Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We w… ▽ More Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We would like to encourage NLP research in linguistically rich and diverse areas like the Americas. △ Less

Submitted 11 June, 2018; originally announced June 2018.

Comments: In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)

arXiv:1710.06524 [pdf, ps, other]

Unsupervised Sentence Representations as Word Information Series: Revisiting TF--IDF

Authors: Ignacio Arroyo-Fernández, Carlos-Francisco Méndez-Cruz, Gerardo Sierra, Juan-Manuel Torres-Moreno, Grigori Sidorov

Abstract: Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sen… ▽ More Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted by using Shannon's word entropies provided by the Term Frequency--Inverse Document Frequency (TF--IDF) transform. The hyperparameters of the model can be selected according to the properties of data (e.g. sentence length and textual gender). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems. △ Less

Submitted 19 October, 2017; v1 submitted 17 October, 2017; originally announced October 2017.

arXiv:1703.03923 [pdf, other]

A German Corpus for Text Similarity Detection Tasks

Authors: Juan-Manuel Torres-Moreno, Gerardo Sierra, Peter Peinl

Abstract: Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate the algorithms to assess the paraphrase level among documents. In this paper we present a textual German corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts an… ▽ More Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate the algorithms to assess the paraphrase level among documents. In this paper we present a textual German corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents or for individual sentences. Therefore we have calculated several simple measures on our corpus based on a library of similarity functions. △ Less

Submitted 11 March, 2017; originally announced March 2017.

Comments: 1 figure; 13 pages

Journal ref: Preprint of International Journal of Computational Linguistics and Applications, vol. 5, no. 2, 2014, pp. 9-24

arXiv:1702.06467 [pdf, other]

Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

Authors: Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo Sierra

Abstract: In this paper we describe a dynamic normalization process applied to social network multilingual documents (Facebook and Twitter) to improve the performance of the Author profiling task for short texts. After the normalization process, $n$-grams of characters and n-grams of POS tags are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character floodi… ▽ More In this paper we describe a dynamic normalization process applied to social network multilingual documents (Facebook and Twitter) to improve the performance of the Author profiling task for short texts. After the normalization process, $n$-grams of characters and n-grams of POS tags are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Experiments with SVM showed up to 90% of performance. △ Less

Submitted 21 February, 2017; originally announced February 2017.

Comments: 8 pages, 6 figures, Conference paper

Journal ref: Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Vol 1: KDIR, 307-314, 2016, Porto, Portugal

arXiv:1501.04920 [pdf]

Regroupement sémantique de définitions en espagnol

Authors: Gerardo Sierra, Juan-Manuel Torres-Moreno, Alejandro Molina

Abstract: This article focuses on the description and evaluation of a new unsupervised learning method of clustering of definitions in Spanish according to their semantic. Textual Energy was used as a clustering measure, and we study an adaptation of the Precision and Recall to evaluate our method. This article focuses on the description and evaluation of a new unsupervised learning method of clustering of definitions in Spanish according to their semantic. Textual Energy was used as a clustering measure, and we study an adaptation of the Precision and Recall to evaluate our method. △ Less

Submitted 20 January, 2015; originally announced January 2015.

Comments: 11 pages, in French, 5 figures. Workshop Evaluation des méthodes d'Extraction de Connaissances dans les Données EvalECD EGC'10, 2010 Tunis

arXiv:1212.3493 [pdf, ps, other]

Sentence Compression in Spanish driven by Discourse Segmentation and Language Models

Authors: Alejandro Molina, Juan-Manuel Torres-Moreno, Iria da Cunha, Eric SanJuan, Gerardo Sierra

Abstract: Previous works demonstrated that Automatic Text Summarization (ATS) by sentences extraction may be improved using sentence compression. In this work we present a sentence compressions approach guided by level-sentence discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical comp… ▽ More Previous works demonstrated that Automatic Text Summarization (ATS) by sentences extraction may be improved using sentence compression. In this work we present a sentence compressions approach guided by level-sentence discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical compressed sentences. The approach is simple enough to be transposed into other languages. △ Less

Submitted 17 December, 2012; v1 submitted 14 December, 2012; originally announced December 2012.

Comments: 7 pages, 3 tables

Showing 1–7 of 7 results for author: Sierra, G