Showing 1–8 of 8 results for author: Doval, Y

  1. On the performance of phonetic algorithms in microtext normalization

    Authors: Yerai Doval, Manuel Vilares, Jesús Vilares

    Abstract: User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making its processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transforming those non-standard microtexts into standard w…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted for publication in journal Expert Systems with Applications

    Journal ref: Expert Systems with Applications, Volume 113, 2018, Pages 213-222

  2. arXiv:2005.08340

    cs.CL

    Cross-Lingual Word Embeddings for Turkic Languages

    Authors: Elmurod Kuriyozov, Yerai Doval, Carlos Gómez-Rodríguez

    Abstract: There has been an increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many others. In this paper, we present the first viability study of established techniques to align monolingual embedding spaces for Turkish, Uzbek…

    Submitted 17 May, 2020; originally announced May 2020.

    Comments: Final version, published in the proceedings of LREC 2020

    MSC Class: 68T50; 91F20

    ACM Class: I.2.7

    Journal ref: Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 4047-4055

  3. arXiv:1911.10876

    cs.CL

    Towards robust word embeddings for noisy texts

    Authors: Yerai Doval, Jesús Vilares, Carlos Gómez-Rodríguez

    Abstract: Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to str…

    Submitted 30 September, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: 15 pages, 2 figures, 5 tables

  4. arXiv:1910.07221

    cs.CL

    Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddi…

    Submitted 11 November, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: 22 pages, 2 figures, 9 tables. Preprint submitted to Natural Language Engineering

    MSC Class: 68T50

  5. arXiv:1908.07742

    cs.CL

    On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings by aligning monolingual spaces have shown that accurate alignments can be obtained with little or no supervision. However, the focus has been on a particular con…

    Submitted 3 March, 2020; v1 submitted 21 August, 2019; originally announced August 2019.

    Comments: 11 pages, 2 figures, 7 tables. Camera-ready submitted to LREC 2020

  6. arXiv:1905.07358

    cs.CL cs.SI

    Learning Cross-lingual Embeddings from Twitter via Distant Supervision

    Authors: Jose Camacho-Collados, Yerai Doval, Eugenio Martínez-Cámara, Luis Espinosa-Anke, Francesco Barbieri, Steven Schockaert

    Abstract: Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that it is possible to construct such representations by aligning independently learned monolingual embedding spaces, and that accurate alignments can be obtained even without external bilingual data. In this paper we explore a research direction that has been surprising…

    Submitted 31 March, 2020; v1 submitted 17 May, 2019; originally announced May 2019.

    Comments: Accepted to ICWSM 2020. 11 pages, 1 appendix. Pre-trained embeddings available at https://github.com/pedrada88/crossembeddings-twitter

  7. Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

    Authors: Yerai Doval, Carlos Gómez-Rodríguez

    Abstract: Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting sy…

    Submitted 3 December, 2018; originally announced December 2018.

    Comments: 11 pages, 4 figures, 5 tables, accepted in Journal of the Association for Information Science and Technology

    Journal ref: Volume 69, Issue 11, 2018, 11 pages

  8. arXiv:1808.08780

    cs.CL

    Improving Cross-Lingual Word Embeddings by Meeting in the Middle

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Cross-lingual word embeddings are becoming increasingly important in multilingual NLP. Recently, it has been shown that these embeddings can be effectively learned by aligning two disjoint monolingual vector spaces through linear transformations, using no more than a small bilingual dictionary as supervision. In this work, we propose to apply an additional transformation after the initial alignmen…

    Submitted 27 August, 2018; originally announced August 2018.

    Comments: 11 pages, 4 tables, 1 figure. EMNLP 2018 camera-ready