Showing 1–11 of 11 results for author: Blasi, D

Searching in archive cs.
  1. arXiv:2110.06733  [pdf, other]

    cs.CL

    Systematic Inequalities in Language Technology Performance across the World's Languages

    Authors: Damián Blasi, Antonios Anastasopoulos, Graham Neubig

    Abstract: Natural language processing (NLP) systems have become a central technology in communication, education, medicine, artificial intelligence, and many other domains of research and development. While the performance of NLP methods has grown enormously over the last decade, this progress has been restricted to a minuscule subset of the world's 6,500 languages. We introduce a framework for estimating t…

    Submitted 13 October, 2021; originally announced October 2021.

  2. arXiv:2109.15000  [pdf, other]

    cs.CL

    A surprisal--duration trade-off across and within the world's languages

    Authors: Tiago Pimentel, Clara Meister, Elizabeth Salesky, Simone Teufel, Damián Blasi, Ryan Cotterell

    Abstract: While there exist scores of natural languages, each with its unique features and idiosyncrasies, they all share a unifying theme: enabling human communication. We may thus reasonably predict that human cognition shapes how these languages evolve and are used. Assuming that the capacity to process information is roughly constant across human populations, we expect a surprisal--duration trade-off to…

    Submitted 30 September, 2021; originally announced September 2021.

    Comments: Accepted for publication in EMNLP 2021. Code available in https://github.com/rycolab/surprisal-duration-tradeoff

  3. arXiv:2106.02289  [pdf, other]

    cs.CL

    Modeling the Unigram Distribution

    Authors: Irene Nikkarinen, Tiago Pimentel, Damián E. Blasi, Ryan Cotterell

    Abstract: The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (oov) word form. As a result, it produces negatively biased pro…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: Irene Nikkarinen and Tiago Pimentel contributed equally to this work. Accepted to the findings of ACL 2021. Code available in https://github.com/irenenikk/modelling-unigram

  4. arXiv:2106.00877  [pdf, other]

    cs.CL

    Evaluating Word Embeddings with Categorical Modularity

    Authors: Sílvia Casacuberta, Karina Halevy, Damián E. Blasi

    Abstract: We introduce categorical modularity, a novel low-resource intrinsic metric to evaluate word embedding quality. Categorical modularity is a graph modularity metric based on the $k$-nearest neighbor graph constructed with embedding vectors of words from a fixed set of semantic categories, in which the goal is to measure the proportion of words that have nearest neighbors within the same categories.…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: Accepted to Findings of ACL 2021 (Long Paper)

  5. arXiv:2104.14279  [pdf, other]

    cs.CL

    How (Non-)Optimal is the Lexicon?

    Authors: Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, Damián Blasi

    Abstract: The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world's languages. Despite their importance in shaping lexical s…

    Submitted 30 April, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: Tiago Pimentel and Irene Nikkarinen contributed equally to this work. Accepted at NAACL 2021. This is the camera ready version

  6. arXiv:2104.06474  [pdf, other]

    cs.CL

    On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach

    Authors: Francisco Valentini, Germán Rosati, Damián Blasi, Diego Fernandez Slezak, Edgar Altszyler

    Abstract: In recent years, word embeddings have been widely used to measure biases in texts. Even if they have proven to be effective in detecting a wide variety of biases, metrics based on word embeddings lack transparency and interpretability. We analyze an alternative PMI-based metric to quantify biases in texts. It can be expressed as a function of conditional probabilities, which provides a simple inte…

    Submitted 18 July, 2023; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: Camera Ready for ACL 2023 (main conference)

  7. arXiv:2104.06325  [pdf, other]

    cs.CL

    Finding Concept-specific Biases in Form--Meaning Associations

    Authors: Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi

    Abstract: This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for "tongue" is more likely than chance to contain the phone [l]. By controlling for the influence of language fami…

    Submitted 29 April, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL 2021. This is the camera ready version. Code is available in https://github.com/rycolab/form-meaning-associations

  8. arXiv:2010.02172  [pdf, other]

    cs.CL

    Speakers Fill Lexical Semantic Gaps with Context

    Authors: Tiago Pimentel, Rowan Hall Maudslay, Damián Blasi, Ryan Cotterell

    Abstract: Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear -- resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word…

    Submitted 28 May, 2024; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Camera ready version of EMNLP 2020 publication. Code is available in https://github.com/tpimentelms/lexical-ambiguity-in-context

  9. arXiv:2005.01204  [pdf, other]

    cs.CL

    On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs

    Authors: Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach

    Abstract: We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. For all six languages, we find that there is a statistically significant relationship. We also find that there are statistically significant relat…

    Submitted 3 May, 2020; originally announced May 2020.

    Comments: 17 pages, 6 figures, 4 tables, TACL(a) final submission

  10. arXiv:1910.13497  [pdf, other]

    cs.CL

    Quantifying the Semantic Core of Gender Systems

    Authors: Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach

    Abstract: Many of the world's languages employ grammatical gender on the lexeme. For example, in Spanish, the word for 'house' (casa) is feminine, whereas the word for 'paper' (papel) is masculine. To a speaker of a genderless language, this assignment seems to exist with neither rhyme nor reason. But is the assignment of inanimate nouns to grammatical genders truly arbitrary? We present the first large-sca…

    Submitted 29 October, 2019; originally announced October 2019.

    Comments: 6 pages, 2 figures, accepted to EMNLP 2019

  11. arXiv:1906.05906  [pdf, other]

    cs.CL

    Meaning to Form: Measuring Systematicity as Information

    Authors: Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell

    Abstract: A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram \textit{gl} have any systematic relationship to the meaning of words like \textit{glisten}, \textit{gleam} and \textit{gl…

    Submitted 26 July, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Accepted for publication at ACL 2019