Showing 1–14 of 14 results for author: Doğruöz, A S

  1. arXiv:2405.11125  [pdf, other]

    cs.CL

    A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base

    Authors: Hasti Toossi, Guo Qing Huai, Jinyu Liu, Eric Khiu, A. Seza Doğruöz, En-Shiun Annie Lee

    Abstract: In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken…

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: NAACL 2024 SRW

  2. arXiv:2404.19442  [pdf, other]

    cs.CL

    Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages

    Authors: David Ifeoluwa Adelani, A. Seza Doğruöz, Iyanuoluwa Shode, Anuoluwapo Aremu

    Abstract: Naija is the Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria and it is a mixed language (e.g., English, Portuguese and Indigenous languages). Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we prove that these two genres do not represent ea…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Working paper

  3. arXiv:2404.18286  [pdf, other]

    cs.CL

    Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages

    Authors: David Ifeoluwa Adelani, A. Seza Doğruöz, André Coneglian, Atul Kr. Ojha

    Abstract: Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmericasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part…

    Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

    Comments: Accepted to the Americas NLP Workshop at NAACL 2024 (https://turing.iimas.unam.mx/americasnlp/2024_workshop.html)

  4. arXiv:2403.16668  [pdf, other]

    cs.CL cs.SI

    Who is bragging more online? A large scale analysis of bragging in social media

    Authors: Mali Jin, Daniel Preoţiuc-Pietro, A. Seza Doğruöz, Nikolaos Aletras

    Abstract: Bragging is the act of uttering statements that are likely to be positively viewed by others and it is extensively employed in human communication with the aim to build a positive self-image of oneself. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  5. arXiv:2402.02633  [pdf, other]

    cs.CL cs.LG

    Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity

    Authors: Eric Khiu, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Armando Parra Flores, Leandro Acros Roman, A. Seza Doğruöz, En-Shiun Annie Lee

    Abstract: Fine-tuning and testing a multilingual large language model is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the si…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: 13 pages, 5 figures, accepted to EACL 2024, findings

  6. arXiv:2310.20470  [pdf, other]

    cs.CL

    Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

    Authors: A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong

    Abstract: Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study about the ex…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted for EMNLP'23 Findings (to appear on EMNLP'23 Proceedings)

  7. Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective

    Authors: Mohammad Arvan, A. Seza Doğruöz, Natalie Parde

    Abstract: Reproducibility is a key aspect for scientific advancement across disciplines, and reducing barriers for open science is a focus area for the theme of Interspeech 2023. Availability of source code is one of the indicators that facilitates reproducibility. However, less is known about the rates of reproducibility at Interspeech conferences in comparison to other conferences in the field. In order t…

    Submitted 29 August, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

  8. arXiv:2306.01584  [pdf, other]

    cs.CL

    Learning from Partially Annotated Data: Example-aware Creation of Gap-filling Exercises for Language Learning

    Authors: Semere Kiros Bitew, Johannes Deleu, A. Seza Doğruöz, Chris Develder, Thomas Demeester

    Abstract: Since performing exercises (including, e.g., practice tests) forms a crucial component of learning, and creating such exercises requires non-trivial effort from the teacher, there is a great value in automatic exercise generation in digital tools in education. In this paper, we particularly focus on automatic creation of gapfilling exercises for language learning, specifically grammar exercises. S…

    Submitted 15 June, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: 12 pages, Accepted in the 18th Workshop on Innovative Use of NLP for Building Educational Applications

  9. arXiv:2303.11708  [pdf, other]

    cs.CL

    The Open-domain Paradox for Chatbots: Common Ground as the Basis for Human-like Dialogue

    Authors: Gabriel Skantze, A. Seza Doğruöz

    Abstract: There is a surge in interest in the development of open-domain chatbots, driven by the recent advancements of large language models. The "openness" of the dialogue is expected to be maximized by providing minimal information to the users about the common ground they can expect, including the presumed joint activity. However, evidence suggests that the effect is the opposite. Asking users to "just…

    Submitted 28 July, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: Accepted at SIGDIAL 2023

  10. arXiv:2301.01967  [pdf, ps, other]

    cs.CL

    A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies

    Authors: A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock, Almeida Jacqueline Toribio

    Abstract: The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research focuses mainly on the improvement of computational methods and largely ignores linguistic and social aspects of C-S discussed across a wide range of languages within the long-established literature in linguistics. To fill this gap, we o…

    Submitted 5 January, 2023; originally announced January 2023.

    Journal ref: In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)

  11. arXiv:2211.13560  [pdf, ps, other]

    cs.CL

    How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation

    Authors: A. Seza Doğruöz, Gabriel Skantze

    Abstract: Open-domain chatbots are supposed to converse freely with humans without being restricted to a topic, task or domain. However, the boundaries and/or contents of open-domain conversations are not clear. To clarify the boundaries of "openness", we conduct two studies: First, we classify the types of "speech events" encountered in a chatbot evaluation data set (i.e., Meena by Google) and find that th…

    Submitted 24 November, 2022; originally announced November 2022.

    Journal ref: In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2021), pages 392-402, Singapore

  12. Resources for Turkish Natural Language Processing: A critical survey

    Authors: Çağrı Çöltekin, A. Seza Doğruöz, Özlem Çetinoğlu

    Abstract: This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turk…

    Submitted 25 February, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Published in Language Resources and Evaluation

  13. arXiv:2203.05840  [pdf, other]

    cs.CL

    Automatic Identification and Classification of Bragging in Social Media

    Authors: Mali Jin, Daniel Preoţiuc-Pietro, A. Seza Doğruöz, Nikolaos Aletras

    Abstract: Bragging is a speech act employed with the goal of constructing a favorable self-image through positive statements about oneself. It is widespread in daily communication and especially popular in social media, where users aim to build a positive image of their persona directly or indirectly. In this paper, we present the first large scale study of bragging in computational linguistics, building on…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022

  14. arXiv:1508.07544  [pdf, ps, other]

    cs.CL

    Computational Sociolinguistics: A Survey

    Authors: Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé, Franciska de Jong

    Abstract: Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL rese…

    Submitted 6 April, 2016; v1 submitted 30 August, 2015; originally announced August 2015.

    Comments: To appear in Computational Linguistics. Accepted for publication: 18th February, 2016