Showing 1–10 of 10 results for author: Samih, Y

  1. arXiv:2404.17342

    cs.CL cs.AI

    Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

    Authors: Teresa Lynn, Malik H. Altakrori, Samar Mohamed Magdy, Rocktim Jyoti Das, Chenyang Lyu, Mohamed Nasr, Younes Samih, Alham Fikri Aji, Preslav Nakov, Shantanu Godbole, Salim Roukos, Radu Florian, Nizar Habash

    Abstract: The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: Paper 8 pages, Appendix 12 pages. Submitted to ARR

  2. arXiv:2311.07497

    cs.CL

    Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure

    Authors: David Arps, Laura Kallmeyer, Younes Samih, Hassan Sajjad

    Abstract: We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use ca…

    Submitted 12 June, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: NAACL 2024. Our software is available at https://github.com/davidarps/spud

  3. arXiv:2204.06201

    cs.CL

    Probing for Constituency Structure in Neural Language Models

    Authors: David Arps, Younes Samih, Laura Kallmeyer, Hassan Sajjad

    Abstract: In this paper, we investigate to what extent contextual neural language models (LMs) implicitly learn syntactic structure. More concretely, we focus on constituent structure as represented in the Penn Treebank (PTB). Using standard probing techniques based on diagnostic classifiers, we assess the accuracy of representing constituents of different categories within the neuron activations of an LM s…

    Submitted 13 April, 2022; originally announced April 2022.

    Comments: 20 pages, 9 figures, 9 tables

  4. arXiv:2111.09574

    cs.CL

    Automatic Expansion and Retargeting of Arabic Offensive Language Training

    Authors: Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Younes Samih

    Abstract: Rampant use of offensive language on social media has led to recent efforts on automatic identification of such language. Though offensive language has general characteristics, attacks on specific entities may exhibit distinct phenomena such as malicious alterations in the spelling of names. In this paper, we present a method for identifying entity-specific offensive language. We employ two key insigh…

    Submitted 18 November, 2021; originally announced November 2021.

  5. arXiv:2102.10684

    cs.CL cs.AI

    Pre-Training BERT on Arabic Tweets: Practical Considerations

    Authors: Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish, Younes Samih

    Abstract: Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the effi…

    Submitted 21 February, 2021; originally announced February 2021.

    Comments: 6 pages, 5 figures

  6. arXiv:2005.06557

    cs.CL

    Arabic Dialect Identification in the Wild

    Authors: Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Sabit Hassan, Kareem Darwish

    Abstract: We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects, covering 18 different countries in the Middle East and North Africa region. Our method for building this dataset relies on applying multiple filters to identify users who belong to different countries based on their account descriptions and to eliminate tweets that are either w…

    Submitted 15 May, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

    Comments: 13 pages, 7 figures, 4 tables

  7. arXiv:2004.03485

    cs.SI cs.CL

    A Few Topical Tweets are Enough for Effective User-Level Stance Detection

    Authors: Younes Samih, Kareem Darwish

    Abstract: Stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can yield very high accuracy (+98%). However, such methods perform poorly or fail completely for less vocal users, who may have aut…

    Submitted 7 April, 2020; originally announced April 2020.

  8. arXiv:2004.02192

    cs.CL

    Arabic Offensive Language on Twitter: Analysis and Experiments

    Authors: Hamdy Mubarak, Ammar Rashed, Kareem Darwish, Younes Samih, Ahmed Abdelali

    Abstract: Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. We…

    Submitted 9 March, 2021; v1 submitted 5 April, 2020; originally announced April 2020.

    Comments: 10 pages, 6 figures, 3 tables

  9. arXiv:1810.06619

    cs.CL

    Diacritization of Maghrebi Arabic Sub-Dialects

    Authors: Ahmed Abdelali, Mohammed Attia, Younes Samih, Kareem Darwish, Hamdy Mubarak

    Abstract: The diacritization process attempts to restore the short vowels in Arabic written text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automa…

    Submitted 30 May, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

    Comments: 6 pages, 3 figures

  10. arXiv:1708.05891

    cs.CL

    Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

    Authors: Mohamed Eldesouki, Younes Samih, Ahmed Abdelali, Mohammed Attia, Hamdy Mubarak, Kareem Darwish, Laura Kallmeyer

    Abstract: Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the p…

    Submitted 19 August, 2017; originally announced August 2017.