Showing 1–13 of 13 results for author: Goldman, O

  1. arXiv:2407.00402

    cs.CL cs.AI

    Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

    Authors: Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty

    Abstract: Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of "long-context", defined simply by the total length of the model's input, including - for example - Needle-in-a-Haystack tasks, book summarizatio…

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2403.06265

    cs.CL cs.AI cs.LG

    Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

    Authors: Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, Reut Tsarfaty

    Abstract: Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for downstream…

    Submitted 22 June, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

    Comments: EMNLP 2024, Findings

  3. arXiv:2311.00658

    cs.CL

    Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew

    Authors: Eylon Gueta, Omer Goldman, Reut Tsarfaty

    Abstract: Pre-trained language models (PLMs) have shown remarkable successes in acquiring a wide range of linguistic knowledge, relying solely on self-supervised training on text streams. Nevertheless, the effectiveness of this language-agnostic approach has been frequently questioned for its sub-optimal performance when applied to morphologically-rich languages (MRLs). We investigate the hypothesis that in…

    Submitted 1 November, 2023; originally announced November 2023.

  4. arXiv:2310.15905

    cs.CL cs.AI cs.LG

    Is Probing All You Need? Indicator Tasks as an Alternative to Probing Embedding Spaces

    Authors: Tal Levy, Omer Goldman, Reut Tsarfaty

    Abstract: The ability to identify and control different kinds of linguistic information encoded in vector representations of words has many use cases, especially for explainability and bias removal. This is usually done via a set of simple classification tasks, termed probes, to evaluate the information encoded in the embedding space. However, the involvement of a trainable classifier leads to entanglement…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Findings of EMNLP 2023

  5. arXiv:2310.11877

    cs.CL

    The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models

    Authors: Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, Shauli Ravfogel

    Abstract: Large language models (LLMs) have been shown to possess impressive capabilities, while also raising crucial concerns about the faithfulness of their responses. A primary issue arising in this context is the management of (un)answerable queries by LLMs, which often results in hallucinatory behavior due to overconfidence. In this paper, we explore the behavior of LLMs when presented with (un)answera…

    Submitted 12 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  6. arXiv:2306.12581

    cs.CL

    Morphological Inflection with Phonological Features

    Authors: David Guriel, Omer Goldman, Reut Tsarfaty

    Abstract: Recent years have brought great advances into solving morphological tasks, mostly due to powerful neural models applied to various tasks as (re)inflection and analysis. Yet, such morphological tasks cannot be considered solved, especially when little training data is available or when generalizing to previously unseen lemmas. This work explores effects on performance obtained through various ways…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: ACL 2023 main conference; 8 pages, 1 figure

  7. arXiv:2305.10160

    cs.CL cs.AI

    Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks

    Authors: Alon Jacovi, Avi Caciularu, Omer Goldman, Yoav Goldberg

    Abstract: Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora. For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination. Strategies such as leaderboards with hidden answers, or using test data which is guaranteed to be unseen, are expensive and become fragile…

    Submitted 18 October, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2023

  8. arXiv:2205.03608

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  9. arXiv:2203.08527

    cs.CL

    Morphological Reinflection with Multiple Arguments: An Extended Annotation schema and a Georgian Case Study

    Authors: David Guriel, Omer Goldman, Reut Tsarfaty

    Abstract: In recent years, a flurry of morphological datasets had emerged, most notably UniMorph, a multi-lingual repository of inflection tables. However, the flat structure of the current morphological annotation schema makes the treatment of some languages quirky, if not impossible, specifically in cases of polypersonal agreement, where verbs agree with multiple arguments using true affixes. In this pape…

    Submitted 20 March, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  10. arXiv:2202.12832

    cs.CL

    Morphology Without Borders: Clause-Level Morphology

    Authors: Omer Goldman, Reut Tsarfaty

    Abstract: Morphological tasks use large multi-lingual datasets that organize words into inflection tables, which then serve as training and evaluation data for various tasks. However, a closer inspection of these data reveals profound cross-linguistic inconsistencies, that arise from the lack of a clear linguistic and operational definition of what is a word, and that severely impair the universality of the…

    Submitted 19 October, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

    Comments: To appear on TACL

  11. arXiv:2108.05682

    cs.CL

    (Un)solving Morphological Inflection: Lemma Overlap Artificially Inflates Models' Performance

    Authors: Omer Goldman, David Guriel, Reut Tsarfaty

    Abstract: In the domain of Morphology, Inflection is a fundamental and important task that gained a lot of traction in recent years, mostly via SIGMORPHON's shared-tasks. With average accuracy above 0.9 over the scores of all languages, the task is considered mostly solved using relatively generic neural seq2seq models, even with little data provided. In this work, we propose to re-evaluate morphological in…

    Submitted 20 March, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

    Comments: ACL 2022

  12. arXiv:2104.08512

    cs.CL

    Minimal Supervision for Morphological Inflection

    Authors: Omer Goldman, Reut Tsarfaty

    Abstract: Neural models for the various flavours of morphological inflection tasks have proven to be extremely accurate given ample labeled data -- data that may be slow and costly to obtain. In this work we aim to overcome this annotation bottleneck by bootstrapping labeled data from a seed as little as five labeled paradigms, accompanied by a large bulk of unlabeled text. Our approach exploits diffe…

    Submitted 12 October, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021

  13. arXiv:1711.05240

    cs.CL cs.AI cs.LG

    Weakly-supervised Semantic Parsing with Abstract Examples

    Authors: Omer Goldman, Veronica Latcinnik, Udi Naveh, Amir Globerson, Jonathan Berant

    Abstract: Training semantic parsers from weak supervision (denotations) rather than strong supervision (programs) complicates training in two ways. First, a large search space of potential programs needs to be explored at training time to find a correct program. Second, spurious programs that accidentally lead to a correct denotation add noise to training. In this work we propose that in closed worlds with…

    Submitted 13 March, 2019; v1 submitted 14 November, 2017; originally announced November 2017.

    Comments: CNLVR,NLVR. Accepted to ACL 2018