
Showing 1–3 of 3 results for author: Uzan, O

  1. arXiv:2404.13292

    cs.CL cs.AI

    Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

    Authors: Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

    Abstract: The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework…

    Submitted 20 April, 2024; originally announced April 2024.

  2. arXiv:2403.01289

    cs.CL

    Greed is All You Need: An Evaluation of Tokenizer Inference Methods

    Authors: Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

    Abstract: While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary siz…

    Submitted 31 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: ACL 2024 (main)

  3. arXiv:2402.18376

    cs.CL cs.AI

    Tokenization Is More Than Compression

    Authors: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

    Abstract: Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer…

    Submitted 28 February, 2024; originally announced February 2024.

    MSC Class: 68T50; ACM Class: I.2.7