Showing 1–1 of 1 results for author: Alameddine, A

  1. arXiv:2402.18376

    cs.CL cs.AI

    Tokenization Is More Than Compression

    Authors: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

    Abstract: Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer…

    Submitted 28 February, 2024; originally announced February 2024.

    MSC Class: 68T50; ACM Class: I.2.7