
Showing 1–3 of 3 results for author: Schmidt, C W

  1. arXiv:2403.01289  [pdf, other]

    cs.CL

    Greed is All You Need: An Evaluation of Tokenizer Inference Methods

    Authors: Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

    Abstract: While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary siz…

    Submitted 31 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: ACL 2024 (main)

  2. arXiv:2402.18376  [pdf, other]

    cs.CL cs.AI

    Tokenization Is More Than Compression

    Authors: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

    Abstract: Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer…

    Submitted 28 February, 2024; originally announced February 2024.

    MSC Class: 68T50; ACM Class: I.2.7

  3. arXiv:1902.09875  [pdf, other]

    cs.CL

    Improving a tf-idf weighted document vector embedding

    Authors: Craig W. Schmidt

    Abstract: We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum and that are optimal in the sense that they maximize a particular weighted cosine similarity measure. We consider several weighting functions, including inverse document f…

    Submitted 26 February, 2019; originally announced February 2019.