Showing 1–28 of 28 results for author: Merrill, W

Searching in archive cs.
  1. arXiv:2406.13069  [pdf, other]

    cs.CL cs.AI

    Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG

    Authors: William Merrill, Noah A. Smith, Yanai Elazar

    Abstract: How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily…

    Submitted 25 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: 8 page preprint + appendix. Minor fixes and appendix changes June 25, 2024

  2. arXiv:2404.15758  [pdf, other]

    cs.CL cs.AI

    Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

    Authors: Jacob Pfau, William Merrill, Samuel R. Bowman

    Abstract: Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 17 pages, 10 figures

    ACM Class: I.2.6

  3. arXiv:2404.08819  [pdf, other]

    cs.LG cs.CC cs.CL cs.FL

    The Illusion of State in State-Space Models

    Authors: William Merrill, Jackson Petty, Ashish Sabharwal

    Abstract: State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their cl…

    Submitted 4 June, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: To appear at ICML 2024. 9 pages + appendices

  4. arXiv:2402.13956  [pdf, other]

    cs.CL

    Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment

    Authors: William Merrill, Zhaofeng Wu, Norihito Naka, Yoon Kim, Tal Linzen

    Abstract: Do LMs infer the semantics of text from co-occurrence patterns in their training data? Merrill et al. (2022) argue that, in theory, probabilities predicted by an optimal LM encode semantic information about entailment relations, but it is unclear whether neural LMs trained on corpora learn entailment in this way because of strong idealizing assumptions made by Merrill et al. In this work, we inves…

    Submitted 29 February, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: Preprint

  5. arXiv:2402.00838  [pdf, other]

    cs.CL

    OLMo: Accelerating the Science of Language Models

    Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, et al. (18 additional authors not shown)

    Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models…

    Submitted 7 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  6. arXiv:2311.00208  [pdf, other]

    cs.LG cs.CL cs.FL cs.LO

    What Formal Languages Can Transformers Express? A Survey

    Authors: Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin

    Abstract: As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work…

    Submitted 6 May, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

  7. arXiv:2310.07923  [pdf, ps, other]

    cs.LG cs.CC cs.CL cs.LO

    The Expressive Power of Transformers with Chain of Thought

    Authors: William Merrill, Ashish Sabharwal

    Abstract: Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad",…

    Submitted 11 April, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: 9-page preprint. ICLR camera ready posted April 11

  8. arXiv:2305.13534  [pdf, other]

    cs.CL

    How Language Model Hallucinations Can Snowball

    Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith

    Abstract: A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where C…

    Submitted 22 May, 2023; originally announced May 2023.

  9. arXiv:2303.11873  [pdf, other]

    cs.LG

    A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks

    Authors: William Merrill, Nikolaos Tsilivis, Aman Shukla

    Abstract: Grokking is a phenomenon where a model trained on an algorithmic task first overfits but, then, after a large amount of additional training, undergoes a phase transition to generalize perfectly. We empirically study the internal structure of networks undergoing grokking on the sparse parity task, and find that the grokking phase transition corresponds to the emergence of a sparse subnetwork that d…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: Published at the Workshop on Understanding Foundation Models at ICLR 2023

  10. arXiv:2210.07468  [pdf, other]

    cs.CL

    Transparency Helps Reveal When Language Models Learn Meaning

    Authors: Zhaofeng Wu, William Merrill, Hao Peng, Iz Beltagy, Noah A. Smith

    Abstract: Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked…

    Submitted 4 March, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2023. Author's final version (pre-MIT Press publication)

  11. arXiv:2210.02671  [pdf, other]

    cs.LG cs.CC

    A Logic for Expressing Log-Precision Transformers

    Authors: William Merrill, Ashish Sabharwal

    Abstract: One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a sin…

    Submitted 6 November, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: May 24, 2023: Restructured version of old preprint. Oct 12, 2023: To appear at NeurIPS

  12. arXiv:2209.12407  [pdf, other]

    cs.CL

    Entailment Semantics Can Be Extracted from an Ideal Language Model

    Authors: William Merrill, Alex Warstadt, Tal Linzen

    Abstract: Language models are often trained on text alone, without additional grounding. There is debate as to how much of natural language semantics can be inferred from such a procedure. We prove that entailment judgments between sentences can be extracted from an ideal language model that has perfectly learned its target distribution, assuming the training sentences are generated by Gricean agents, i.e.,…

    Submitted 8 January, 2024; v1 submitted 26 September, 2022; originally announced September 2022.

    Comments: Accepted at CONLL 2022. Updated Dec 4, 2023 and Jan 8, 2024 with erratum

  13. arXiv:2207.00729  [pdf, other]

    cs.CC cs.CL

    The Parallelism Tradeoff: Limitations of Log-Precision Transformers

    Authors: William Merrill, Ashish Sabharwal

    Abstract: Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This…

    Submitted 26 April, 2023; v1 submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted at TACL. Formerly entitled "Log-Precision Transformers are Constant-Depth Threshold Circuits". Updated with minor corrections in Section 2 (Implications) on March 6, 2023. Update with minor edits to the proof of Lemma 3 on April 26, 2023

  14. arXiv:2204.05991  [pdf, other]

    cs.CV cs.CL

    ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

    Authors: Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

    Abstract: Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP,…

    Submitted 2 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: ACL 2022

  15. arXiv:2201.12451  [pdf, other]

    cs.LG

    Extracting Finite Automata from RNNs Using State Merging

    Authors: William Merrill, Nikolaos Tsilivis

    Abstract: One way to interpret the behavior of a blackbox recurrent neural network (RNN) is to extract from it a more interpretable discrete computational model, like a finite state machine, that captures its behavior. In this work, we propose a new method for extracting finite automata from RNNs inspired by the state merging paradigm from grammatical inference. We demonstrate the effectiveness of our metho…

    Submitted 13 April, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Preprint

  16. arXiv:2106.16213  [pdf, other]

    cs.CL cs.CC cs.LG

    Saturated Transformers are Constant-Depth Threshold Circuits

    Authors: William Merrill, Ashish Sabharwal, Noah A. Smith

    Abstract: Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al. 2021). However, hard attention is a strong assumption, which may…

    Submitted 10 April, 2022; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: To appear in TACL

  17. arXiv:2104.10809  [pdf, other]

    cs.CL

    Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?

    Authors: William Merrill, Yoav Goldberg, Roy Schwartz, Noah A. Smith

    Abstract: Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever ``understand'' raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of ``assertions'': textual contexts…

    Submitted 22 June, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

    Comments: Updated 06/22/21 with substantive changes. Accepted at TACL; pre-MIT Press publication version

  18. arXiv:2104.08646  [pdf, other]

    cs.CL

    Competency Problems: On Finding and Removing Artifacts in Language Data

    Authors: Matt Gardner, William Merrill, Jesse Dodge, Matthew E. Peters, Alexis Ross, Sameer Singh, Noah A. Smith

    Abstract: Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a…

    Submitted 28 December, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021. This version fixes an error in Proposition 1 and adds discussion (the EMNLP camera ready version is unfixed) (and v3 adds the acknowledgements that we forgot to put into v2)

  19. arXiv:2102.10094  [pdf, other]

    cs.CL

    Formal Language Theory Meets Modern NLP

    Authors: William Merrill

    Abstract: NLP is deeply intertwined with the formal study of language, both conceptually and historically. Arguably, this connection goes all the way back to Chomsky's Syntactic Structures in 1957. It also still holds true today, with a strand of recent works building formal analysis of modern neural networks methods in terms of formal languages. In this document, I aim to explain background about formal la…

    Submitted 26 July, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

    Comments: 24 pages, tutorial document. Updated based on feedback received

  20. arXiv:2010.09697  [pdf, other]

    cs.LG cs.CL

    Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

    Authors: William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith

    Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the…

    Submitted 7 March, 2023; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: Appeared at EMNLP 2021. March 7, 2023: Removed irreproducible numbers reported in a footnote with erratum note

  21. arXiv:2004.10706  [pdf, other]

    cs.DL cs.CL

    CORD-19: The COVID-19 Open Research Dataset

    Authors: Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex Wade, Kuansan Wang, Nancy Xin Ru Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, et al. (3 additional authors not shown)

    Abstract: The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the b…

    Submitted 10 July, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: ACL NLP-COVID Workshop 2020

  22. arXiv:2004.08500  [pdf, other]

    cs.CL cs.FL

    A Formal Hierarchy of RNN Architectures

    Authors: William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, Eran Yahav

    Abstract: We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which measures the RNN's memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, wh…

    Submitted 19 September, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

    Comments: To appear at ACL 2020. Updated to include computational cost estimates and updated experimental results (in an erratum appendix)

  23. arXiv:2004.06866  [pdf, ps, other]

    cs.CL cs.FL

    On the Linguistic Capacity of Real-Time Counter Automata

    Authors: William Merrill

    Abstract: Counter machines have achieved a newfound relevance to the field of natural language processing (NLP): recent work suggests some strong-performing recurrent neural networks utilize their memory as counters. Thus, one potential way to understand the success of these networks is to revisit the theory of counter computation. Therefore, we study the abilities of real-time counter machines as formal gr…

    Submitted 9 September, 2021; v1 submitted 14 April, 2020; originally announced April 2020.

    Comments: Updated to fix a minor typo in the semilinearity proof

  24. arXiv:1906.01661  [pdf, other]

    cs.CL

    Detecting Syntactic Change Using a Neural Part-of-Speech Tagger

    Authors: William Merrill, Gigi Felice Stark, Robert Frank

    Abstract: We train a diachronic long short-term memory (LSTM) part-of-speech tagger on a large corpus of American English from the 19th, 20th, and 21st centuries. We analyze the tagger's ability to implicitly learn temporal structure between years, and the extent to which this knowledge can be transferred to date new sentences. The learned year embeddings show a strong linear correlation between their first…

    Submitted 9 July, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

    Comments: To appear in the proceedings of the Computational Approaches to Historical Language Change workshop at ACL 2019

  25. arXiv:1906.01615  [pdf, other]

    cs.CL cs.FL cs.LG

    Sequential Neural Networks as Automata

    Authors: William Merrill

    Abstract: This work attempts to explain the types of computation that neural networks can perform by relating them to automata. We first define what it means for a real-time network with bounded precision to accept a language. A measure of network memory follows from this definition. We then characterize the classes of languages acceptable by various recurrent networks, attention, and convolutional networks…

    Submitted 4 January, 2021; v1 submitted 4 June, 2019; originally announced June 2019.

    Comments: To appear in the proceedings of the Deep Learning and Formal Languages workshop at ACL 2019

  26. arXiv:1906.01594  [pdf, other]

    cs.CL cs.LG cs.NE

    Finding Syntactic Representations in Neural Stacks

    Authors: William Merrill, Lenny Khazan, Noah Amsel, Yiding Hao, Simon Mendelsohn, Robert Frank

    Abstract: Neural network architectures have been augmented with differentiable stacks in order to introduce a bias toward learning hierarchy-sensitive regularities. It has, however, proven difficult to assess the degree to which such a bias is effective, as the operation of the differentiable stack is not always interpretable. In this paper, we attempt to detect the presence of latent representations of hie…

    Submitted 4 June, 2019; originally announced June 2019.

    Comments: To appear in the Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

  27. arXiv:1809.02836  [pdf, other]

    cs.NE cs.CL cs.LG

    Context-Free Transductions with Neural Stacks

    Authors: Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, Simon Mendelsohn

    Abstract: This paper analyzes the behavior of stack-augmented recurrent neural network (RNN) models. Due to the architectural similarity between stack RNNs and pushdown transducers, we train stack RNN models on a number of tasks, including string reversal, context-free language modelling, and cumulative XOR evaluation. Examining the behavior of our networks, we show that stack-augmented RNNs can discover in…

    Submitted 8 September, 2018; originally announced September 2018.

    Comments: To appear in the proceedings of the Analyzing and Interpreting Neural Networks for NLP workshop at EMNLP 2018

  28. arXiv:1804.06610  [pdf, other]

    cs.CL

    End-to-end Graph-based TAG Parsing with Neural Networks

    Authors: Jungo Kasai, Robert Frank, Pauli Xu, William Merrill, Owen Rambow

    Abstract: We present a graph-based Tree Adjoining Grammar (TAG) parser that uses BiLSTMs, highway connections, and character-level CNNs. Our best end-to-end parser, which jointly performs supertagging, POS tagging, and parsing, outperforms the previously reported best results by more than 2.2 LAS and UAS points. The graph-based parsing architecture allows for global inference and rich feature representation…

    Submitted 27 April, 2018; v1 submitted 18 April, 2018; originally announced April 2018.

    Comments: NAACL 2018