Search | arXiv e-print repository

Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?

Authors: Zhiqiang Pi, Annapurna Vadaparty, Benjamin K. Bergen, Cameron R. Jones

Abstract: Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task (Kosinski, 2023), others have argued that LLMs solve these tasks by exploiting spurious correlations -- not representing beliefs -- since they fail on trivial alterations to these tasks (Ullman, 2023). In this paper, we introduce SCALPEL: a technique to generate targeted modifications for False Belief tasks to test different specific hypotheses about why LLMs fail. We find that modifications which make explicit common inferences -- such as that looking at a transparent object implies recognizing its contents -- preserve LLMs' performance. This suggests that LLMs' failures on modified ToM tasks could result from a lack of more general commonsense reasoning, rather than a failure to represent mental states. We argue that SCALPEL could be helpful for explaining LLM successes and failures in other cases.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2405.08007 [pdf, other]

People cannot distinguish GPT-4 from a human in a Turing test

Authors: Cameron R. Jones, Benjamin K. Bergen

Abstract: We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 23 pages, 13 figures

arXiv:2404.19178 [pdf, other]

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

Authors: James A. Michaelov, Catherine Arnett, Benjamin K. Bergen

Abstract: Transformers have supplanted Recurrent Neural Networks as the dominant architecture for both natural language processing tasks and, despite criticisms of cognitive implausibility, for modelling the effect of predictability on online human language comprehension. However, two recently developed recurrent neural network architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match - and in some cases, exceed - performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2403.00686 [pdf, other]

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

Authors: Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

Abstract: How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.

Submitted 1 March, 2024; originally announced March 2024.
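
Illustrative note (not part of the listing): the byte premium described in the abstract above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' released tool. The function names, the direction of the ratio (first language over second), and the toy sentence pair are all assumptions made for illustration; the paper computes premiums over content-matched corpora rather than single sentences.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes needed to encode `text` in UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(text_a: str, text_b: str) -> float:
    """Ratio of UTF-8 bytes for content-matched text in language A versus B.

    Hypothetical helper: the ratio direction (A over B) is an assumption.
    """
    bytes_b = utf8_bytes(text_b)
    if bytes_b == 0:
        raise ValueError("text_b must not be empty")
    return utf8_bytes(text_a) / bytes_b


# Toy content-matched pair (illustrative only, not data from the paper).
english = "Hello, world"
russian = "Привет, мир"

# Cyrillic letters take 2 bytes each in UTF-8, ASCII characters take 1.
print(utf8_bytes(english), utf8_bytes(russian))
print(round(byte_premium(russian, english), 3))
```

A ratio above 1 here would mean the first text needs more bytes than the second to express the same content, which is why raw byte counts are a misleading measure of dataset size across languages.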

arXiv:2311.09205 [pdf, other]

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

Authors: Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

Abstract: Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.09194 [pdf, other]

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

Abstract: Abstract grammatical knowledge - of parts of speech and grammatical patterns - is key to the capacity for linguistic generalization in humans. But how abstract is grammatical knowledge in large language models? In the human literature, compelling evidence for grammatical abstraction comes from structural priming. A sentence that shares the same grammatical structure as a preceding sentence is processed and produced more readily. Because confounds exist when using stimuli in a single language, evidence of abstraction is even more compelling from crosslingual structural priming, where use of a syntactic structure in one language primes an analogous structure in another language. We measure crosslingual structural priming in large language models, comparing model behavior to human experimental results from eight crosslingual experiments covering six languages, and four monolingual structural priming experiments in three non-English languages. We find evidence for abstract monolingual and crosslingual grammatical representations in the models that function similarly to those found in humans. These results demonstrate that grammatical representations in multilingual language models are not only similar across languages, but they can causally influence text produced in different languages.

Submitted 15 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023

arXiv:2310.20216 [pdf, other]

Does GPT-4 pass the Turing test?

Authors: Cameron R. Jones, Benjamin K. Bergen

Abstract: We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants' decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.

Submitted 20 April, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

Comments: 28 pages, 21 figures

arXiv:2310.07929 [pdf, other]

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

Authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen

Abstract: Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Extended abstract accepted to the 3rd Multilingual Representation Learning workshop at EMNLP 2023

arXiv:2308.15419 [pdf, other]

Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

Authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen

Abstract: How do language models learn to make predictions during pre-training? To study this question, we extract learning curves from five autoregressive English language model pre-training runs, for 1M tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training. Higher n-gram probabilities further accentuate these effects. Independent of the target token, shorter and more frequent contexts correlate with marginally more stable and quickly acquired predictions. Effects of part-of-speech are also small, although nouns tend to be acquired later and less stably than verbs, adverbs, and adjectives. Our work contributes to a better understanding of language model pre-training dynamics and informs the deployment of stable language models in practice.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2305.14681 [pdf, other]

Emergent inabilities? Inverse scaling over the course of pretraining

Authors: James A. Michaelov, Benjamin K. Bergen

Abstract: Does inverse scaling only occur as a function of model size, or can it also occur over the course of training? We carry out an exploratory study investigating whether the performance of language models on specific tasks can decrease (while general performance remains high) during training on the language modeling task. We find 8 tasks on which Pythia 12B (Biderman et al., 2023) shows decreased performance over the course of training. Five of these tasks (TruthfulQA-MC1, TruthfulQA-MC2, Hindsight Neglect, Memo Trap, and Pattern Match Suppression) additionally show a consistent relationship whereby larger language models show a greater decrease in performance the more they are trained, despite showing standard (positive) scaling overall. This highlights the importance of testing performance at all relevant benchmarks any time models are trained on additional data, even if their overall performance improves.

Submitted 15 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted to Findings of EMNLP 2023

arXiv:2303.11504 [pdf, ps, other]

Language Model Behavior: A Comprehensive Survey

Authors: Tyler A. Chang, Benjamin K. Bergen

Abstract: Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about large language model capabilities, thus providing a resource for applied work and for research in adjacent fields that use language models.

Submitted 25 August, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: 32 pages, accepted to Computational Linguistics

arXiv:2301.08731 [pdf, other]

Can Peanuts Fall in Love with Distributional Semantics?

Authors: James A. Michaelov, Seana Coulson, Benjamin K. Bergen

Abstract: Context changes expectations about upcoming words - following a story involving an anthropomorphic peanut, comprehenders expect the sentence "the peanut was in love" more than "the peanut was salted", as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This updating of expectations has been explained using Situation Models - mental representations of a described event. However, recent work showing that N400 amplitude is predictable from distributional information alone raises the question whether situation models are necessary for these contextual effects. We model the results of Nieuwland and van Berkum (2006) using six computational language models and three sets of word vectors, none of which have explicit situation models or semantic grounding. We find that a subset of these can fully model the effect found by Nieuwland and van Berkum (2006). Thus, at least some processing effects normally explained through situation models may not in fact require explicit situation models.

Submitted 22 May, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

Comments: Accepted at CogSci 2023

arXiv:2212.08700 [pdf, other]

Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers

Authors: James A. Michaelov, Benjamin K. Bergen

Abstract: How well do language models deal with quantification? In this study, we focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models because the sentence components without the quantifier are likely to co-occur, and 'few'-type quantifiers are rare. We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do all the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. This inverse scaling is consistent with previous work suggesting that larger models increasingly reflect online rather than offline human processing, and we argue that the decreasing performance of larger models may challenge uses of language models as the basis for natural language systems.

Submitted 26 May, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: Accepted to Findings of ACL 2023

arXiv:2211.05198 [pdf, other]

Collateral facilitation in humans and language models

Authors: James A. Michaelov, Benjamin K. Bergen

Abstract: Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.

Submitted 9 November, 2022; originally announced November 2022.

Comments: Accepted at CoNLL 2022

arXiv:2208.14554 [pdf, other]

Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Authors: James A. Michaelov, Benjamin K. Bergen

Abstract: Some languages allow arguments to be omitted in certain contexts. Yet human language comprehenders reliably infer the intended referents of these zero pronouns, in part because they construct expectations about which referents are more likely. We ask whether Neural Language Models also extract the same expectations. We test whether 12 contemporary language models display expectations that reflect human behavior when exposed to sentences with zero pronouns from five behavioral experiments conducted in Italian by Carminati (2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the human behavior from all the experiments, with others successfully modeling some of the results. This result suggests that human expectations about coreference can be derived from exposure to language, and also indicates features of language models that allow them to better reflect human behavior.

Submitted 3 October, 2022; v1 submitted 30 August, 2022; originally announced August 2022.

Comments: Accepted at COLING 2022

arXiv:2205.10964 [pdf, other]

The Geometry of Multilingual Language Model Representations

Authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen

Abstract: We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language. Using XLM-R as a case study, we show that languages occupy similar linear subspaces after mean-centering, evaluated based on causal effects on language modeling performance and direct comparisons between subspaces for 88 languages. The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies. Shifting representations by language means is sufficient to induce token predictions in different languages. However, we also identify stable language-neutral axes that encode information such as token positions and part-of-speech. We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information. These results demonstrate that multilingual language models encode information along orthogonal language-sensitive and language-neutral axes, allowing the models to extract a variety of features for downstream tasks and cross-lingual transfer learning.

Submitted 21 October, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

Comments: Accepted to EMNLP 2022

arXiv:2110.02406 [pdf, other]

Word Acquisition in Neural Language Models

Authors: Tyler A. Chang, Benjamin K. Bergen

Abstract: We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words on the MacArthur-Bates Communicative Development Inventory (Fenson et al., 2007). Drawing on studies of word acquisition in children, we evaluate multiple predictors for words' ages of acquisition in LSTMs, BERT, and GPT-2. We find that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition. Language models rely far more on word frequency than children, but like children, they exhibit slower learning of words in longer utterances. Interestingly, models follow consistent patterns during training for both unidirectional and bidirectional models, and for both LSTM and Transformer architectures. Models predict based on unigram token frequencies early in training, before transitioning loosely to bigram probabilities, eventually converging on more nuanced predictions. These results shed light on the role of distributional learning mechanisms in children, while also providing insights for more human-like language acquisition in language models.

Submitted 5 October, 2021; originally announced October 2021.

Comments: Accepted to TACL (pre-MIT Press version)

arXiv:2109.01226 [pdf, other]

doi 10.1109/TCDS.2022.3176783

So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements

Authors: James A. Michaelov, Seana Coulson, Benjamin K. Bergen

Abstract: More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.

Submitted 25 May, 2022; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: Accepted

Journal ref: IEEE Transactions on Cognitive and Developmental Systems (2022)

arXiv:2107.09648 [pdf, other]

Different kinds of cognitive plausibility: why are transformers better than RNNs at predicting N400 amplitude?

Authors: James A. Michaelov, Megan D. Bardolph, Seana Coulson, Benjamin K. Bergen

Abstract: Despite being designed for performance rather than cognitive plausibility, transformer language models have been found to be better at predicting metrics used to assess human language comprehension than language models with other architectures, such as recurrent neural networks. Based on how well they predict the N400, a neural signal associated with processing difficulty, we propose and provide evidence for one possible explanation - their predictions are affected by the preceding context in a way analogous to the effect of semantic facilitation in humans.

Submitted 20 July, 2021; originally announced July 2021.

Journal ref: Proceedings of the 43rd Annual Meeting of the Cognitive Science Society (2021) 300-306

arXiv:2010.04844 [pdf, other]

doi 10.18653/v1/2020.conll-1.53

How well does surprisal explain N400 amplitude under different experimental conditions?

Authors: James A. Michaelov, Benjamin K. Bergen

Abstract: We investigate the extent to which word surprisal can be used to predict a neural measure of human language processing difficulty - the N400. To do this, we use recurrent neural networks to calculate the surprisal of stimuli from previously published neurolinguistic studies of the N400. We find that surprisal can predict N400 amplitude in a wide range of cases, and the cases where it cannot do so provide valuable insight into the neurocognitive processes underlying the response.

Submitted 9 October, 2020; originally announced October 2020.

Comments: To be presented at CoNLL 2020

Journal ref: Proceedings of the 24th Conference on Computational Natural Language Learning (2020) 652-663

arXiv:2007.03097 [pdf, other]

doi 10.1016/j.softx.2020.100602

FleCSPH: The Next Generation FleCSIble Parallel Computational Infrastructure for Smoothed Particle Hydrodynamics

Authors: Julien Loiseau, Hyun Lim, Mark Alexander Kaltenborn, Oleg Korobkin, Christopher M. Mauney, Irina Sagert, Wesley P. Even, Benjamin K. Bergen

Abstract: FleCSPH is a smoothed particle hydrodynamics simulation tool, based on the compile-time configurable framework FleCSI. The asynchronous distributed tree topology combined with a fast multipole method allows FleCSPH to efficiently compute hydrodynamics and long range particle-particle interactions. FleCSPH provides initial data generators, particle relaxation techniques, and standard evolution drivers, which can be easily modified and extended to user-specific setups. Data input/output uses the H5part format, compatible with modern visualization software.

Submitted 6 July, 2020; originally announced July 2020.

arXiv:1312.4991 [pdf, ps, other]

On the velocity space discretization for the Vlasov-Poisson system: comparison between Hermite spectral and Particle-in-Cell methods. Part 2: fully-implicit scheme

Authors: E. Camporeale, G. L. Delzanno, B. K. Bergen, J. D. Moulton

Abstract: We describe a spectral method for the numerical solution of the Vlasov-Poisson system where the velocity space is decomposed by means of an Hermite basis, and the configuration space is discretized via a Fourier decomposition. The novelty of our approach is an implicit time discretization that allows exact conservation of charge, momentum and energy. The computational efficiency and the cost-effectiveness of this method are compared to the fully-implicit PIC method recently introduced by Markidis and Lapenta (2011) and Chen et al. (2011). The following examples are discussed: Langmuir wave, Landau damping, ion-acoustic wave, two-stream instability. The Fourier-Hermite spectral method can achieve solutions that are several orders of magnitude more accurate at a fraction of the cost with respect to PIC. This paper concludes the study presented in Camporeale et al. (2013) where the same method has been described for a semi-implicit time discretization, and was compared against an explicit PIC.

Submitted 17 December, 2013; originally announced December 2013.

Comments: submitted to Journal of Computational Physics; 16 pages, 7 figures. arXiv admin note: text overlap with arXiv:1311.2098

arXiv:1311.2098 [pdf, ps, other]

On the velocity space discretization for the Vlasov-Poisson system: comparison between Hermite spectral and Particle-in-Cell methods. Part 1: semi-implicit scheme

Authors: Enrico Camporeale, Gian Luca Delzanno, Benjamin K. Bergen, J. David Moulton

Abstract: We discuss a spectral method for the numerical solution of the Vlasov-Poisson system where the velocity space is decomposed by means of an Hermite basis. We describe a semi-implicit time discretization that extends the range of numerical stability relative to an explicit scheme. We also introduce and discuss the effects of an artificial collisional operator, which is necessary to take care of the velocity space filamentation problem, unavoidable in collisionless plasmas. The computational efficiency and the cost-effectiveness of this method are compared to a Particle-in-Cell (PIC) method in the case of a two-dimensional phase space. The following examples are discussed: Langmuir wave, Landau damping, ion-acoustic wave, two-stream instability, and plasma echo. The Hermite spectral method can achieve solutions that are several orders of magnitude more accurate (at a fraction of the cost) with respect to the PIC method.

Submitted 17 December, 2013; v1 submitted 8 November, 2013; originally announced November 2013.

Comments: 29 pages; 13 figures; submitted to Journal of Computational Physics

Showing 1–23 of 23 results for author: Bergen, B K