
A blind spot for large language models: Supradiegetic linguistic information

Authors: Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables.
We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.

Submitted 16 May, 2024; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: 21 pages, 6 figures, 3 tables. Accepted at IC2S2 2024. arXiv admin note: text overlap with arXiv:2206.02608, arXiv:2303.12712, arXiv:2305.10601, arXiv:2305.06424, arXiv:1908.08530 by other authors

Journal ref: Plutonics, Volume 17, 2024, pages 107 - 156

arXiv:2305.12160

Park visitation and walkshed demographics in the United States

Authors: Kelsey Linnell, Mikaela Fudolig, Laura Bloomfield, Thomas McAndrew, Taylor H. Ricketts, Jarlath P. M. O'Neil-Dunne, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: A large and growing body of research demonstrates the value of local parks to mental and physical well-being. Recently, researchers have begun using passive digital data sources to investigate equity in usage; exactly who is benefiting from parks? Early studies suggest that park visitation differs according to demographic features, and that the demographic composition of a park's surrounding neighborhood may be related to the utilization a park receives. Employing a data set of park visitations generated by observations of roughly 50 million mobile devices in the US in 2019, we assess the ability of the demographic composition of a park's walkshed to predict its yearly visitation. Predictive models are constructed using Support Vector Regression, LASSO, Elastic Net, and Random Forests. Surprisingly, our results suggest that the demographic composition of a park's walkshed demonstrates little to no utility for predicting visitation.

Submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.08978

An assessment of measuring local levels of homelessness through proxy social media signals

Authors: Yoshi Meke Bird, Sarah E. Grobe, Michael V. Arnold, Sean P. Rogers, Mikaela I. Fudolig, Julia Witte Zimmerman, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Recent studies suggest social media activity can function as a proxy for measures of state-level public health, detectable through natural language processing. We present results of our efforts to apply this approach to estimate homelessness at the state level throughout the US during the period 2010-2019 and 2022 using a dataset of roughly 1 million geotagged tweets containing the substring "homeless." Correlations between homelessness-related tweet counts and ranked per capita homelessness volume, but not general-population densities, suggest a relationship between the likelihood of Twitter users to personally encounter or observe homelessness in their everyday lives and their likelihood to communicate about it online. An increase to the log-odds of "homeless" appearing in an English-language tweet, as well as an acceleration in the increase in average tweet sentiment, suggest that tweets about homelessness are also affected by trends at the nation-scale. Additionally, changes to the lexical content of tweets over time suggest that reversals to the polarity of national or state-level trends may be detectable through an increase in political or service-sector language over the semantics of charity or direct appeals. An analysis of user account type also revealed changes to Twitter-use patterns by accounts authored by individuals versus entities that may provide an additional signal to confirm changes to homelessness density in a given jurisdiction.
While a computational approach to social media analysis may provide a low-cost, real-time dataset rich with information about nationwide and localized impacts of homelessness and homelessness policy, we find that practical issues abound, limiting the potential of social media as a proxy to complement other measures of homelessness.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 29 pages, 21 figures

arXiv:2208.09496

doi 10.1057/s41599-023-01680-4

A decomposition of book structure through ousiometric fluctuations in cumulative word-time

Authors: Mikaela Irene Fudolig, Thayer Alshaabi, Kathryn Cramer, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters.
Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.

Submitted 11 May, 2023; v1 submitted 19 August, 2022; originally announced August 2022.

Comments: published in Humanities and Social Sciences Communications

Journal ref: Humanit Soc Sci Commun 10, 187 (2023)

arXiv:2205.15937

Spatial changes in park visitation at the onset of the pandemic

Authors: Kelsey Linnell, Mikaela Fudolig, Aaron Schwartz, Taylor H. Ricketts, Jarlath P. M. O'Neil-Dunne, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: The COVID-19 pandemic disrupted the mobility patterns of a majority of Americans beginning in March 2020. Despite the beneficial, socially distanced activity offered by outdoor recreation, confusing and contradictory public health messaging complicated access to natural spaces. Working with a dataset comprising the locations of roughly 50 million distinct mobile devices in 2019 and 2020, we analyze weekly visitation patterns for 8,135 parks across the United States. Using Bayesian inference, we identify regions that experienced a substantial change in visitation in the first few weeks of the pandemic. We find that regions that did not exhibit a change were likely to have smaller populations, and to have voted more republican than democrat in the 2020 elections. Our study contributes to a growing body of literature using passive observations to explore who benefits from access to nature.

Submitted 1 April, 2022; originally announced May 2022.

arXiv:2111.03691

The Ball Pit Algorithm: A Markov Chain Monte Carlo Method Based on Path Integrals

Authors: Miguel Fudolig, Reka Howard

Abstract: The Ball Pit Algorithm (BPA) is a novel Markov chain Monte Carlo (MCMC) algorithm for sampling marginal posterior distributions developed from the path integral formulation of the Bayesian analysis for Markov chains. The BPA yielded comparable results to the Hamiltonian Monte Carlo as implemented by the adaptive No U-Turn Sampler (NUTS) in sampling posterior distributions for simulated data from Bernoulli and Poisson likelihoods. One major advantage of the BPA is its significantly lower computational time, which was measured to be at least 95% faster than NUTS in analyzing single parameter models. The BPA was also applied to a multi-parameter Cauchy model using real data of the height differences of cross- and self-fertilized plants. The posterior medians for the location parameter were consistent with other Bayesian sampling methods. Additionally, the posterior median for the logarithm of the scale parameter obtained from the BPA was close to the estimated posterior median calculated using the Laplace normal approximation. The computational time of the BPA implementation of the Cauchy analysis is 55% faster compared to that for NUTS. Overall, we have found that the BPA is a highly efficient alternative to the Hamiltonian Monte Carlo and other standard MCMC methods.

Submitted 5 November, 2021; originally announced November 2021.

arXiv:2110.06847

Ousiometrics and Telegnomics: The essence of meaning conforms to a two-dimensional powerful-weak and dangerous-safe framework with diverse corpora presenting a safety bias

Authors: P. S. Dodds, T. Alshaabi, M. I. Fudolig, J. W. Zimmerman, J. Lovato, S. Beaulieu, J. R. Minot, M. V. Arnold, A. J. Reagan, C. M. Danforth

Abstract: We define 'ousiometrics' to be the study of essential meaning in whatever context that meaningful signals are communicated, and 'telegnomics' as the study of remotely sensed knowledge. From work emerging through the middle of the 20th century, the essence of meaning has become generally accepted as being well captured by the three orthogonal dimensions of evaluation, potency, and activation (EPA). By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms -- 'ousiograms' -- we find here that: 1. The essence of meaning conveyed by words is instead best described by a compass-like power-danger (PD) framework, and 2. Analysis of a disparate collection of large-scale English language corpora -- literature, news, Wikipedia, talk radio, and social media -- shows that natural language exhibits a systematic bias toward safe, low danger words -- a reinterpretation of the Pollyanna principle's positivity bias for written expression. To help justify our choice of dimension names and to help address the problems with representing observed ousiometric dimensions by bipolar adjective pairs, we introduce and explore 'synousionyms' and 'antousionyms' -- ousiometric counterparts of synonyms and antonyms. We further show that the PD framework revises the circumplex model of affect as a more general model of state of mind. Finally, we use our findings to construct and test a prototype 'ousiometer', a telegnomic instrument that measures ousiometric time series for temporal corpora.
We contend that our power-danger ousiometric framework provides a complement for entropy-based measurements, and may be of value for the study of a wide variety of communication across biological and artificial life.

Submitted 29 March, 2023; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 40 pages (34 page main manuscript, 6 page appendix), 15 figures (9 main, 6 appendix), 4 tables

arXiv:2110.00587

doi 10.1007/s41109-022-00446-2

Sentiment and structure in word co-occurrence networks on Twitter

Authors: Mikaela Irene Fudolig, Thayer Alshaabi, Michael V. Arnold, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We explore the relationship between context and happiness scores in political tweets using word co-occurrence networks, where nodes in the network are the words, and the weight of an edge is the number of tweets in the corpus for which the two connected words co-occur. In particular, we consider tweets with hashtags #imwithher and #crookedhillary, both relating to Hillary Clinton's presidential bid in 2016. We then analyze the network properties in conjunction with the word scores by comparing with null models to separate the effects of the network structure and the score distribution. Neutral words are found to be dominant and most words, regardless of polarity, tend to co-occur with neutral words. We do not observe any score homophily among positive and negative words. However, when we perform network backboning, community detection results in word groupings with meaningful narratives, and the happiness scores of the words in each group correspond to its respective theme. Thus, although we observe no clear relationship between happiness scores and co-occurrence at the node or edge level, a community-centric approach can isolate themes of competing sentiments in a corpus.

Submitted 1 October, 2021; originally announced October 2021.

Journal ref: Applied Network Science 7, 9 (2022)
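
Illustrative note: the network construction defined in the abstract above (words as nodes, edge weight equal to the number of tweets in which both words appear) can be sketched in a few lines. This is a minimal sketch and not the authors' code. The function name `cooccurrence_weights`, the lowercasing, the whitespace tokenization, and the toy tweets are all assumptions made for illustration; the paper's actual preprocessing is not described in this listing.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(tweets):
    """Map each unordered word pair to the number of tweets containing both words.

    Hypothetical helper: tokenization here is plain lowercase whitespace
    splitting, which is an assumption rather than the paper's method.
    """
    weights = Counter()
    for tweet in tweets:
        # A set ensures a word repeated inside one tweet is counted once;
        # sorting makes each pair a canonical (alphabetical) tuple.
        words = sorted(set(tweet.lower().split()))
        for pair in combinations(words, 2):
            weights[pair] += 1
    return weights


# Toy corpus, invented for illustration only.
toy_tweets = [
    "vote today",
    "vote early vote often",
    "early vote today",
]
weights = cooccurrence_weights(toy_tweets)
for (word_a, word_b), count in sorted(weights.items()):
    print(word_a, word_b, count)
```

Counting per tweet rather than per occurrence matches the abstract's definition of edge weight as a number of tweets, which is why repeated words within a tweet are collapsed before pairing.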

arXiv:2109.09010

doi 10.3389/frai.2021.783778

Augmenting semantic lexicons using word embeddings and transfer learning

Authors: Thayer Alshaabi, Colin M. Van Oort, Mikaela Irene Fudolig, Michael V. Arnold, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Here, we propose two models for automatic lexicon expansion. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.

Submitted 2 November, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

Comments: 17 pages, 8 figures

Journal ref: Front. Artif. Intell. 4:783778 (2022)

arXiv:2009.00252

doi 10.1140/epjds/s13688-021-00272-z

Internal migration and mobile communication patterns among pairs with strong ties

Authors: Mikaela Irene D. Fudolig, Daniel Monsivais, Kunal Bhattacharya, Hang-Hyun Jo, Kimmo Kaski

Abstract: Using large-scale call detail records of anonymised mobile phone service subscribers with demographic and location information, we investigate how a long-distance residential move within the country affects the mobile communication patterns between an ego who moved and a frequently called alter who did not move. By using clustering methods in analysing the call frequency time series, we find that such ego-alter pairs are grouped into two clusters, those with the call frequency increasing and those with the call frequency decreasing after the move of the ego. This indicates that such residential moves are correlated with a change in the communication pattern soon after moving. We find that the pre-move calling behaviour is a relevant predictor for the post-move calling behaviour. While demographic and location information can help in predicting whether the call frequency will rise or decay, they are not relevant in predicting the actual call frequency volume. We also note that at four months after the move, most of these close pairs maintain contact, even if the call frequency is decreased.

Submitted 5 April, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

Comments: published version in EPJ Data Science

Journal ref: EPJ Data Science 10, 16 (2021)

arXiv:1907.13334

doi 10.1371/journal.pone.0227037

Link-centric analysis of variation by demographics in mobile phone communication patterns

Authors: Mikaela Irene D. Fudolig, Kunal Bhattacharya, Daniel Monsivais, Hang-Hyun Jo, Kimmo Kaski

Abstract: We present a link-centric approach to study variation in the mobile phone communication patterns of individuals. Unlike most previous research on call detail records that focused on the variation of phone usage across individual users, we examine how the calling and texting patterns obtained from call detail records vary among pairs of users and how these patterns are affected by the nature of relationships between users. To demonstrate this link-centric perspective, we extract factors that contribute to the variation in the mobile phone communication patterns and predict demographics-related quantities for pairs of users. The time of day and the channel of communication (calls or texts) are found to explain most of the variance among pairs that frequently call each other. Furthermore, we find that this variation can be used to predict the relationship between the pairs of users, as inferred from their age and gender, as well as the age of the younger user in a pair. From the classifier performance across different age and gender groups as well as the inherent class overlap suggested by the estimate of the bounds of the Bayes error, we gain insights into the similarity and differences of communication patterns across different relationships.

Submitted 16 December, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

Journal ref: PLoS ONE 15(1) (2020): e0227037

arXiv:1808.10166

doi 10.1007/s42001-019-00054-8

Different patterns of social closeness observed in mobile phone communication

Authors: Mikaela Irene D. Fudolig, Daniel Monsivais, Kunal Bhattacharya, Hang-Hyun Jo, Kimmo Kaski

Abstract: We analyze a large-scale mobile phone call dataset containing information on the age, gender, and billing locality of users to get insight into social closeness in pairs of individuals of similar age. We show that in addition to using the demographic information, the ranking of contacts by their call frequency in egocentric networks is crucial to characterize the different communication patterns. We find that mutually top-ranked opposite-gender pairs show the highest levels of call frequency and daily regularity, which is consistent with the behavior of real-life romantic partners. At somewhat lower level of call frequency and daily regularity come the mutually top-ranked same-gender pairs, while the lowest call frequency and daily regularity are observed for mutually non-top-ranked pairs. We have also observed that older pairs tend to call less frequently and less regularly than younger pairs, while the average call durations exhibit a more complex dependence on age. We expect that a more detailed analysis can help us better characterize the nature of relationships between pairs of individuals and distinguish between various types of relations, such as siblings, friends, and romantic partners.

Submitted 7 August, 2019; v1 submitted 30 August, 2018; originally announced August 2018.

Comments: 17 pages, 5 figures, 2 tables. Journal of Computational Social Science (published online, 2019)

Showing 1–12 of 12 results for author: Fudolig, M