-
Stochastic Guidance of Buoyancy Controlled Vehicles under Ice Shelves using Ocean Currents
Authors:
Federico Rossi,
Andrew Branch,
Michael P. Schodlok,
Timothy Stanton,
Ian G. Fenty,
Joshua Vander Hook,
Evan B. Clark
Abstract:
We propose a novel technique for guidance of buoyancy-controlled vehicles in uncertain under-ice ocean flows. In-situ melt rate measurements collected at the grounding zone of Antarctic ice shelves, where the ice shelf meets the underlying bedrock, are essential to constrain models of future sea level rise. Buoyancy-controlled vehicles, which control their vertical position in the water column through internal actuation but have no means of horizontal propulsion, offer an affordable and reliable platform for such in-situ data collection. However, reaching the grounding zone requires vehicles to traverse tens of kilometers under the ice shelf, with approximate position knowledge and no means of communication, in highly variable and uncertain ocean currents. To address this challenge, we propose a partially observable MDP approach that exploits model-based knowledge of the under-ice currents and, critically, of their uncertainty, to synthesize effective guidance policies. The approach uses approximate dynamic programming to model uncertainty in the currents, and QMDP to address localization uncertainty. Numerical experiments show that the policy can deliver up to 88.8% of underwater vehicles to the grounding zone -- a 33% improvement compared to state-of-the-art guidance techniques, and a 262% improvement over uncontrolled drifters. Collectively, these results show that model-based under-ice guidance is a highly promising technique for exploration of under-ice cavities, and has the potential to enable cost-effective and scalable access to these challenging and rarely observed environments.
Submitted 10 June, 2024;
originally announced June 2024.
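The QMDP step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard QMDP rule: weight the fully observable Q-values by the current belief over states and pick the action with the highest expected value. The state names, action names, and Q-values below are hypothetical toy numbers and are not taken from the paper.

```python
def qmdp_action(belief, q_values):
    """Pick an action with the QMDP rule.

    belief:   dict mapping state -> probability (assumed to sum to 1)
    q_values: dict mapping state -> dict mapping action -> Q(state, action),
              assumed to come from solving the fully observable MDP
    Returns (best_action, scores), where scores[a] = sum_s belief[s] * Q(s, a).
    """
    actions = list(next(iter(q_values.values())).keys())
    scores = {
        a: sum(p * q_values[s][a] for s, p in belief.items())
        for a in actions
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy problem: two candidate vehicle positions, two depth commands.
toy_q = {
    "cell_A": {"ascend": 1.0, "descend": 0.0},
    "cell_B": {"ascend": 0.0, "descend": 2.0},
}

action, scores = qmdp_action({"cell_A": 0.7, "cell_B": 0.3}, toy_q)
print(action, scores)
```

QMDP treats the state as if it became fully observable after one step, so it does not value information-gathering actions; that trade-off is a property of the approximation in general, not a claim about the paper's results.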
-
Adaptive Teaching in Heterogeneous Agents: Balancing Surprise in Sparse Reward Scenarios
Authors:
Emma Clark,
Kanghyun Ryu,
Negar Mehr
Abstract:
Learning from Demonstration (LfD) can be an efficient way to train systems with analogous agents by enabling "Student" agents to learn from the demonstrations of the most experienced "Teacher" agent, instead of training their policy in parallel. However, when there are discrepancies in agent capabilities, such as divergent actuator power or joint angle constraints, naively replicating demonstrations that are out of bounds for the Student's capability can limit efficient learning. We present a Teacher-Student learning framework specifically tailored to address the challenge of heterogeneity between the Teacher and Student agents. Our framework is based on the concept of "surprise", inspired by its application in exploration incentivization in sparse-reward environments. Surprise is repurposed to enable the Teacher to detect and adapt to differences between itself and the Student. By focusing on maximizing its surprise in response to the environment while concurrently minimizing the Student's surprise in response to the demonstrations, the Teacher agent can effectively tailor its demonstrations to the Student's specific capabilities and constraints. We validate our method by demonstrating improvements in the Student's learning in control tasks within sparse-reward environments.
Submitted 23 May, 2024;
originally announced May 2024.
-
On HTLC-Based Protocols for Multi-Party Cross-Chain Swaps
Authors:
Emily Clark,
Chloe Georgiou,
Katelyn Poon,
Marek Chrobak
Abstract:
In his 2018 paper, Herlihy introduced an atomic protocol for multi-party asset swaps across different blockchains. His model represents an asset swap by a directed graph whose nodes are the participating parties and whose edges represent asset transfers, and the rational behavior of the participants is captured by a preference relation between a protocol's outcomes. Asset transfers between parties are achieved using smart contracts. These smart contracts are quite involved: they require storage and processing of a large number of paths in the swap digraph, which limits the practical significance of his protocol. His paper also describes a different protocol that uses only standard hash time-lock contracts (HTLCs), but this simpler protocol applies only to some special types of digraphs. He left open the question of whether there is a simple and efficient protocol for cross-chain asset swaps in arbitrary digraphs. Motivated by this open problem, we conducted a comprehensive study of HTLC-based protocols, in which all asset transfers are implemented with HTLCs. Our main contribution is a full characterization of swap digraphs that have such protocols.
Submitted 14 March, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
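For readers unfamiliar with the building block named in the abstract above, here is a minimal sketch of a single hash time-lock contract. It is a toy in-memory model, not blockchain code and not the paper's protocol: the class name `HTLC`, its methods, and the integer clock are all hypothetical. It assumes the usual HTLC semantics, in which the receiver can claim the asset by revealing the hash preimage before the timeout, and the sender can reclaim it once the timeout has passed.

```python
import hashlib


class HTLC:
    """Toy hash time-lock contract (hypothetical, for illustration only)."""

    def __init__(self, hashlock: bytes, timeout: int):
        self.hashlock = hashlock  # SHA-256 digest of the secret
        self.timeout = timeout    # time at which a refund becomes possible
        self.state = "locked"     # "locked" -> "claimed" or "refunded"

    def claim(self, preimage: bytes, now: int) -> bool:
        """Receiver takes the asset by revealing the secret before the timeout."""
        ok = (
            self.state == "locked"
            and now < self.timeout
            and hashlib.sha256(preimage).digest() == self.hashlock
        )
        if ok:
            self.state = "claimed"
        return ok

    def refund(self, now: int) -> bool:
        """Sender recovers the asset if it is still locked at or after the timeout."""
        ok = self.state == "locked" and now >= self.timeout
        if ok:
            self.state = "refunded"
        return ok


secret = b"swap-secret"
contract = HTLC(hashlib.sha256(secret).digest(), timeout=100)
print(contract.claim(b"wrong-guess", now=10))  # wrong preimage is rejected
print(contract.claim(secret, now=10))          # correct preimage before timeout
```

In a multi-party swap, several such contracts share one hashlock, so revealing the secret to claim one asset lets the next party claim theirs; how timeouts must be ordered across the digraph is the kind of question the paper studies and is not modeled here.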
-
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
Authors:
Quan Wang,
Yiling Huang,
Guanlong Zhao,
Evan Clark,
Wei Xia,
Hank Liao
Abstract:
In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLM) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.
Submitted 26 June, 2024; v1 submitted 7 January, 2024;
originally announced January 2024.
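The word diarization error rate (WDER) reported in the abstract above can be illustrated with a simplified sketch. It assumes that hypothesis and reference words are already aligned one-to-one and that speaker labels have already been mapped between the two systems, so WDER reduces to the fraction of words carrying the wrong speaker label. The function name and the toy labels are hypothetical; a full implementation would also handle word alignment and speaker-label permutation.

```python
def wder(ref_speakers, hyp_speakers):
    """Simplified WDER: fraction of aligned words with an incorrect speaker label.

    ref_speakers, hyp_speakers: per-word speaker labels, assumed to be aligned
    one-to-one and to use the same label names.
    """
    if len(ref_speakers) != len(hyp_speakers):
        raise ValueError("label sequences must be aligned to the same length")
    if not ref_speakers:
        raise ValueError("at least one word is required")
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / len(ref_speakers)


# Toy example: four words, the second one is attributed to the wrong speaker.
reference = ["spk1", "spk1", "spk2", "spk2"]
hypothesis = ["spk1", "spk2", "spk2", "spk2"]
print(wder(reference, hypothesis))
```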
-
Predicting Spine Geometry and Scoliosis from DXA Scans
Authors:
Amir Jamaludin,
Timor Kadir,
Emma Clark,
Andrew Zisserman
Abstract:
Our objective in this paper is to estimate spine curvature in DXA scans. To this end we first train a neural network to predict the middle spine curve in the scan, and then use an integral-based method to determine the curvature along the spine curve. We use the curvature to compare to the standard angle scoliosis measure obtained using the DXA Scoliosis Method (DSM). The performance improves over the prior work of Jamaludin et al. 2018. We show that the maximum curvature can be used as a scoring function for ordering the severity of spinal deformation.
Submitted 15 November, 2023;
originally announced November 2023.
-
Temporal Evolution of Risk Behavior in a Disease Spread Simulation
Authors:
Ollin D. Langle-Chimal,
Scott C. Merrill,
Eric M. Clark,
Gabriela Bucini,
Tung-Lin Liu,
Trisha R. Shrum,
Christopher Koliba,
Asim Zia,
Julia M. Smith,
Nicholas Cheney
Abstract:
Human behavior is a dynamic process that evolves with experience. Understanding the evolution of individuals' risk propensity is critical to designing public health interventions that encourage the adoption of better biosecurity protocols and thus prevent the transmission of infectious disease. Using an experimental game that simulates the spread of a disease in a network of porcine farms, we measure how learning from experience affects the risk aversion of over 1000 players. We use a fully automated approach to segment the players into 4 categories based on the temporal trends of their game plays and compare the outcomes of their overall game performance. We find that the risk-tolerant group is 50% more likely to incur an infection than the risk-averse one. We also find that, while all individuals decrease the amount of time it takes to make decisions as they become more experienced at the game, one group of players with constant decision strategies rapidly decreases its time to make a decision, whereas a second, context-aware group contemplates longer before decisions, presumably while performing a real-time risk assessment. The behavioral strategies employed by players in this simulated setting could be used in the future as an early warning signal to identify undesirable biosecurity-related risk aversion preferences, or changes in behavior, which may allow for targeted interventions to help mitigate them.
Submitted 1 June, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting
Authors:
Akhila Yerukola,
Xuhui Zhou,
Elizabeth Clark,
Maarten Sap
Abstract:
Most existing stylistic text rewriting methods and evaluation metrics operate on a sentence level, but ignoring the broader context of the text can lead to preferring generic, ambiguous, and incoherent rewrites. In this paper, we investigate integrating the preceding textual context into both the $\textit{rewriting}$ and $\textit{evaluation}$ stages of stylistic text rewriting, and introduce a new composite contextual evaluation metric $\texttt{CtxSimFit}$ that combines similarity to the original sentence with contextual cohesiveness. We comparatively evaluate non-contextual and contextual rewrites in formality, toxicity, and sentiment transfer tasks. Our experiments show that humans significantly prefer contextual rewrites as more fitting and natural over non-contextual ones, yet existing sentence-level automatic metrics (e.g., ROUGE, SBERT) correlate poorly with human preferences ($ρ$=0--0.3). In contrast, human preferences are much better reflected by both our novel $\texttt{CtxSimFit}$ ($ρ$=0.7--0.9) as well as proposed context-infused versions of common metrics ($ρ$=0.4--0.7). Overall, our findings highlight the importance of integrating context into the generation and especially the evaluation stages of stylistic text rewriting.
Submitted 23 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Authors:
Elizabeth Clark,
Shruti Rijhwani,
Sebastian Gehrmann,
Joshua Maynez,
Roee Aharoni,
Vitaly Nikolaev,
Thibault Sellam,
Aditya Siddhant,
Dipanjan Das,
Ankur P. Parikh
Abstract:
Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics, as well as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.
Submitted 1 November, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Authors:
Anya Belz,
Craig Thomson,
Ehud Reiter,
Gavin Abercrombie,
Jose M. Alonso-Moral,
Mohammad Arvan,
Anouck Braggaar,
Mark Cieliebak,
Elizabeth Clark,
Kees van Deemter,
Tanvi Dinkar,
Ondřej Dušek,
Steffen Eger,
Qixiang Fang,
Mingqi Gao,
Albert Gatt,
Dimitra Gkatzia,
Javier González-Corbelle,
Dirk Hovy,
Manuela Hürlimann,
Takumi Ito,
John D. Kelleher,
Filip Klubicka,
Emiel Krahmer,
Huiyuan Lai
, et al. (17 additional authors not shown)
Abstract:
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Submitted 7 August, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
mFACE: Multilingual Summarization with Factual Consistency Evaluation
Authors:
Roee Aharoni,
Shashi Narayan,
Joshua Maynez,
Jonathan Herzig,
Elizabeth Clark,
Mirella Lapata
Abstract:
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
Submitted 5 January, 2024; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Authors:
Lining Zhang,
Simon Mille,
Yufang Hou,
Daniel Deutsch,
Elizabeth Clark,
Yixin Liu,
Saad Mahamood,
Sebastian Gehrmann,
Miruna Clinciu,
Khyathi Chandu,
João Sedoc
Abstract:
To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and with CloudResearch workers, their alignment with expert judgments on a subset of the data is lower than expected, indicating that the workers need further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.
Submitted 13 June, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Pacific Lamprey Inspired Climbing
Authors:
Brian Van Stratum,
Kourosh Shoele,
Jonathan E. Clark
Abstract:
Snakes and their bio-inspired robot counterparts have demonstrated locomotion on a wide range of terrains. However, dynamic vertical climbing is one locomotion strategy that has received little attention in the existing snake robotics literature. We demonstrate a new scansorial gait and robot inspired by the locomotion of the Pacific Lamprey. This new gait allows a robot to steer while climbing on flat, near-vertical surfaces. A reduced-order model is developed and used to explore the relationship between body actuation and vertical and lateral motions of the robot. Trident, the new wall climbing lamprey-inspired robot, demonstrates dynamic climbing on flat vertical surfaces with a peak net vertical stride displacement of 4.1 cm per step. Actuating at 1.3 Hz, Trident attains a vertical climbing speed of 4.8 cm/s (0.09 Bl/s) at specific resistance of 8.3. Trident can also traverse laterally at 9 cm/s (0.17 Bl/s). Moreover, Trident is able to make 14% longer strides than the Pacific Lamprey when climbing vertically. The computational and experimental results demonstrate that a lamprey-inspired climbing gait coupled with appropriate attachment is a useful climbing strategy for snake robots climbing near vertical surfaces with limited push points.
Submitted 13 December, 2022;
originally announced December 2022.
-
Dialect-robust Evaluation of Generated Text
Authors:
Jiao Sun,
Thibault Sellam,
Elizabeth Clark,
Tu Vu,
Timothy Dozat,
Dan Garrette,
Aditya Siddhant,
Jacob Eisenstein,
Sebastian Gehrmann
Abstract:
Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, currently, there exists no way to quantify how metrics respond to change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve the dialect robustness while simultaneously improving their performance on the standard metric benchmark.
Submitted 2 November, 2022;
originally announced November 2022.
-
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Authors:
Sebastian Gehrmann,
Abhik Bhattacharjee,
Abinaya Mahendiran,
Alex Wang,
Alexandros Papangelis,
Aman Madaan,
Angelina McMillan-Major,
Anna Shvets,
Ashish Upadhyay,
Bingsheng Yao,
Bryan Wilie,
Chandra Bhagavatula,
Chaobin You,
Craig Thomson,
Cristina Garbacea,
Dakuo Wang,
Daniel Deutsch,
Deyi Xiong,
Di Jin,
Dimitra Gkatzia,
Dragomir Radev,
Elizabeth Clark,
Esin Durmus,
Faisal Ladhak,
Filip Ginter
, et al. (52 additional authors not shown)
Abstract:
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
Submitted 24 June, 2022; v1 submitted 22 June, 2022;
originally announced June 2022.
-
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Authors:
Sebastian Gehrmann,
Elizabeth Clark,
Thibault Sellam
Abstract:
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.
Submitted 14 February, 2022;
originally announced February 2022.
-
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Authors:
Elizabeth Clark,
Tal August,
Sofia Serrano,
Nikita Haduong,
Suchin Gururangan,
Noah A. Smith
Abstract:
Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators' accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.
Submitted 7 July, 2021; v1 submitted 30 June, 2021;
originally announced July 2021.
-
PySensors: A Python Package for Sparse Sensor Placement
Authors:
Brian M. de Silva,
Krithika Manohar,
Emily Clark,
Bingni W. Brunton,
Steven L. Brunton,
J. Nathan Kutz
Abstract:
PySensors is a Python package for selecting and placing a sparse set of sensors for classification and reconstruction tasks. Specifically, PySensors implements algorithms for data-driven sparse sensor placement optimization for reconstruction (SSPOR) and sparse sensor placement optimization for classification (SSPOC). In this work we provide a brief description of the mathematical algorithms and theory for sparse sensor optimization, along with an overview and demonstration of the features implemented in PySensors (with code examples). We also include practical advice for users and a list of potential extensions to PySensors. Software is available at https://github.com/dynamicslab/pysensors.
Submitted 20 February, 2021;
originally announced February 2021.
-
Bracketing brackets with bras and kets
Authors:
Emily Clark,
Angelie Vincent,
J. Nathan Kutz,
Steven L. Brunton
Abstract:
Brackets are an essential component in aircraft manufacture and design, joining parts together, supporting weight, holding wires, and strengthening joints. Hundreds or thousands of unique brackets are used in every aircraft, but manufacturing a large number of distinct brackets is inefficient and expensive. Fortunately, many so-called "different" brackets are in fact very similar or even identical to each other. In this manuscript, we present a data-driven framework for constructing a comparatively small group of representative brackets from a large catalog of current brackets, based on hierarchical clustering of bracket data. We find that for a modern commercial aircraft, the full set of brackets can be reduced by 30% while still describing half of the test set sufficiently accurately. This approach is based on designing an inner product that quantifies a multi-objective similarity between two brackets, which are the "bra" and the "ket" of the inner product. Although we demonstrate this algorithm to reduce the number of brackets in aerospace manufacturing, it may be generally applied to any large-scale component standardization effort.
Submitted 31 July, 2020;
originally announced August 2020.
-
Evaluation of Text Generation: A Survey
Authors:
Asli Celikyilmaz,
Elizabeth Clark,
Jianfeng Gao
Abstract:
The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples for task-specific NLG evaluations for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.
Submitted 18 May, 2021; v1 submitted 26 June, 2020;
originally announced June 2020.
-
Assessing the impact of the coronavirus lockdown on unhappiness, loneliness, and boredom using Google Trends
Authors:
Abel Brodeur,
Andrew E. Clark,
Sarah Fleche,
Nattavudh Powdthavee
Abstract:
The COVID-19 pandemic has led many governments to implement lockdowns. While lockdowns may help to contain the spread of the virus, it is possible that substantial damage to population well-being will result. This study relies on Google Trends data and tests whether the lockdowns implemented in Europe and America led to changes in well-being related topic search terms. Using different methods to evaluate the causal effects of lockdown, we find a substantial increase in the search intensity for boredom in Europe and the US. We also found a significant increase in searches for loneliness, worry and sadness, while searches for stress, suicide and divorce on the contrary fell. Our results suggest that people's mental health may have been severely affected by the lockdown.
Submitted 25 April, 2020;
originally announced April 2020.
-
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Authors:
Rowan Zellers,
Ari Holtzman,
Elizabeth Clark,
Lianhui Qin,
Ali Farhadi,
Yejin Choi
Abstract:
We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
Submitted 12 April, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Effects of Social Cues on Biosecurity Compliance in Livestock Facilities: Evidence from Experimental Simulations
Authors:
Luke Trinity,
Scott C. Merrill,
Eric Clark,
Christopher J. Koliba,
Asim Zia,
Gabriela Bucini,
Julia M. Smith
Abstract:
Disease outbreaks in U.S. animal livestock industries have economic impacts measured in hundreds of millions of dollars per year. Biosecurity, or procedures intended to protect animals against disease, is known to be effective at reducing infection risk at facilities. Yet to the detriment of animal health, humans do not always follow biosecurity protocols. Human behavioral factors have been shown to influence willingness to follow biosecurity protocols. Here we show how social cues may affect cooperation with a biosecurity practice. Participants were immersed in a simulated swine production facility through a graphical user interface and prompted to make a decision that addressed their willingness to comply with a biosecurity practice. We tested the effect of varying three experimental variables: (1) the risk of acquiring an infection, (2) the delivery method of the infection risk information (numerical versus graphical), and (3) behavior of an automated coworker in the facility. We provide evidence that participants changed their behavior when they observed a simulated worker making a choice to follow or not follow a biosecurity protocol, even though the simulated worker had no economic effect on the participants' payouts. These results advance the understanding of human behavioral effects on biosecurity protocol decisions, demonstrating that social cues need to be considered by livestock facility managers when developing policies to make agricultural systems more disease resilient.
Submitted 28 October, 2019;
originally announced October 2019.
-
Using Digital Field Experiments To Elicit Risk Mitigation Behavioral Strategies For Disease Management Across Agricultural Production Systems
Authors:
Eric M. Clark,
Scott C. Merrill,
Luke Trinity,
Gabriela Bucini,
Nicholas Cheney,
Ollin Langle-Chimal,
Trisha Shrum,
Christopher Koliba,
Asim Zia,
Julia M. Smith
Abstract:
Failing to mitigate propagation of disease spread can result in dire economic consequences for agricultural networks. Pathogens like Porcine Epidemic Diarrhea virus can quickly spread among producers. Biosecurity is designed to prevent infection transmission. When considering biosecurity investments, management must balance the cost of protection versus the consequences of contracting an infection. Thus, an examination of the decision-making processes associated with investment in biosecurity is important for enhancing system-wide biosecurity. Data gathered from digital field experiments can provide insights into behavioral strategies and inform the development of decision support systems. We created an online digital experiment to simulate outbreak scenarios among swine production supply chains, where participants were tasked with making biosecurity investment decisions. In Experiment One, we quantified the risk associated with each participant's decisions and delineated three dominant categories of risk attitudes: risk averse, risk tolerant, and opportunistic. Each risk class exhibited unique approaches in reaction to risk and disease information. We also tested how information uncertainty affects risk aversion, by varying the amount of visibility of the infection as well as the amount of biosecurity implemented across the system. We found evidence that more visibility in the number of infected sites increases risk-averse behaviors, while more visibility in the amount of neighboring biosecurity increased risk-taking behaviors. In Experiment Two, we were surprised to find no evidence for differences in behavior of livestock specialists compared to Amazon Mechanical Turk participants. Our findings provide support for using digital field experiments to study how risk communication affects behavior, which can provide insights towards more effective messaging strategies.
Submitted 1 October, 2019; v1 submitted 20 September, 2019;
originally announced September 2019.
-
Counterfactual Story Reasoning and Generation
Authors:
Lianhui Qin,
Antoine Bosselut,
Ari Holtzman,
Chandra Bhagavatula,
Elizabeth Clark,
Yejin Choi
Abstract:
Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes. Despite being considered a necessary component of AI-complete systems, few resources have been developed for evaluating counterfactual reasoning in narratives.
In this paper, we propose Counterfactual Story Rewriting: given an original story and an intervening counterfactual event, the task is to minimally revise the story to make it compatible with the given counterfactual event. Solving this task will require deep understanding of causal narrative chains and counterfactual invariance, and integration of such story reasoning capabilities into conditional language generation models.
We present TimeTravel, a new dataset of 29,849 counterfactual rewritings, each with the original story, a counterfactual event, and human-generated revision of the original story compatible with the counterfactual event. Additionally, we include 80,115 counterfactual "branches" without a rewritten storyline to support future work on semi- or un-supervised approaches to counterfactual story rewriting.
Finally, we evaluate the counterfactual rewriting capacities of several competitive baselines based on pretrained language models, and assess whether common overlap and model-based automatic metrics for text generation correlate well with human scores for counterfactual rewriting.
Submitted 12 September, 2019; v1 submitted 9 September, 2019;
originally announced September 2019.
-
A Sentiment Analysis of Breast Cancer Treatment Experiences and Healthcare Perceptions Across Twitter
Authors:
Eric M. Clark,
Ted James,
Chris A. Jones,
Amulya Alapati,
Promise Ukandu,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Background: Social media has the capacity to provide the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and post treatment. In prior work [Crannell et al.], we studied an active cancer patient population on Twitter and compiled a set of tweets describing their experience with this disease. We refer to these online public testimonies as "Invisible Patient Reported Outcomes" (iPROs), because they carry relevant indicators, yet are difficult to capture by conventional means of self-report. Methods: Our present study aims to identify tweets related to the patient experience as an additional informative tool for monitoring public health. Using Twitter's public streaming API, we compiled over 5.3 million "breast cancer" related tweets spanning September 2016 until mid-December 2017. We combined supervised machine learning methods with natural language processing to sift tweets relevant to breast cancer patient experiences. We analyzed a sample of 845 breast cancer patient and survivor accounts, responsible for over 48,000 posts. We investigated tweet content with a hedonometric sentiment analysis to quantitatively extract emotionally charged topics. Results: We found that positive experiences were shared regarding patient treatment, raising support, and spreading awareness. Further discussions related to healthcare were prevalent and largely negative, focusing on fear of political legislation that could result in loss of coverage. Conclusions: Social media can provide a positive outlet for patients to discuss their needs and concerns regarding their healthcare coverage and treatment needs. Capturing iPROs from online communication can help inform healthcare professionals and lead to more connected and personalized treatment regimens.
Submitted 12 October, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Sounding Board: A User-Centric and Content-Driven Social Chatbot
Authors:
Hao Fang,
Hao Cheng,
Maarten Sap,
Elizabeth Clark,
Ari Holtzman,
Yejin Choi,
Noah A. Smith,
Mari Ostendorf
Abstract:
We present Sounding Board, a social chatbot that won the 2017 Amazon Alexa Prize. The system architecture consists of several components including spoken language processing, dialogue management, language generation, and content management, with emphasis on user-centric and content-driven design. We also share insights gained from large-scale online logs based on 160,000 conversations with real-world users.
Submitted 26 April, 2018;
originally announced April 2018.
-
Fusion of finite set distributions: Pointwise consistency and global cardinality
Authors:
Murat Üney,
Jérémie Houssineau,
Emmanuel Delande,
Simon J. Julier,
Daniel E. Clark
Abstract:
A recent trend in distributed multi-sensor fusion is to use random finite set filters at the sensor nodes and fuse the filtered distributions algorithmically using their exponential mixture densities (EMDs). Fusion algorithms which extend the celebrated covariance intersection and consensus based approaches are such examples. In this article, we analyse the variational principle underlying EMDs and show that the EMDs of finite set distributions do not necessarily lead to consistent fusion of cardinality distributions. Indeed, we demonstrate that these inconsistencies may occur with overwhelming probability in practice, through examples with Bernoulli, Poisson and independent identically distributed (IID) cluster processes. We prove that pointwise consistency of EMDs does not imply consistency in global cardinality and vice versa. Then, we redefine the variational problems underlying fusion and provide iterative solutions thereby establishing a framework that guarantees cardinality consistent fusion.
Submitted 3 December, 2018; v1 submitted 17 February, 2018;
originally announced February 2018.
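For readers unfamiliar with the term, the exponential mixture density (EMD) named in the abstract above is commonly written as a normalized weighted geometric mean of the densities being fused. The two-sensor form below is a sketch of that standard definition, not a formula taken from the paper; the symbols $p_1$, $p_2$, and the weight $\omega$ are notation assumed here for illustration.

```latex
% EMD of two densities p_1 and p_2 with fusion weight \omega \in [0, 1]
% (notation assumed for illustration, not taken from the paper)
p_{\omega}(x) =
  \frac{p_1(x)^{1-\omega}\, p_2(x)^{\omega}}
       {\int p_1(x')^{1-\omega}\, p_2(x')^{\omega}\, \mathrm{d}x'}
```

When $p_1$ and $p_2$ are Gaussian, this expression reduces to the covariance intersection rule that the abstract cites as a special case.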
-
Latent Parameter Estimation in Fusion Networks Using Separable Likelihoods
Authors:
Murat Uney,
Bernard Mulgrew,
Daniel E Clark
Abstract:
Multi-sensor state space models underpin fusion applications in networks of sensors. Estimation of latent parameters in these models has the potential to provide highly desirable capabilities such as network self-calibration. Conventional solutions to the problem pose difficulties in scaling with the number of sensors due to the joint multi-sensor filtering involved when evaluating the parameter likelihood. In this article, we propose a separable pseudo-likelihood which is a more accurate approximation compared to a previously proposed alternative under typical operating conditions. In addition, we consider using separable likelihoods in the presence of many objects and ambiguity in associating measurements with objects that originated them. To this end, we use a state space model with a hypothesis-based parameterisation and develop an empirical Bayesian perspective in order to evaluate separable likelihoods on this model using local filtering. Bayesian inference with this likelihood is carried out using belief propagation on the associated pairwise Markov random field. We specify a particle algorithm for latent parameter estimation in a linear Gaussian state space model and demonstrate its efficacy for network self-calibration using measurements from non-cooperative targets in comparison with alternatives.
Submitted 2 January, 2018; v1 submitted 2 August, 2017;
originally announced August 2017.
-
Bayesian data assimilation based on a family of outer measures
Authors:
Jeremie Houssineau,
Daniel E. Clark
Abstract:
A flexible representation of uncertainty that remains within the standard framework of probabilistic measure theory is presented along with a study of its properties. This representation relies on a specific type of outer measure that is based on the measure of a supremum, hence combining additive and highly sub-additive components. It is shown that this type of outer measure enables the introduction of intuitive concepts such as pullback and general data assimilation operations.
Submitted 9 November, 2016;
originally announced November 2016.
-
Vaporous Marketing: Uncovering Pervasive Electronic Cigarette Advertisements on Twitter
Authors:
Eric M. Clark,
Chris A. Jones,
Jake Ryland Williams,
Allison N. Kurti,
Michell Craig Nortotsky,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Background: Twitter has become the "wild-west" of marketing and promotional strategies for advertisement agencies. Electronic cigarettes have been heavily marketed across Twitter feeds, offering discounts, "kid-friendly" flavors, algorithmically generated false testimonials, and free samples. Methods: All electronic cigarette keyword related tweets from a 10% sample of Twitter spanning January 2012 through December 2014 (approximately 850,000 total tweets) were identified and categorized as Automated or Organic by combining a keyword classification and a machine trained Human Detection algorithm. A sentiment analysis using Hedonometrics was performed on Organic tweets to quantify the change in consumer sentiments over time. Commercialized tweets were topically categorized with key phrasal pattern matching. Results: The overwhelming majority (80%) of tweets were classified as automated or promotional in nature. The majority of these tweets were coded as commercialized (83.65% in 2013), up to 33% of which offered discounts or free samples and appeared on over a billion Twitter feeds as impressions. The positivity of Organic (human) classified tweets has decreased over time (5.84 in 2013 to 5.77 in 2014) due to a relative increase in the negative words ban, tobacco, doesn't, drug, against, poison, tax and a relative decrease in positive words like haha, good, cool. Automated tweets are more positive than organic (6.17 versus 5.84) due to a relative increase in the marketing words best, win, buy, sale, health, discount and a relative decrease in negative words like bad, hate, stupid, don't. Conclusions: Due to the youth presence on Twitter and the clinical uncertainty of the long term health complications of electronic cigarette consumption, the protection of public health warrants scrutiny and potential regulation of social media marketing.
Submitted 5 March, 2016; v1 submitted 7 August, 2015;
originally announced August 2015.
-
Reply to Garcia et al.: Common mistakes in measuring frequency dependent word characteristics
Authors:
P. S. Dodds,
E. M. Clark,
S. Desu,
M. R. Frank,
A. J. Reagan,
J. R. Williams,
L. Mitchell,
K. D. Harris,
I. M. Kloumann,
J. P. Bagrow,
K. Megerdoomian,
M. T. McMahon,
B. F. Tivnan,
C. M. Danforth
Abstract:
We demonstrate that the concerns expressed by Garcia et al. are misplaced, due to (1) a misreading of our findings in [1]; (2) a widespread failure to examine and present words in support of asserted summary quantities based on word usage frequencies; and (3) a range of misconceptions about word usage frequency, word rank, and expert-constructed word lists. In particular, we show that the English component of our study compares well statistically with two related surveys, that no survey design influence is apparent, and that estimates of measurement error do not explain the positivity biases reported in our work and that of others. We further demonstrate that for the frequency dependence of positivity---of which we explored the nuances in great detail in [1]---Garcia et al. did not perform a reanalysis of our data---they instead carried out an analysis of a different, statistically improper data set and introduced a nonlinearity before performing linear regression.
Submitted 28 May, 2015; v1 submitted 25 May, 2015;
originally announced May 2015.
-
Sifting Robotic from Organic Text: A Natural Language Approach for Detecting Automation on Twitter
Authors:
Eric M. Clark,
Jake Ryland Williams,
Chris A. Jones,
Richard A. Galbraith,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. Due to the increasing popularity of Twitter, its perceived potential for exerting social influence has led to the rise of a diverse community of automatons, commonly referred to as bots. These inorganic and semi-organic Twitter entities can range from the benevolent (e.g., weather-update bots, help-wanted-alert bots) to the malevolent (e.g., spamming messages, advertisements, or radical opinions). Existing detection algorithms typically leverage meta-data (time between tweets, number of followers, etc.) to identify robotic accounts. Here, we present a powerful classification scheme that exclusively uses the natural language text from organic users to provide a criterion for identifying accounts posting automated messages. Since the classifier operates on text alone, it is flexible and may be applied to any textual data beyond the Twitter-sphere.
Submitted 14 June, 2016; v1 submitted 16 May, 2015;
originally announced May 2015.
-
Identifying missing dictionary entries with frequency-conserving context models
Authors:
Jake Ryland Williams,
Eric M. Clark,
James P. Bagrow,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary---an extensive, online, collaborative, and open-source dictionary that contains over 100,000 phrasal-definitions---we develop highly effective filters for the identification of meaningful, missing phrase-entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough, lexical extraction technique, and expanding our knowledge of the defined English lexicon of phrases.
Submitted 28 July, 2015; v1 submitted 6 March, 2015;
originally announced March 2015.
-
Zipf's law holds for phrases, not words
Authors:
Jake Ryland Williams,
Paul R. Lessard,
Suma Desu,
Eric Clark,
James P. Bagrow,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
With Zipf's law being originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirically that Zipf's law for phrases extends over as many as nine orders of rank magnitude. In doing so, we develop a principled and scalable statistical mechanical method of random text partitioning, which opens up a rich frontier of rigorous text analysis via a rank ordering of mixed length phrases.
Submitted 4 March, 2015; v1 submitted 19 June, 2014;
originally announced June 2014.
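As a rough illustration of what "Zipf's law" means operationally, the sketch below builds a rank-frequency table and estimates the scaling exponent as the least-squares slope of log frequency against log rank. It is a generic check on a synthetic corpus, not the random text partitioning method the paper develops; the helper names `rank_frequency` and `zipf_exponent` and the synthetic tokens are assumptions made for this example.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, rank 1 being the most frequent token."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(pairs):
    """Least-squares slope of log(frequency) versus log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic corpus: token "w<r>" appears about 600 / r times, so an ideal
# Zipf distribution with exponent close to -1 is built in by construction.
tokens = []
for r in range(1, 51):
    tokens.extend([f"w{r}"] * (600 // r))

pairs = rank_frequency(tokens)
slope = zipf_exponent(pairs)
```

On real text, the same fit applied to words versus phrases is one simple way to see over how many orders of magnitude in rank the straight-line scaling persists.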
-
Human language reveals a universal positivity bias
Authors:
Peter Sheridan Dodds,
Eric M. Clark,
Suma Desu,
Morgan R. Frank,
Andrew J. Reagan,
Jake Ryland Williams,
Lewis Mitchell,
Kameron Decker Harris,
Isabel M. Kloumann,
James P. Bagrow,
Karine Megerdoomian,
Matthew T. McMahon,
Brian F. Tivnan,
Christopher M. Danforth
Abstract:
Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (1) the words of natural human language possess a universal positivity bias; (2) the estimated emotional content of words is consistent between languages under translation; and (3) this positivity bias is strongly independent of frequency of word usage. Alongside these general regularities, we describe inter-language variations in the emotional spectrum of languages which allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
Submitted 15 June, 2014;
originally announced June 2014.