-
A Survey of Transformer Enabled Time Series Synthesis
Authors:
Alexander Sommers,
Logan Cummins,
Sudip Mittal,
Shahram Rahimi,
Maria Seale,
Joseph Jaboure,
Thomas Arnold
Abstract:
Generative AI has received much attention in the image and language domains, with the transformer neural network continuing to dominate the state of the art. Application of these models to time series generation is less explored, however, and is of great utility to machine learning, privacy preservation, and explainability research. The present survey identifies this gap at the intersection of the transformer, generative AI, and time series data, and reviews works in this sparsely populated subdomain. The reviewed works show great variety in approach, and have not yet converged on a conclusive answer to the problems the domain poses. GANs, diffusion models, state space models, and autoencoders were all encountered alongside or surrounding the transformers which originally motivated the survey. While too open a domain to offer conclusive insights, the works surveyed are quite suggestive, and several recommendations for best practice, and suggestions of valuable future work, are provided.
Submitted 4 June, 2024;
originally announced June 2024.
-
SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection
Authors:
Yuxia Wang,
Jonibek Mansurov,
Petar Ivanov,
Jinyan Su,
Artem Shelmanov,
Akim Tsvigun,
Osama Mohammed Afzal,
Tarek Mahmoud,
Giovanni Puccetti,
Thomas Arnold,
Chenxi Whitehouse,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych,
Preslav Nakov
Abstract:
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
Submitted 22 April, 2024;
originally announced April 2024.
-
Towards Enabling FAIR Dataspaces Using Large Language Models
Authors:
Benedikt T. Arnold,
Johannes Theissen-Lipp,
Diego Collarana,
Christoph Lange,
Sandra Geisler,
Edward Curry,
Stefan Decker
Abstract:
Dataspaces have recently gained adoption across various sectors, including traditionally less digitized domains such as culture. Leveraging Semantic Web technologies helps to make dataspaces FAIR, but their complexity poses a significant challenge to the adoption of dataspaces and increases their cost. The advent of Large Language Models (LLMs) raises the question of how these models can support the adoption of FAIR dataspaces. In this work, we demonstrate the potential of LLMs in dataspaces with a concrete example. We also derive a research agenda for exploring this emerging field.
Submitted 18 March, 2024;
originally announced March 2024.
-
M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Authors:
Yuxia Wang,
Jonibek Mansurov,
Petar Ivanov,
Jinyan Su,
Artem Shelmanov,
Akim Tsvigun,
Osama Mohammed Afzal,
Tarek Mahmoud,
Giovanni Puccetti,
Thomas Arnold,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych,
Preslav Nakov
Abstract:
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain, and multi-generator corpus of MGTs: M4GT-Bench. The benchmark comprises three tasks: (1) monolingual and multilingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.
Submitted 27 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Authors:
Yuxia Wang,
Jonibek Mansurov,
Petar Ivanov,
Jinyan Su,
Artem Shelmanov,
Akim Tsvigun,
Chenxi Whitehouse,
Osama Mohammed Afzal,
Tarek Mahmoud,
Toru Sasaki,
Thomas Arnold,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych,
Preslav Nakov
Abstract:
Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4.
Submitted 9 March, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
DP-Rewrite: Towards Reproducibility and Transparency in Differentially Private Text Rewriting
Authors:
Timour Igamberdiev,
Thomas Arnold,
Ivan Habernal
Abstract:
Text rewriting with differential privacy (DP) provides concrete theoretical guarantees for protecting the privacy of individuals in textual documents. In practice, existing systems may lack the means to validate their privacy-preserving claims, leading to problems of transparency and reproducibility. We introduce DP-Rewrite, an open-source framework for differentially private text rewriting which aims to solve these problems by being modular, extensible, and highly customizable. Our system incorporates a variety of downstream datasets, models, pre-training procedures, and evaluation metrics to provide a flexible way to lead and validate private text rewriting research. To demonstrate our software in practice, we provide a set of experiments as a case study on the ADePT DP text rewriting system, detecting a privacy leak in its pre-training approach. Our system is publicly available, and we hope that it will help the community to make DP text rewriting research more accessible and transparent.
Submitted 22 August, 2022;
originally announced August 2022.
-
Synthetic-to-Real Domain Adaptation using Contrastive Unpaired Translation
Authors:
Benedikt T. Imbusch,
Max Schwarz,
Sven Behnke
Abstract:
The usefulness of deep learning models in robotics is largely dependent on the availability of training data. Manual annotation of training data is often infeasible. Synthetic data is a viable alternative, but suffers from domain gap. We propose a multi-step method to obtain training data without manual annotation effort: From 3D object meshes, we generate images using a modern synthesis pipeline. We utilize a state-of-the-art image-to-image translation method to adapt the synthetic images to the real domain, minimizing the domain gap in a learned manner. The translation network is trained from unpaired images, i.e. just requires an un-annotated collection of real images. The generated and refined images can then be used to train deep learning models for a particular task. We also propose and evaluate extensions to the translation method that further increase performance, such as patch-based training, which shortens training time and increases global consistency. We evaluate our method and demonstrate its effectiveness on two robotic datasets. We finally give insight into the learned refinement operations.
Submitted 28 June, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Visualizing a Large Spatiotemporal Collection of Historic Photography with a Generous Interface
Authors:
Taylor Arnold,
Nathaniel Ayers,
Justin Madron,
Robert Nelson,
Lauren Tilton
Abstract:
Museums, libraries, and other cultural institutions continue to prioritize and build web-based visualization systems that increase access to and discovery of digitized archives. Prominent examples illustrate impressive visualizations of a particular feature of a collection, such as interactive maps showing geographic spread or timelines capturing the temporal aspects of collections. By way of a case study, this paper presents a new web-based visualization system that allows users to simultaneously explore a large collection of images along several different dimensions (spatial, temporal, visual, textual, and additional metadata fields including the photographer name), guided by the concept of generous interfaces. The case study is a complete redesign of a previously released digital, public humanities project called Photogrammar (2014). The paper highlights the redesign's interactive visualizations, which are now possible thanks to the affordances of newly available software. All of the code is open-source in order to allow for re-use of the codebase for other collections with a similar structure.
Submitted 4 September, 2020;
originally announced September 2020.
-
When Exceptions are the Norm: Exploring the Role of Consent in HRI
Authors:
Vasanth Sarathy,
Thomas Arnold,
Matthias Scheutz
Abstract:
HRI researchers have made major strides in developing robotic architectures that are capable of reading a limited set of social cues and producing behaviors that enhance their likeability and feeling of comfort amongst humans. However, the cues in these models are fairly direct and the interactions largely dyadic. To capture the normative qualities of interaction more robustly, we propose consent as a distinct, critical area for HRI research. Convening important insights in existing HRI work around topics like touch, proxemics, gaze, and moral norms, the notion of consent reveals key expectations that can shape how a robot acts in social space. By sorting various kinds of consent through social and legal doctrine, we delineate empirical and technical questions to meet consent challenges faced in major application domains and robotic roles. Attention to consent could show, for example, how extraordinary, norm-violating actions can be justified by agents and accepted by those around them. We argue that operationalizing ideas from legal scholarship can better guide how robotic systems might cultivate and sustain proper forms of consent.
Submitted 4 February, 2019;
originally announced February 2019.
-
Quasi-Dilemmas for Artificial Moral Agents
Authors:
Daniel Kasenberg,
Vasanth Sarathy,
Thomas Arnold,
Matthias Scheutz,
Tom Williams
Abstract:
In this paper we describe moral quasi-dilemmas (MQDs): situations similar to moral dilemmas, but in which an agent is unsure whether exploring the plan space or the world may reveal a course of action that satisfies all moral requirements. We argue that artificial moral agents (AMAs) should be built to handle MQDs (in particular, by exploring the plan space rather than immediately accepting the inevitability of the moral dilemma), and that MQDs may be useful for evaluating AMA architectures.
Submitted 6 July, 2018;
originally announced July 2018.
-
Cross-Discourse and Multilingual Exploration of Textual Corpora with the DualNeighbors Algorithm
Authors:
Taylor Arnold,
Lauren Tilton
Abstract:
Word choice is dependent on the cultural context of writers and their subjects. Different words are used to describe similar actions, objects, and features based on factors such as class, race, gender, geography and political affinity. Exploratory techniques based on locating and counting words may, therefore, lead to conclusions that reinforce culturally inflected boundaries. We offer a new method, the DualNeighbors algorithm, for linking thematically similar documents both within and across discursive and linguistic barriers to reveal cross-cultural connections. Qualitative and quantitative evaluations of this technique are shown as applied to two cultural datasets of interest to researchers across the humanities and social sciences. An open-source implementation of the DualNeighbors algorithm is provided to assist in its application.
Submitted 28 June, 2018;
originally announced June 2018.
-
Predicting CEFRL levels in learner English on the basis of metrics and full texts
Authors:
Taylor Arnold,
Nicolas Ballier,
Thomas Gaillat,
Paula Lissòn
Abstract:
This paper analyses the contribution of language metrics and, potentially, of linguistic structures, to classify French learners of English according to levels of the Common European Framework of Reference for Languages (CEFRL). The purpose is to build a model for the prediction of learner levels as a function of language complexity features. We used the EFCAMDAT corpus, a database of one million written assignments by learners. After applying language complexity metrics on the texts, we built a representation matching the language metrics of the texts to their assigned CEFRL levels. Lexical and syntactic metrics were computed with LCA, LSA, and koRpus. Several supervised learning models were built by using Gradient Boosted Trees and Keras Neural Network methods and by contrasting pairs of CEFRL levels. Results show that it is possible to implement pairwise distinctions, especially for levels ranging from A1 to B1 (A1=>A2: 0.916 AUC and A2=>B1: 0.904 AUC). Model explanation reveals significant linguistic features for the predictiveness in the corpus. Word tokens and word types appear to play a significant role in determining levels. This shows that levels are highly dependent on specific semantic profiles.
Submitted 28 June, 2018;
originally announced June 2018.
-
A Tidy Data Model for Natural Language Processing using cleanNLP
Authors:
Taylor Arnold
Abstract:
The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
Submitted 3 May, 2018; v1 submitted 26 March, 2017;
originally announced March 2017.
-
Enabling Basic Normative HRI in a Cognitive Robotic Architecture
Authors:
Vasanth Sarathy,
Jason R. Wilson,
Thomas Arnold,
Matthias Scheutz
Abstract:
Collaborative human activities are grounded in social and moral norms, which humans consciously and subconsciously use to guide and constrain their decision-making and behavior, thereby strengthening their interactions and preventing emotional and physical harm. This type of norm-based processing is also critical for robots in many human-robot interaction scenarios (e.g., when helping elderly and disabled persons in assisted living facilities, or assisting humans in assembly tasks in factories or even the space station). In this position paper, we will briefly describe how several components in an integrated cognitive architecture can be used to implement processes that are required for normative human-robot interactions, especially in collaborative tasks where actions and situations could potentially be perceived as threatening and thus need a change in course of action to mitigate the perceived threats.
Submitted 11 February, 2016;
originally announced February 2016.
-
Sparse Density Representations for Simultaneous Inference on Large Spatial Datasets
Authors:
Taylor Arnold
Abstract:
Large spatial datasets often represent a number of spatial point processes generated by distinct entities or classes of events. When crossed with covariates, such as discrete time buckets, this can quickly result in a data set with millions of individual density estimates. Applications that require simultaneous access to a substantial subset of these estimates become resource constrained when densities are stored in complex and incompatible formats. We present a method for representing spatial densities along the nodes of sparsely populated trees. Fast algorithms are provided for performing set operations and queries on the resulting compact tree structures. The speed and simplicity of the approach are demonstrated on both real and simulated spatial data.
Submitted 2 October, 2015;
originally announced October 2015.
-
iotools: High-Performance I/O Tools for R
Authors:
Taylor Arnold,
Michael Kane,
Simon Urbanek
Abstract:
The iotools package provides a set of tools for processing Input/Output (I/O) intensive datasets in R (R Core Team, 2014). Efficient parsing methods are included which minimize copying and avoid the use of intermediate string representations whenever possible. Functions for applying chunk-wise operations allow for computing on streaming input as well as arbitrarily large files. We present a set of example use cases for iotools, as well as extensive benchmarks against comparable functions provided in both core R and other contributed packages.
Submitted 7 April, 2016; v1 submitted 30 September, 2015;
originally announced October 2015.
-
An Entropy Maximizing Geohash for Distributed Spatiotemporal Database Indexing
Authors:
Taylor Arnold
Abstract:
We present a modification of the standard geohash algorithm based on maximum entropy encoding in which the data volume is approximately constant for a given hash prefix length. Distributed spatiotemporal databases, which typically require interleaving spatial and temporal elements into a single key, reap large benefits from a balanced geohash by creating a consistent ratio between spatial and temporal precision even across areas of varying data density. This property is also useful for indexing purely spatial datasets, where the load distribution of large range scans is an important aspect of query performance. We apply our algorithm to data generated proportional to population as given by census block population counts provided from the US Census Bureau.
Submitted 16 June, 2015;
originally announced June 2015.
-
Efficient Implementations of the Generalized Lasso Dual Path Algorithm
Authors:
Taylor Arnold,
Ryan Tibshirani
Abstract:
We consider efficient implementations of the generalized lasso dual path algorithm of Tibshirani and Taylor (2011). We first describe a generic approach that covers any penalty matrix D and any (full column rank) matrix X of predictor variables. We then describe fast implementations for the special cases of trend filtering problems, fused lasso problems, and sparse fused lasso problems, both with X=I and a general matrix X. These specialized implementations offer a considerable improvement over the generic implementation, both in terms of numerical stability and efficiency of the solution path computation. These algorithms are all available for use in the genlasso R package, which can be found in the CRAN repository.
Submitted 3 November, 2014; v1 submitted 13 May, 2014;
originally announced May 2014.