-
OCNLI: Original Chinese Natural Language Inference
Authors:
Hai Hu,
Kyle Richardson,
Liang Xu,
Lu Li,
Sandra Kuebler,
Lawrence S. Moss
Abstract:
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world's languages. In this paper, we present the first large-scale NLI dataset (consisting of ~56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We follow closely the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (~12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese NLU. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.
Submitted 12 October, 2020;
originally announced October 2020.
-
MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity
Authors:
Hai Hu,
Qi Chen,
Kyle Richardson,
Atreyee Mukherjee,
Lawrence S. Moss,
Sandra Kuebler
Abstract:
We present a new logic-based inference engine for natural language inference (NLI) called MonaLog, which is based on natural logic and the monotonicity calculus. In contrast to existing logic-based approaches, our system is intentionally designed to be as lightweight as possible, and operates using a small set of well-known (surface-level) monotonicity facts about quantifiers, lexical items and token-level polarity information. Despite its simplicity, we find our approach to be competitive with other logic-based NLI models on the SICK benchmark. We also use MonaLog in combination with the current state-of-the-art model BERT in a variety of settings, including for compositional data augmentation. We show that MonaLog is capable of generating large amounts of high-quality training data for BERT, improving its accuracy on SICK.
Submitted 19 October, 2019;
originally announced October 2019.
-
UM-IU@LING at SemEval-2019 Task 6: Identifying Offensive Tweets Using BERT and SVMs
Authors:
Jian Zhu,
Zuoyu Tian,
Sandra Kübler
Abstract:
This paper describes the UM-IU@LING system for SemEval 2019 Task 6: OffensEval. We take a mixed approach to identify and categorize hate speech in social media. In subtask A, we fine-tuned a BERT-based classifier to detect abusive content in tweets, achieving a macro F1 score of 0.8136 on the test data, thus reaching the 3rd rank out of 103 submissions. In subtasks B and C, we used a linear SVM with selected character n-gram features. For subtask C, our system could identify the target of abuse with a macro F1 score of 0.5243, ranking it 27th out of 65 submissions.
Submitted 6 April, 2019;
originally announced April 2019.
-
UniMorph 2.0: Universal Morphology
Authors:
Christo Kirov,
Ryan Cotterell,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Patrick Xia,
Manaal Faruqui,
Sabrina J. Mielke,
Arya D. McCarthy,
Sandra Kübler,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland and is sponsored by the DARPA LORELEI program. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016.
Submitted 25 February, 2020; v1 submitted 25 October, 2018;
originally announced October 2018.
-
Detecting Syntactic Features of Translated Chinese
Authors:
Hai Hu,
Wen Li,
Sandra Kübler
Abstract:
We present a machine learning approach to distinguish texts translated to Chinese (by humans) from texts originally written in Chinese, with a focus on a wide range of syntactic features. Using Support Vector Machines (SVMs) as the classifier on a genre-balanced corpus in translation studies of Chinese, we find that constituent parse trees and dependency triples as features without lexical information perform very well on the task, with an F-measure above 90%, close to the results of lexical n-gram features, without the risk of learning topic information rather than translation features. Thus, we claim syntactic features alone can accurately distinguish translated from original Chinese. Translated Chinese exhibits an increased use of determiners, subject position pronouns, NP + 'de' as NP modifiers, and multiple NPs or VPs conjoined by Chinese-specific punctuation, among other structures. We also interpret the syntactic features with reference to previous translation studies in Chinese, particularly the usage of pronouns.
Submitted 23 April, 2018;
originally announced April 2018.
-
Nano-scale characterization of the formation of silver layers during electroless deposition on polymeric surfaces
Authors:
Aniruddha Dutta,
Biao Yuan,
Helge Heinrich,
Christopher N. Grabill,
Stephen M. Kuebler,
Aniket Bhattacharya
Abstract:
We report here a quantitative method of Transmission Electron Microscopy (TEM) to measure the shapes, sizes and volumes of nanoparticles which are responsible for their properties. Gold nanoparticles (Au NPs) acting as nucleating agents for the electroless deposition of silver NPs on SU-8 polymers were analyzed in this project. The atomic-number contrast (Z-contrast) imaging technique reveals the height and effective diameter of each Au NP and a volume distribution is obtained. Varying the reducing agents produced Au NPs of different sizes which were found both on the polymer surface and in some cases buried several nanometers below the surface. The morphology of Au NPs is an important factor for systems that use surface-bound nanoparticles as nucleation sites as in electroless metallization. Electrolessly deposited silver layers reduced by hydroquinone on SU-8 polymer are analyzed in this project.
Submitted 20 November, 2017;
originally announced November 2017.
-
Information Storage and Retrieval using Macromolecules as Storage Media
Authors:
M. Mansuripur,
P. K. Khulbe,
S. M. Kuebler,
J. W. Perry,
M. S. Giridhar,
J. Kevin Erwin,
Kibyung Seong,
Seth Marder,
N. Peyghambarian
Abstract:
To store information at extremely high-density and data-rate, we propose to adapt, integrate, and extend the techniques developed by chemists and molecular biologists for the purpose of manipulating biological and other macromolecules. In principle, volumetric densities in excess of 10^21 bits/cm^3 can be achieved when individual molecules having dimensions below a nanometer or so are used to encode the 0's and 1's of a binary string of data. In practice, however, given the limitations of electron-beam lithography, thin film deposition and patterning technologies, molecular manipulation in submicron dimensions, etc., we believe that volumetric storage densities on the order of 10^16 bits/cm^3 (i.e., petabytes per cubic centimeter) should be readily attainable, leaving plenty of room for future growth. The unique feature of the proposed new approach is its focus on the feasibility of storing bits of information in individual molecules, each only a few angstroms in size.
Submitted 26 August, 2017;
originally announced August 2017.
-
CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
Authors:
Ryan Cotterell,
Christo Kirov,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Patrick Xia,
Manaal Faruqui,
Sandra Kübler,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.
Submitted 4 July, 2017; v1 submitted 27 June, 2017;
originally announced June 2017.
-
Performing Stance Detection on Twitter Data using Computational Linguistics Techniques
Authors:
Gourav G. Shenoy,
Erika H. Dsouza,
Sandra Kübler
Abstract:
As humans, we can often detect from a person's utterances if he or she is in favor of or against a given target entity (topic, product, another person, etc.). But from the perspective of a computer, we need means to automatically deduce the stance of the tweeter, given just the tweet text. In this paper, we present our results of performing stance detection on Twitter data using a supervised approach. We begin by extracting bag-of-words features to perform classification using TiMBL, then try to optimize the features to improve stance detection accuracy, followed by extending the dataset with two sets of lexicons: arguing and MPQA subjectivity. Next, we explore the MALT parser and construct features using its dependency triples. Finally, we perform analysis using the scikit-learn Random Forest implementation.
Submitted 6 March, 2017;
originally announced March 2017.
-
VUV Spectroscopic Study of the D^1Π State of Molecular Deuterium
Authors:
G. D. Dickenson,
T. I. Ivanov,
W. Ubachs,
M. Roudjane,
N. de Oliveira,
D. Joyeux,
L. Nahon,
W.-Ü L. Tchang-Brillet,
M. Glass-Maujean,
H. Schmoranzer,
A. Knie,
S. Kübler,
A. Ehresmann
Abstract:
The D^1Π_u - X^1Σ_g^+ absorption system of molecular deuterium has been re-investigated using the VUV Fourier-Transform (FT) spectrometer at the DESIRS beamline of the synchrotron SOLEIL and photon-induced fluorescence spectrometry (PIFS) using the 10 m normal incidence monochromator at the synchrotron BESSY II. Using the FT spectrometer, absorption spectra in the range 72 - 82 nm were recorded in quasi-static gas at 100 K and in a free-flowing jet at a spectroscopic resolution of 0.50 and 0.20 cm^{-1}, respectively. The narrow Q-branch transitions, probing states of Π^- symmetry, were observed up to vibrational level v = 22. The states of Π^+ symmetry, known to be broadened due to predissociation and giving rise to asymmetric Beutler-Fano resonances, were studied up to v = 18. The 10 m normal incidence beamline setup at BESSY II was used to simultaneously record absorption, dissociation, ionization and fluorescence decay channels from which information on the line intensities, predissociated widths, and Fano q-parameters were extracted. R-branch transitions were observed up to v = 23 for J = 1-3 as well as several transitions for J = 4 and 5 up to v = 22 and 18, respectively. The Q-branch transitions are found to weakly predissociate and were observed from v = 8 to the final vibrational level of the state v = 23. The spectroscopic study is supported by two theoretical frameworks. Results on the Π^- symmetry states are compared to ab initio multi-channel quantum defect theory (MQDT) calculations, demonstrating that these calculations are accurate to within 0.5 cm^{-1}.
Submitted 3 January, 2013;
originally announced January 2013.
-
Key Factors for Information Dissemination on Communicating Products and Fixed Databases
Authors:
Sylvain Kubler,
William Derigent,
André Thomas,
Eric Rondeau
Abstract:
Intelligent products carrying their own information are increasingly common. In recent years, some authors have argued for the use of such products in the supply chain management industry. Indeed, a multitude of informational vectors exist in such environments, such as fixed databases or manufactured products, on which a significant proportion of data can be embedded. By considering distributed database systems, we can allocate to the product specific data fragments that are useful for managing its own evolution. The paper aims to analyze supply chain performance according to different strategies of information distribution. Thus, different distribution patterns between informational vectors are studied. The purpose is to determine the key factors that improve information distribution performance in terms of time properties.
Submitted 23 June, 2011;
originally announced June 2011.