-
The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings
Authors:
Gili Goldin,
Nick Howell,
Noam Ordan,
Ella Rabinovich,
Shuly Wintner
Abstract:
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of t…
▽ More
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Shared Lexical Items as Triggers of Code Switching
Authors:
Shuly Wintner,
Safaa Shehadi,
Yuli Zeira,
Doreen Osmelak,
Yuval Nov
Abstract:
Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the Triggering Hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis,…
▽ More
Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the Triggering Hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis, based on five large datasets in three language pairs, reflecting both spoken and written bilingual interactions. Our results show that words that are assumed to reside in a mental lexicon shared by both languages indeed trigger code-switching; that the tendency to switch depends on the distance of the trigger from the switch point; and on whether the trigger precedes or succeeds the switch; but not on the etymology of the trigger words. We thus provide strong, robust, evidence-based confirmation to several hypotheses on the relationships between lexical triggers and code-switching.
△ Less
Submitted 29 August, 2023;
originally announced August 2023.
-
Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching
Authors:
Alissa Ostapenko,
Shuly Wintner,
Melinda Fricke,
Yulia Tsvetkov
Abstract:
Natural language processing (NLP) models trained on people-generated data can be unreliable because, without any constraints, they can learn from spurious correlations that are not relevant to the task. We hypothesize that enriching models with speaker information in a controlled, educated way can guide them to pick up on relevant inductive biases. For the speaker-driven task of predicting code-sw…
▽ More
Natural language processing (NLP) models trained on people-generated data can be unreliable because, without any constraints, they can learn from spurious correlations that are not relevant to the task. We hypothesize that enriching models with speaker information in a controlled, educated way can guide them to pick up on relevant inductive biases. For the speaker-driven task of predicting code-switching points in English--Spanish bilingual dialogues, we show that adding sociolinguistically-grounded speaker features as prepended prompts significantly improves accuracy. We find that by adding influential phrases to the input, speaker-informed models learn useful and explainable linguistic information. To our knowledge, we are the first to incorporate speaker characteristics in a neural model for code-switching, and more generally, take a step towards develo** transparent, personalized models that use speaker information in a controlled way.
△ Less
Submitted 16 March, 2022;
originally announced March 2022.
-
Machine Translation into Low-resource Language Varieties
Authors:
Sachin Kumar,
Antonios Anastasopoulos,
Shuly Wintner,
Yulia Tsvetkov
Abstract:
State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a g…
▽ More
State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source--variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English--Russian MT system to generate Ukrainian and Belarusian, an English--Norwegian Bokmål system to generate Nynorsk, and an English--Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.
△ Less
Submitted 15 October, 2021; v1 submitted 12 June, 2021;
originally announced June 2021.
-
Topics to Avoid: Demoting Latent Confounds in Text Classification
Authors:
Sachin Kumar,
Shuly Wintner,
Noah A. Smith,
Yulia Tsvetkov
Abstract:
Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topi…
▽ More
Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
△ Less
Submitted 14 June, 2021; v1 submitted 1 September, 2019;
originally announced September 2019.
-
Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate Political Strategies
Authors:
Anjalie Field,
Doron Kliger,
Shuly Wintner,
Jennifer Pan,
Dan Jurafsky,
Yulia Tsvetkov
Abstract:
Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and "fake news'". Here, we draw on two concepts from the political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the R…
▽ More
Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and "fake news'". Here, we draw on two concepts from the political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.
△ Less
Submitted 29 October, 2018; v1 submitted 28 August, 2018;
originally announced August 2018.
-
Native Language Cognate Effects on Second Language Lexical Choice
Authors:
Ella Rabinovich,
Yulia Tsvetkov,
Shuly Wintner
Abstract:
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are abl…
▽ More
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages. We quantitatively analyze non-native lexical choice, highlighting cognate facilitation as one of the important phenomena sha** the language of non-native speakers.
△ Less
Submitted 24 May, 2018;
originally announced May 2018.
-
The UN Parallel Corpus Annotated for Translation Direction
Authors:
Elad Tolochinsky,
Ohad Mosafi,
Ella Rabinovich,
Shuly Wintner
Abstract:
This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language-pairs annotated for translation direction, and then classify the data by using various feature extraction methods. We compare the different methods a…
▽ More
This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language-pairs annotated for translation direction, and then classify the data by using various feature extraction methods. We compare the different methods as well as the ability to distinguish between translated and original texts in the different languages. The annotated corpus is publicly available.
△ Less
Submitted 19 May, 2018;
originally announced May 2018.
-
Found in Translation: Reconstructing Phylogenetic Language Trees from Translations
Authors:
Ella Rabinovich,
Noam Ordan,
Shuly Wintner
Abstract:
Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source…
▽ More
Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source language by looking only at the translation. Specifically, we automatically reconstruct phylogenetic language trees from monolingual texts (translated from several source languages). The signal of the source language is so powerful that it is retained even after two phases of translation. This strongly indicates that source language interference is the most dominant characteristic of translated texts, overshadowing the more subtle signals of universal properties of translation.
△ Less
Submitted 24 April, 2017;
originally announced April 2017.
-
Personalized Machine Translation: Preserving Original Author Traits
Authors:
Ella Rabinovich,
Shachar Mirkin,
Raj Nath Patel,
Lucia Specia,
Shuly Wintner
Abstract:
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author's gender has a powerful, clear signal in originals texts, but this signal is obfuscated in hum…
▽ More
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author's gender has a powerful, clear signal in originals texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.
△ Less
Submitted 12 January, 2017; v1 submitted 18 October, 2016;
originally announced October 2016.
-
Unsupervised Identification of Translationese
Authors:
Ella Rabinovich,
Shuly Wintner
Abstract:
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained…
▽ More
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.
△ Less
Submitted 11 September, 2016;
originally announced September 2016.
-
On the Similarities Between Native, Non-native and Translated Texts
Authors:
Ella Rabinovich,
Sergiu Nisioi,
Noam Ordan,
Shuly Wintner
Abstract:
We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable…
▽ More
We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable; (2) non-native language and translations are closer to each other than each of them is to native language; and (3) some of these characteristics depend on the source or native language, while others do not, reflecting, perhaps, unified principles that similarly affect translations and non-native language.
△ Less
Submitted 11 September, 2016;
originally announced September 2016.
-
A Parallel Corpus of Translationese
Authors:
Ella Rabinovich,
Shuly Wintner,
Ofek Luis Lewinsohn
Abstract:
We describe a set of bilingual English--French and English--German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research of translationese and its applications to (human and machine) tra…
▽ More
We describe a set of bilingual English--French and English--German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research of translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that enjoys a growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
△ Less
Submitted 6 March, 2016; v1 submitted 11 September, 2015;
originally announced September 2015.
-
Amalia -- A Unified Platform for Parsing and Generation
Authors:
Shuly Wintner,
Evgeniy Gabrilovich,
Nissim Francez
Abstract:
Contemporary linguistic theories (in particular, HPSG) are declarative in nature: they specify constraints on permissible structures, not how such structures are to be computed. Grammars designed under such theories are, therefore, suitable for both parsing and generation. However, practical implementations of such theories don't usually support bidirectional processing of grammars. We present a…
▽ More
Contemporary linguistic theories (in particular, HPSG) are declarative in nature: they specify constraints on permissible structures, not how such structures are to be computed. Grammars designed under such theories are, therefore, suitable for both parsing and generation. However, practical implementations of such theories don't usually support bidirectional processing of grammars. We present a grammar development system that includes a compiler of grammars (for parsing and generation) to abstract machine instructions, and an interpreter for the abstract machine language. The generation compiler inverts input grammars (designed for parsing) to a form more suitable for generation. The compiled grammars are then executed by the interpreter using one control strategy, regardless of whether the grammar is the original or the inverted version. We thus obtain a unified, efficient platform for develo** reversible grammars.
△ Less
Submitted 24 September, 1997;
originally announced September 1997.
-
An Abstract Machine for Unification Grammars
Authors:
Shuly Wintner
Abstract:
This work describes the design and implementation of an abstract machine, Amalia, for the linguistic formalism ALE, which is based on typed feature structures. This formalism is one of the most widely accepted in computational linguistics and has been used for designing grammars in various linguistic theories, most notably HPSG. Amalia is composed of data structures and a set of instructions, au…
▽ More
This work describes the design and implementation of an abstract machine, Amalia, for the linguistic formalism ALE, which is based on typed feature structures. This formalism is one of the most widely accepted in computational linguistics and has been used for designing grammars in various linguistic theories, most notably HPSG. Amalia is composed of data structures and a set of instructions, augmented by a compiler from the grammatical formalism to the abstract instructions, and a (portable) interpreter of the abstract instructions. The effect of each instruction is defined using a low-level language that can be executed on ordinary hardware.
The advantages of the abstract machine approach are twofold. From a theoretical point of view, the abstract machine gives a well-defined operational semantics to the grammatical formalism. This ensures that grammars specified using our system are endowed with well defined meaning. It enables, for example, to formally verify the correctness of a compiler for HPSG, given an independent definition. From a practical point of view, Amalia is the first system that employs a direct compilation scheme for unification grammars that are based on typed feature structures. The use of amalia results in a much improved performance over existing systems.
In order to test the machine on a realistic application, we have developed a small-scale, HPSG-based grammar for a fragment of the Hebrew language, using Amalia as the development platform. This is the first application of HPSG to a Semitic language.
△ Less
Submitted 23 September, 1997;
originally announced September 1997.
-
Off-line Parsability and the Well-foundedness of Subsumption
Authors:
Shuly Wintner,
Nissim Francez
Abstract:
Typed feature structures are used extensively for the specification of linguistic information in many formalisms. The subsumption relation orders TFSs by their information content. We prove that subsumption of acyclic TFSs is well-founded, whereas in the presence of cycles general TFS subsumption is not well-founded. We show an application of this result for parsing, where the well-foundedness o…
▽ More
Typed feature structures are used extensively for the specification of linguistic information in many formalisms. The subsumption relation orders TFSs by their information content. We prove that subsumption of acyclic TFSs is well-founded, whereas in the presence of cycles general TFS subsumption is not well-founded. We show an application of this result for parsing, where the well-foundedness of subsumption is used to guarantee termination for grammars that are off-line parsable. We define a new version of off-line parsability that is less strict than the existing one; thus termination is guaranteed for parsing with a larger set of grammars.
△ Less
Submitted 23 September, 1997;
originally announced September 1997.
-
Parsing with Typed Feature Structures
Authors:
Shuly Wintner,
Nissim Francez
Abstract:
In this paper we provide for parsing with respect to grammars expressed in a general TFS-based formalism, a restriction of ALE. Our motivation being the design of an abstract (WAM-like) machine for the formalism, we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine.
We emphasize the notion of…
▽ More
In this paper we provide for parsing with respect to grammars expressed in a general TFS-based formalism, a restriction of ALE. Our motivation being the design of an abstract (WAM-like) machine for the formalism, we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine.
We emphasize the notion of abstract typed feature structures (AFSs) that encode the essential information of TFSs and define unification over AFSs rather than over TFSs. We then introduce an explicit construct of multi-rooted feature structures (MRSs) that naturally extend TFSs and use them to represent phrasal signs as well as grammar rules. We also employ abstractions of MRSs and give the mathematical foundations needed for manipulating them. We formally define grammars and the languages they generate, and then describe a model for computation that corresponds to bottom-up chart parsing: grammars written in the TFS-based formalism are executed by the parser. We show that the computation is correct with respect to the independent definition. Finally, we discuss the class of grammars for which computations terminate and prove that termination can be guaranteed for off-line parsable grammars.
△ Less
Submitted 31 January, 1996;
originally announced January 1996.
-
Parsing with Typed Feature Structures
Authors:
Shuly Wintner,
Nissim Francez
Abstract:
In this paper we provide for parsing with respect to grammars expressed in a general TFS-based formalism, a restriction of ALE. Our motivation being the design of an abstract (WAM-like) machine for the formalism, we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine.
We emphasize the notion of…
▽ More
In this paper we provide for parsing with respect to grammars expressed in a general TFS-based formalism, a restriction of ALE. Our motivation being the design of an abstract (WAM-like) machine for the formalism, we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine.
We emphasize the notion of abstract typed feature structures (AFSs) that encode the essential information of TFSs and define unification over AFSs rather than over TFSs. We then introduce an explicit construct of multi-rooted feature structures (MRSs) that naturally extend TFSs and use them to represent phrasal signs as well as grammar rules. We also employ abstractions of MRSs and give the mathematical foundations needed for manipulating them. We then present a simple bottom-up chart parser as a model for computation: grammars written in the TFS-based formalism are executed by the parser. Finally, we show that the parser is correct.
△ Less
Submitted 31 January, 1996;
originally announced January 1996.
-
Abstract Machine for Typed Feature Structures
Authors:
Shuly Wintner,
Nissim Francez
Abstract:
This paper describes an abstract machine for linguistic formalisms that are based on typed feature structures, such as HPSG. The core design of the abstract machine is given in detail, including the compilation process from a high-level language to the abstract machine language and the implementation of the abstract instructions. The machine's engine supports the unification of typed, possibly c…
▽ More
This paper describes an abstract machine for linguistic formalisms that are based on typed feature structures, such as HPSG. The core design of the abstract machine is given in detail, including the compilation process from a high-level language to the abstract machine language and the implementation of the abstract instructions. The machine's engine supports the unification of typed, possibly cyclic, feature structures. A separate module deals with control structures and instructions to accommodate parsing for phrase structure grammars. We treat the linguistic formalism as a high-level declarative programming language, applying methods that were proved useful in computer science to the study of natural languages: a grammar specified using the formalism is endowed with an operational semantics.
△ Less
Submitted 13 April, 1995;
originally announced April 1995.
-
Abstract Machine for Typed Feature Structures
Authors:
Shuly Wintner,
Nissim Francez
Abstract:
This paper describes a first step towards the definition of an abstract machine for linguistic formalisms that are based on typed feature structures, such as HPSG. The core design of the abstract machine is given in detail, including the compilation process from a high-level specification language to the abstract machine language and the implementation of the abstract instructions. We thus apply…
▽ More
This paper describes a first step towards the definition of an abstract machine for linguistic formalisms that are based on typed feature structures, such as HPSG. The core design of the abstract machine is given in detail, including the compilation process from a high-level specification language to the abstract machine language and the implementation of the abstract instructions. We thus apply methods that were proved useful in computer science to the study of natural languages: a grammar specified using the formalism is endowed with an operational semantics. Currently, our machine supports the unification of simple feature structures, unification of sequences of such structures, cyclic structures and disjunction.
△ Less
Submitted 17 July, 1994;
originally announced July 1994.