-
That's Optional: A Contemporary Exploration of "that" Omission in English Subordinate Clauses
Authors:
Ella Rabinovich
Abstract:
The Uniform Information Density (UID) hypothesis posits that speakers optimize the communicative properties of their utterances by avoiding spikes in information, thereby maintaining a relatively uniform information profile over time. This paper investigates the impact of UID principles on syntactic reduction, specifically focusing on the optional omission of the connector "that" in English subord…
▽ More
The Uniform Information Density (UID) hypothesis posits that speakers optimize the communicative properties of their utterances by avoiding spikes in information, thereby maintaining a relatively uniform information profile over time. This paper investigates the impact of UID principles on syntactic reduction, specifically focusing on the optional omission of the connector "that" in English subordinate clauses. Building upon previous research, we extend our investigation to a larger corpus of written English, utilize contemporary large language models (LLMs) and extend the information-uniformity principles by the notion of entropy, to estimate the UID manifestations in the usecase of syntactic reduction choices.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings
Authors:
Gili Goldin,
Nick Howell,
Noam Ordan,
Ella Rabinovich,
Shuly Wintner
Abstract:
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of t…
▽ More
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Predicting Question-Answering Performance of Large Language Models through Semantic Consistency
Authors:
Ella Rabinovich,
Samuel Ackerman,
Orna Raz,
Eitan Farchi,
Ateret Anaby-Tavor
Abstract:
Semantic consistency of a language model is broadly defined as the model's ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the da…
▽ More
Semantic consistency of a language model is broadly defined as the model's ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the dataset to the community.
We further combine the semantic consistency metric with additional measurements suggested in prior work as correlating with LLM QA accuracy, for building and evaluating a framework for factual QA reference-less performance prediction -- predicting the likelihood of a language model to accurately answer a question. Evaluating the framework on five contemporary LLMs, we demonstrate encouraging, significantly outperforming baselines, results.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
Reliable and Interpretable Drift Detection in Streams of Short Texts
Authors:
Ella Rabinovich,
Matan Vetzler,
Samuel Ackerman,
Ateret Anaby-Tavor
Abstract:
Data drift is the change in model input data that is one of the key factors leading to machine learning models performance degradation over time. Monitoring drift helps detecting these issues and preventing their harmful consequences. Meaningful drift interpretation is a fundamental step towards effective re-training of the model. In this study we propose an end-to-end framework for reliable model…
▽ More
Data drift is the change in model input data that is one of the key factors leading to machine learning models performance degradation over time. Monitoring drift helps detecting these issues and preventing their harmful consequences. Meaningful drift interpretation is a fundamental step towards effective re-training of the model. In this study we propose an end-to-end framework for reliable model-agnostic change-point detection and interpretation in large task-oriented dialog systems, proven effective in multiple customer deployments. We evaluate our approach and demonstrate its benefits with a novel variant of intent classification training dataset, simulating customer requests to a dialog system. We make the data publicly available.
△ Less
Submitted 28 May, 2023;
originally announced May 2023.
-
Exploration of the Usage of Color Terms by Color-blind Participants in Online Discussion Platforms
Authors:
Ella Rabinovich,
Boaz Carmeli
Abstract:
Prominent questions about the role of sensory vs. linguistic input in the way we acquire and use language have been extensively studied in the psycholinguistic literature. However, the relative effect of various factors in a person's overall experience on their linguistic system remains unclear. We study this question by making a step forward towards a better understanding of the conceptual percep…
▽ More
Prominent questions about the role of sensory vs. linguistic input in the way we acquire and use language have been extensively studied in the psycholinguistic literature. However, the relative effect of various factors in a person's overall experience on their linguistic system remains unclear. We study this question by making a step forward towards a better understanding of the conceptual perception of colors by color-blind individuals, as reflected in their spontaneous linguistic productions. Using a novel and carefully curated dataset, we show that red-green color-blind speakers use the "red" and "green" color terms in less predictable contexts, and in linguistic environments evoking mental image to a lower extent, when compared to their normal-sighted counterparts. These findings shed some new and interesting light on the role of sensory experience on our linguistic system.
△ Less
Submitted 30 October, 2022; v1 submitted 21 October, 2022;
originally announced October 2022.
-
High-quality Conversational Systems
Authors:
Samuel Ackerman,
Ateret Anaby-Tavor,
Eitan Farchi,
Esther Goldbraich,
George Kour,
Ella Rabinovich,
Orna Raz,
Saritha Route,
Marcel Zalmanovici,
Naama Zwerdling
Abstract:
Conversational systems or chatbots are an example of AI-Infused Applications (AIIA). Chatbots are especially important as they are often the first interaction of clients with a business and are the entry point of a business into the AI (Artificial Intelligence) world. The quality of the chatbot is, therefore, key. However, as is the case in general with AIIAs, it is especially challenging to asses…
▽ More
Conversational systems or chatbots are an example of AI-Infused Applications (AIIA). Chatbots are especially important as they are often the first interaction of clients with a business and are the entry point of a business into the AI (Artificial Intelligence) world. The quality of the chatbot is, therefore, key. However, as is the case in general with AIIAs, it is especially challenging to assess and control the quality of chatbot systems. Beyond the inherent statistical nature of these systems, where occasional failure is acceptable, we identify two major challenges. The first is to release an initial system that is of sufficient quality such that humans will interact with it. The second is to maintain the quality, enhance its capabilities, improve it and make necessary adjustments based on changing user requests or drift. These challenges exist because it is impossible to predict the real distribution of user requests and the natural language they will use to express these requests. Moreover, any empirical distribution of requests is likely to change over time. This may be due to periodicity, changing usage, and drift of topics.
We provide a methodology and set of technologies to address these challenges and to provide automated assistance through a human-in-the-loop approach. We notice that it is crucial to connect between the different phases in the lifecycle of the chatbot development and to make sure it provides its expected business value. For example, that it frees human agents to deal with tasks other than answering human users. Our methodology and technologies apply during chatbot training in the pre-production phase, through to chatbot usage in the field in the post-production phase. They implement the `test first' paradigm by assisting in agile design, and support continuous integration through actionable insights.
△ Less
Submitted 28 April, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Gaining Insights into Unrecognized User Utterances in Task-Oriented Dialog Systems
Authors:
Ella Rabinovich,
Matan Vetzler,
David Boaz,
Vineet Kumar,
Gaurav Pandey,
Ateret Anaby-Tavor
Abstract:
The rapidly growing market demand for automatic dialogue agents capable of goal-oriented behavior has caused many tech-industry leaders to invest considerable efforts into task-oriented dialog systems. The success of these systems is highly dependent on the accuracy of their intent identification -- the process of deducing the goal or meaning of the user's request and map** it to one of the know…
▽ More
The rapidly growing market demand for automatic dialogue agents capable of goal-oriented behavior has caused many tech-industry leaders to invest considerable efforts into task-oriented dialog systems. The success of these systems is highly dependent on the accuracy of their intent identification -- the process of deducing the goal or meaning of the user's request and map** it to one of the known intents for further processing. Gaining insights into unrecognized utterances -- user requests the systems fail to attribute to a known intent -- is therefore a key process in continuous improvement of goal-oriented dialog systems.
We present an end-to-end pipeline for processing unrecognized user utterances, deployed in a real-world, commercial task-oriented dialog system, including a specifically-tailored clustering algorithm, a novel approach to cluster representative extraction, and cluster naming. We evaluated the proposed components, demonstrating their benefits in the analysis of unrecognized user requests.
△ Less
Submitted 24 October, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
We've had this conversation before: A Novel Approach to Measuring Dialog Similarity
Authors:
Ofer Lavi,
Ella Rabinovich,
Segev Shlomov,
David Boaz,
Inbal Ronen,
Ateret Anaby-Tavor
Abstract:
Dialog is a core building block of human natural language interactions. It contains multi-party utterances used to convey information from one party to another in a dynamic and evolving manner. The ability to compare dialogs is beneficial in many real world use cases, such as conversation analytics for contact center calls and virtual agent design.
We propose a novel adaptation of the edit dista…
▽ More
Dialog is a core building block of human natural language interactions. It contains multi-party utterances used to convey information from one party to another in a dynamic and evolving manner. The ability to compare dialogs is beneficial in many real world use cases, such as conversation analytics for contact center calls and virtual agent design.
We propose a novel adaptation of the edit distance metric to the scenario of dialog similarity. Our approach takes into account various conversation aspects such as utterance semantics, conversation flow, and the participants. We evaluate this new approach and compare it to existing document similarity measures on two publicly available datasets. The results demonstrate that our method outperforms the other approaches in capturing dialog flow, and is better aligned with the human perception of conversation similarity.
△ Less
Submitted 12 October, 2021;
originally announced October 2021.
-
Quantifying Cognitive Factors in Lexical Decline
Authors:
David Francis,
Ella Rabinovich,
Farhan Samir,
David Mortensen,
Suzanne Stevenson
Abstract:
We adopt an evolutionary view on language change in which cognitive factors (in addition to social ones) affect the fitness of words and their success in the linguistic ecosystem. Specifically, we propose a variety of psycholinguistic factors -- semantic, distributional, and phonological -- that we hypothesize are predictive of lexical decline, in which words greatly decrease in frequency over tim…
▽ More
We adopt an evolutionary view on language change in which cognitive factors (in addition to social ones) affect the fitness of words and their success in the linguistic ecosystem. Specifically, we propose a variety of psycholinguistic factors -- semantic, distributional, and phonological -- that we hypothesize are predictive of lexical decline, in which words greatly decrease in frequency over time. Using historical data across three languages (English, French, and German), we find that most of our proposed factors show a significant difference in the expected direction between each curated set of declining words and their matched stable words. Moreover, logistic regression analyses show that semantic and distributional factors are significant in predicting declining words. Further diachronic analysis reveals that declining words tend to decrease in the diversity of their lexical contexts over time, gradually narrowing their 'ecological niches'.
△ Less
Submitted 12 October, 2021;
originally announced October 2021.
-
Pick a Fight or Bite your Tongue: Investigation of Gender Differences in Idiomatic Language Usage
Authors:
Ella Rabinovich,
Hila Gonen,
Suzanne Stevenson
Abstract:
A large body of research on gender-linked language has established foundations regarding cross-gender differences in lexical, emotional, and topical preferences, along with their sociological underpinnings. We compile a novel, large and diverse corpus of spontaneous linguistic productions annotated with speakers' gender, and perform a first large-scale empirical study of distinctions in the usage…
▽ More
A large body of research on gender-linked language has established foundations regarding cross-gender differences in lexical, emotional, and topical preferences, along with their sociological underpinnings. We compile a novel, large and diverse corpus of spontaneous linguistic productions annotated with speakers' gender, and perform a first large-scale empirical study of distinctions in the usage of \textit{figurative language} between male and female authors. Our analyses suggest that (1) idiomatic choices reflect gender-specific lexical and semantic preferences in general language, (2) men's and women's idiomatic usages express higher emotion than their literal language, with detectable, albeit more subtle, differences between male and female authors along the dimension of dominance compared to similar distinctions in their literal utterances, and (3) contextual analysis of idiomatic expressions reveals considerable differences, reflecting subtle divergences in usage environments, shaped by cross-gender communication styles and semantic biases.
△ Less
Submitted 31 October, 2020;
originally announced November 2020.
-
Exploration of Gender Differences in COVID-19 Discourse on Reddit
Authors:
Jai Aggarwal,
Ella Rabinovich,
Suzanne Stevenson
Abstract:
Decades of research on differences in the language of men and women have established postulates about preferences in lexical, topical, and emotional expression between the two genders, along with their sociological underpinnings. Using a novel dataset of male and female linguistic productions collected from the Reddit discussion platform, we further confirm existing assumptions about gender-linked…
▽ More
Decades of research on differences in the language of men and women have established postulates about preferences in lexical, topical, and emotional expression between the two genders, along with their sociological underpinnings. Using a novel dataset of male and female linguistic productions collected from the Reddit discussion platform, we further confirm existing assumptions about gender-linked affective distinctions, and demonstrate that these distinctions are amplified in social media postings involving emotionally-charged discourse related to COVID-19. Our analysis also confirms considerable differences in topical preferences between male and female authors in spontaneous pandemic-related discussions.
△ Less
Submitted 13 August, 2020;
originally announced August 2020.
-
The Typology of Polysemy: A Multilingual Distributional Framework
Authors:
Ella Rabinovich,
Yang Xu,
Suzanne Stevenson
Abstract:
Lexical semantic typology has identified important cross-linguistic generalizations about the variation and commonalities in polysemy patterns---how languages package up meanings into words. Recent computational research has enabled investigation of lexical semantics at a much larger scale, but little work has explored lexical typology across semantic domains, nor the factors that influence cross-…
▽ More
Lexical semantic typology has identified important cross-linguistic generalizations about the variation and commonalities in polysemy patterns---how languages package up meanings into words. Recent computational research has enabled investigation of lexical semantics at a much larger scale, but little work has explored lexical typology across semantic domains, nor the factors that influence cross-linguistic similarities. We present a novel computational framework that quantifies semantic affinity, the cross-linguistic similarity of lexical semantics for a concept. Our approach defines a common multilingual semantic space that enables a direct comparison of the lexical expression of concepts across languages. We validate our framework against empirical findings on lexical semantic typology at both the concept and domain levels. Our results reveal an intricate interaction between semantic domains and extra-linguistic factors, beyond language phylogeny, that co-shape the typology of polysemy across languages.
△ Less
Submitted 2 June, 2020;
originally announced June 2020.
-
Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods
Authors:
Maria Ryskina,
Ella Rabinovich,
Taylor Berg-Kirkpatrick,
David R. Mortensen,
Yulia Tsvetkov
Abstract:
We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find…
▽ More
We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find more support for the latter hypothesis. Besides presenting a new linguistic application of distributional semantics, this study tackles the linguistic question of the role of language-internal factors (in our case, sparsity) in language change motivated by language-external factors (reflected in frequency growth).
△ Less
Submitted 21 January, 2020;
originally announced January 2020.
-
Say Anything: Automatic Semantic Infelicity Detection in L2 English Indefinite Pronouns
Authors:
Ella Rabinovich,
Julia Watson,
Barend Beekhuizen,
Suzanne Stevenson
Abstract:
Computational research on error detection in second language speakers has mainly addressed clear grammatical anomalies typical to learners at the beginner-to-intermediate level. We focus instead on acquisition of subtle semantic nuances of English indefinite pronouns by non-native speakers at varying levels of proficiency. We first lay out theoretical, linguistically motivated hypotheses, and supp…
▽ More
Computational research on error detection in second language speakers has mainly addressed clear grammatical anomalies typical to learners at the beginner-to-intermediate level. We focus instead on acquisition of subtle semantic nuances of English indefinite pronouns by non-native speakers at varying levels of proficiency. We first lay out theoretical, linguistically motivated hypotheses, and supporting empirical evidence on the nature of the challenges posed by indefinite pronouns to English learners. We then suggest and evaluate an automatic approach for detection of atypical usage patterns, demonstrating that deep learning architectures are promising for this task involving nuanced semantic anomalies.
△ Less
Submitted 17 September, 2019;
originally announced September 2019.
-
CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums
Authors:
Ella Rabinovich,
Masih Sultani,
Suzanne Stevenson
Abstract:
In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched producti…
▽ More
In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far. We investigate whether findings in oral code-switching concerning content and style, as well as speaker proficiency, are carried over into written code-switching in discussion forums. The released dataset can further facilitate a range of research and practical activities.
△ Less
Submitted 30 August, 2019;
originally announced August 2019.
-
Controversy in Context
Authors:
Benjamin Sznajder,
Ariel Gera,
Yonatan Bilu,
Dafna Sheinwald,
Ella Rabinovich,
Ranit Aharonov,
David Konopnicki,
Noam Slonim
Abstract:
With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and,…
▽ More
With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.
△ Less
Submitted 20 August, 2019;
originally announced August 2019.
-
Learning Concept Abstractness Using Weak Supervision
Authors:
Ella Rabinovich,
Benjamin Sznajder,
Artem Spector,
Ilya Shnayderman,
Ranit Aharonov,
David Konopnicki,
Noam Slonim
Abstract:
We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The results imply the applicability of this appro…
▽ More
We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The results imply the applicability of this approach to additional properties of concepts, additional languages, and resource-scarce scenarios.
△ Less
Submitted 4 September, 2018;
originally announced September 2018.
-
Native Language Cognate Effects on Second Language Lexical Choice
Authors:
Ella Rabinovich,
Yulia Tsvetkov,
Shuly Wintner
Abstract:
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are abl…
▽ More
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages. We quantitatively analyze non-native lexical choice, highlighting cognate facilitation as one of the important phenomena sha** the language of non-native speakers.
△ Less
Submitted 24 May, 2018;
originally announced May 2018.
-
The UN Parallel Corpus Annotated for Translation Direction
Authors:
Elad Tolochinsky,
Ohad Mosafi,
Ella Rabinovich,
Shuly Wintner
Abstract:
This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language-pairs annotated for translation direction, and then classify the data by using various feature extraction methods. We compare the different methods a…
▽ More
This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language-pairs annotated for translation direction, and then classify the data by using various feature extraction methods. We compare the different methods as well as the ability to distinguish between translated and original texts in the different languages. The annotated corpus is publicly available.
△ Less
Submitted 19 May, 2018;
originally announced May 2018.
-
Found in Translation: Reconstructing Phylogenetic Language Trees from Translations
Authors:
Ella Rabinovich,
Noam Ordan,
Shuly Wintner
Abstract:
Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source…
▽ More
Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source language by looking only at the translation. Specifically, we automatically reconstruct phylogenetic language trees from monolingual texts (translated from several source languages). The signal of the source language is so powerful that it is retained even after two phases of translation. This strongly indicates that source language interference is the most dominant characteristic of translated texts, overshadowing the more subtle signals of universal properties of translation.
△ Less
Submitted 24 April, 2017;
originally announced April 2017.
-
Personalized Machine Translation: Preserving Original Author Traits
Authors:
Ella Rabinovich,
Shachar Mirkin,
Raj Nath Patel,
Lucia Specia,
Shuly Wintner
Abstract:
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author's gender has a powerful, clear signal in originals texts, but this signal is obfuscated in hum…
▽ More
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author's gender has a powerful, clear signal in originals texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.
△ Less
Submitted 12 January, 2017; v1 submitted 18 October, 2016;
originally announced October 2016.
-
Unsupervised Identification of Translationese
Authors:
Ella Rabinovich,
Shuly Wintner
Abstract:
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained…
▽ More
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.
△ Less
Submitted 11 September, 2016;
originally announced September 2016.
-
On the Similarities Between Native, Non-native and Translated Texts
Authors:
Ella Rabinovich,
Sergiu Nisioi,
Noam Ordan,
Shuly Wintner
Abstract:
We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable…
▽ More
We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable; (2) non-native language and translations are closer to each other than each of them is to native language; and (3) some of these characteristics depend on the source or native language, while others do not, reflecting, perhaps, unified principles that similarly affect translations and non-native language.
△ Less
Submitted 11 September, 2016;
originally announced September 2016.
-
A Parallel Corpus of Translationese
Authors:
Ella Rabinovich,
Shuly Wintner,
Ofek Luis Lewinsohn
Abstract:
We describe a set of bilingual English--French and English--German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research of translationese and its applications to (human and machine) tra…
▽ More
We describe a set of bilingual English--French and English--German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research of translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that enjoys a growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
△ Less
Submitted 6 March, 2016; v1 submitted 11 September, 2015;
originally announced September 2015.