-
FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning
Authors:
Joel Niklaus,
Lucia Zheng,
Arya D. McCarthy,
Christopher Hahn,
Brian M. Rosen,
Peter Henderson,
Daniel E. Ho,
Garrett Honke,
Percy Liang,
Christopher Manning
Abstract:
Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictio…
▽ More
Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including improving Flan-T5 XL by 8 points or 16\% over the baseline. However, the effect does not generalize across all tasks, training regimes, model sizes, and other factors. LawInstruct is a resource for accelerating the development of models with stronger information processing and decision making capabilities in the legal domain.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models
Authors:
Arya D. McCarthy,
Hao Zhang,
Shankar Kumar,
Felix Stahlberg,
Ke Wu
Abstract:
One challenge in speech translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we adapt large language models (LLMs) to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We overcome the tendency of hallucination in LLMs…
▽ More
One challenge in speech translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we adapt large language models (LLMs) to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We overcome the tendency of hallucination in LLMs by incorporating finite-state constraints during decoding; these eliminate invalid outputs without requiring additional training. We discover that LLMs are adaptable to transcripts containing ASR errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art automatic punctuation baseline, our best LLM improves the average BLEU by 2.9 points for English-German, English-Spanish, and English-Arabic TED talk translation in 9 test sets, just by improving segmentation.
△ Less
Submitted 23 October, 2023; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Authors:
Abteen Ebrahimi,
Arya D. McCarthy,
Arturo Oncevay,
Luis Chiruzzo,
John E. Ortega,
Gustavo A. Giménez-Lugo,
Rolando Coto-Solano,
Katharina Kann
Abstract:
Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribu…
▽ More
Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
△ Less
Submitted 15 February, 2023;
originally announced February 2023.
-
Improved Long-Form Spoken Language Translation with Large Language Models
Authors:
Arya D. McCarthy,
Hao Zhang,
Shankar Kumar,
Felix Stahlberg,
Axel H. Ng
Abstract:
A challenge in spoken language translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we fine-tune a general-purpose, large language model to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We compare to several segmen…
▽ More
A challenge in spoken language translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we fine-tune a general-purpose, large language model to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We compare to several segmentation strategies and find that our approach improves BLEU score on three languages by an average of 2.7 BLEU overall compared to an automatic punctuation baseline. Further, we demonstrate the effectiveness of two constrained decoding strategies to improve well-formedness of the model output from above 99% to 100%.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
A Major Obstacle for NLP Research: Let's Talk about Time Allocation!
Authors:
Katharina Kann,
Shiran Dudy,
Arya D. McCarthy
Abstract:
The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible amount of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products. However, this paper argues that we have been less successful than we should have been and reflects on where and how the field fails to ta…
▽ More
The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible amount of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products. However, this paper argues that we have been less successful than we should have been and reflects on where and how the field fails to tap its full potential. Specifically, we demonstrate that, in recent years, subpar time allocation has been a major obstacle for NLP research. We outline multiple concrete problems together with their negative consequences and, importantly, suggest remedies to improve the status quo. We hope that this paper will be a starting point for discussions around which common practices are -- or are not -- beneficial for NLP research.
△ Less
Submitted 30 November, 2022;
originally announced November 2022.
-
UniMorph 4.0: Universal Morphology
Authors:
Khuyagbaatar Batsuren,
Omer Goldman,
Salam Khalifa,
Nizar Habash,
Witold Kieraś,
Gábor Bella,
Brian Leonard,
Garrett Nicolai,
Kyle Gorman,
Yustinus Ghanggo Ate,
Maria Ryskina,
Sabrina J. Mielke,
Elena Budianskaya,
Charbel El-Khaissi,
Tiago Pimentel,
Michael Gasser,
William Lane,
Mohit Raj,
Matt Coler,
Jaime Rafael Montoya Samame,
Delio Siticonatzi Camaiteri,
Benoît Sagot,
Esaú Zumaeta Rojas,
Didier López Francis,
Arturo Oncevay
, et al. (71 additional authors not shown)
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…
▽ More
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
△ Less
Submitted 19 June, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
Morphological Processing of Low-Resource Languages: Where We Are and What's Next
Authors:
Adam Wiemerslage,
Miikka Silfverberg,
Changbing Yang,
Arya D. McCarthy,
Garrett Nicolai,
Eliana Colunga,
Katharina Kann
Abstract:
Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent…
▽ More
Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent developments in computational morphology with a focus on low-resource languages. Second, we argue that the field is ready to tackle the logical next challenge: understanding a language's morphology from raw text alone. We perform an empirical study on a truly unsupervised version of the paradigm completion task and show that, while existing state-of-the-art models bridged by two newly proposed models we devise perform reasonably, there is still much room for improvement. The stakes are high: solving this task will increase the language coverage of morphological resources by a number of magnitudes.
△ Less
Submitted 16 March, 2022;
originally announced March 2022.
-
Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
Authors:
En-Shiun Annie Lee,
Sarubi Thillainathan,
Shravan Nayak,
Surangika Ranathunga,
David Ifeoluwa Adelani,
Ruisi Su,
Arya D. McCarthy
Abstract:
What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) langu…
▽ More
What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title's question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.
△ Less
Submitted 30 April, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
AirWare: Utilizing Embedded Audio and Infrared Signals for In-Air Hand-Gesture Recognition
Authors:
Nibhrat Lohia,
Raunak Mundada,
Arya D. McCarthy,
Eric C. Larson
Abstract:
We introduce AirWare, an in-air hand-gesture recognition system that uses the already embedded speaker and microphone in most electronic devices, together with embedded infrared proximity sensors. Gestures identified by AirWare are performed in the air above a touchscreen or a mobile phone. AirWare utilizes convolutional neural networks to classify a large vocabulary of hand gestures using multi-m…
▽ More
We introduce AirWare, an in-air hand-gesture recognition system that uses the already embedded speaker and microphone in most electronic devices, together with embedded infrared proximity sensors. Gestures identified by AirWare are performed in the air above a touchscreen or a mobile phone. AirWare utilizes convolutional neural networks to classify a large vocabulary of hand gestures using multi-modal audio Doppler signatures and infrared (IR) sensor information. As opposed to other systems which use high frequency Doppler radars or depth cameras to uniquely identify in-air gestures, AirWare does not require any external sensors. In our analysis, we use openly available APIs to interface with the Samsung Galaxy S5 audio and proximity sensors for data collection. We find that AirWare is not reliable enough for a deployable interaction system when trying to classify a gesture set of 21 gestures, with an average true positive rate of only 50.5% per gesture. To improve performance, we train AirWare to identify subsets of the 21 gestures vocabulary based on possible usage scenarios. We find that AirWare can identify three gesture sets with average true positive rate greater than 80% using 4--7 gestures per set, which comprises a vocabulary of 16 unique in-air gestures.
△ Less
Submitted 25 January, 2021;
originally announced January 2021.
-
Unsupervised Morphological Paradigm Completion
Authors:
Huiming **,
Liwei Cai,
Yihui Peng,
Chen Xia,
Arya D. McCarthy,
Katharina Kann
Abstract:
We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or…
▽ More
We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or to assist linguistic annotators. From a cognitive science perspective, this can shed light on how children acquire morphological knowledge. We further introduce a system for the task, which generates morphological paradigms via the following steps: (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. We perform an evaluation on 14 typologically diverse languages. Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.
△ Less
Submitted 20 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Predicting Declension Class from Form and Meaning
Authors:
Adina Williams,
Tiago Pimentel,
Arya D. McCarthy,
Hagen Blix,
Eleanor Chodroff,
Ryan Cotterell
Abstract:
The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits…
▽ More
The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits, we can glean about declension class from knowing the form and/or meaning of nouns. We know that form and meaning are often also indicative of grammatical gender---which, as we quantitatively verify, can itself share information with declension class---so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declensions. Secondly, we show not only that individual declensions classes vary in the strength of their clues within a language, but also that these variations themselves vary across languages.
△ Less
Submitted 28 May, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation
Authors:
Arya D. McCarthy,
Liezl Puzon,
Juan Pino
Abstract:
We propose autoencoding speaker conversion for training data augmentation in automatic speech translation. This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice. Our method compares favorably to SpecAugment on English$\to$French and English$\to$Romanian automatic speech translation (AST) tasks as well as on a low-resource English a…
▽ More
We propose autoencoding speaker conversion for training data augmentation in automatic speech translation. This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice. Our method compares favorably to SpecAugment on English$\to$French and English$\to$Romanian automatic speech translation (AST) tasks as well as on a low-resource English automatic speech recognition (ASR) task. Further, in ablations, we show the benefits of both quantity and diversity in augmented data. Finally, we show that we can combine our approach with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model on an English$\to$French AST task. Our method is sufficiently general that it can be applied to other speech generation and analysis tasks.
△ Less
Submitted 27 February, 2020;
originally announced February 2020.
-
The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
Authors:
Arya D. McCarthy,
Ekaterina Vylomova,
Shijie Wu,
Chaitanya Malaviya,
Lawrence Wolf-Sonkin,
Garrett Nicolai,
Christo Kirov,
Miikka Silfverberg,
Sabrina J. Mielke,
Jeffrey Heinz,
Ryan Cotterell,
Mans Hulden
Abstract:
The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low…
▽ More
The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year's strong baselines or highly ranked systems from previous years' shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.
△ Less
Submitted 25 February, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Modeling Color Terminology Across Thousands of Languages
Authors:
Arya D. McCarthy,
Winston Wu,
Aaron Mueller,
Bill Watson,
David Yarowsky
Abstract:
There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Co…
▽ More
There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically-grounded computational linguistic metrics we design---as well as their aggregation---correlate strongly with both the Berlin and Kay basic/secondary color term partition (gamma=0.96) and their hypothesized universal acquisition sequence. The measures and result provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.
△ Less
Submitted 3 October, 2019;
originally announced October 2019.
-
Improved Variational Neural Machine Translation by Promoting Mutual Information
Authors:
Arya D. McCarthy,
Xian Li,
Jiatao Gu,
Ning Dong
Abstract:
Posterior collapse plagues VAEs for text, especially for conditional text generation with strong autoregressive decoders. In this work, we address this problem in variational neural machine translation by explicitly promoting mutual information between the latent variables and the data. Our model extends the conditional variational autoencoder (CVAE) with two new ingredients: first, we propose a m…
▽ More
Posterior collapse plagues VAEs for text, especially for conditional text generation with strong autoregressive decoders. In this work, we address this problem in variational neural machine translation by explicitly promoting mutual information between the latent variables and the data. Our model extends the conditional variational autoencoder (CVAE) with two new ingredients: first, we propose a modified evidence lower bound (ELBO) objective which explicitly promotes mutual information; second, we regularize the probabilities of the decoder by mixing an auxiliary factorized distribution which is directly predicted by the latent variables. We present empirical results on the Transformer architecture and show the proposed model effectively addressed posterior collapse: latent variables are no longer ignored in the presence of powerful decoder. As a result, the proposed model yields improved translation quality while demonstrating superior performance in terms of data efficiency and robustness.
△ Less
Submitted 19 September, 2019;
originally announced September 2019.
-
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
Authors:
Juan Pino,
Liezl Puzon,
Jiatao Gu,
Xutai Ma,
Arya D. McCarthy,
Deepak Gopinath
Abstract:
For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and…
▽ More
For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English--French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the English--Romanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we present practical recommendations for augmentation and pretraining approaches. Finally, we decrease the performance gap to 0.01 BLEU using a Transformer-based architecture.
△ Less
Submitted 22 October, 2019; v1 submitted 13 September, 2019;
originally announced September 2019.
-
Meaning to Form: Measuring Systematicity as Information
Authors:
Tiago Pimentel,
Arya D. McCarthy,
Damián E. Blasi,
Brian Roark,
Ryan Cotterell
Abstract:
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram \textit{gl} have any systematic relationship to the meaning of words like \textit{glisten}, \textit{gleam} and \textit{gl…
▽ More
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram \textit{gl} have any systematic relationship to the meaning of words like \textit{glisten}, \textit{gleam} and \textit{glow}? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: Our approximate effect size (measured in bits) is quite small---despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language.
△ Less
Submitted 26 July, 2019; v1 submitted 13 June, 2019;
originally announced June 2019.
-
An Exact No Free Lunch Theorem for Community Detection
Authors:
Arya D. McCarthy,
Tongfei Chen,
Seth Ebner
Abstract:
A precondition for a No Free Lunch theorem is evaluation with a loss function which does not assume a priori superiority of some outputs over others. A previous result for community detection by Peel et al. (2017) relies on a mismatch between the loss function and the problem domain. The loss function computes an expectation over only a subset of the universe of possible outputs; thus, it is only…
▽ More
A precondition for a No Free Lunch theorem is evaluation with a loss function which does not assume a priori superiority of some outputs over others. A previous result for community detection by Peel et al. (2017) relies on a mismatch between the loss function and the problem domain. The loss function computes an expectation over only a subset of the universe of possible outputs; thus, it is only asymptotically appropriate with respect to the problem size. By using the correct random model for the problem domain, we provide a stronger, exact No Free Lunch theorem for community detection. The claim generalizes to other set-partitioning tasks including core/periphery separation, $k$-clustering, and graph partitioning. Finally, we review the literature of proposed evaluation functions and identify functions which (perhaps with slight modifications) are compatible with an exact No Free Lunch theorem.
△ Less
Submitted 24 March, 2019;
originally announced March 2019.
-
Metrics matter in community detection
Authors:
Arya D. McCarthy,
Tongfei Chen,
Rachel Rudinger,
David W. Matula
Abstract:
We present a critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection. NMI exaggerates the leximin method's performance on weak communities: Does leximin, in finding the trivial singletons clustering, truly outperform eight other community detection methods? Three NMI improvements from the literature are AMI, rrNMI, and cNMI. We show equivalences u…
▽ More
We present a critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection. NMI exaggerates the leximin method's performance on weak communities: Does leximin, in finding the trivial singletons clustering, truly outperform eight other community detection methods? Three NMI improvements from the literature are AMI, rrNMI, and cNMI. We show equivalences under relevant random models, and for evaluating community detection, we advise one-sided AMI under the $\mathbb{M}_{\mathrm{all}}$ model (all partitions of $n$ nodes). This work seeks (1) to start a conversation on robust measurements, and (2) to advocate evaluations which do not give "free lunch".
△ Less
Submitted 4 January, 2019;
originally announced January 2019.
-
UniMorph 2.0: Universal Morphology
Authors:
Christo Kirov,
Ryan Cotterell,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Patrick Xia,
Manaal Faruqui,
Sabrina J. Mielke,
Arya D. McCarthy,
Sandra Kübler,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The Universal Morphology UniMorph project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema.…
▽ More
The Universal Morphology UniMorph project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland and is sponsored by the DARPA LORELEI program. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016. lexical resources} }
△ Less
Submitted 25 February, 2020; v1 submitted 25 October, 2018;
originally announced October 2018.
-
The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Authors:
Ryan Cotterell,
Christo Kirov,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Arya D. McCarthy,
Katharina Kann,
Sabrina J. Mielke,
Garrett Nicolai,
Miikka Silfverberg,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a…
▽ More
The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task. This second task featured seven languages. Task 1 received 27 submissions and task 2 received 6 submissions. Both tasks featured a low, medium, and high data condition. Nearly all submissions featured a neural component and built on highly-ranked systems from the earlier 2017 shared task. In the inflection task (task 1), 41 of the 52 languages present in last year's inflection task showed improvement by the best systems in the low-resource setting. The cloze task (task 2) proved to be difficult, and few submissions managed to consistently improve upon both a simple neural baseline system and a lemma-repeating baseline.
△ Less
Submitted 25 February, 2020; v1 submitted 16 October, 2018;
originally announced October 2018.
-
Marrying Universal Dependencies and Universal Morphology
Authors:
Arya D. McCarthy,
Miikka Silfverberg,
Ryan Cotterell,
Mans Hulden,
David Yarowsky
Abstract:
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. Wi…
▽ More
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project's annotations could be used to validate the other's. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic map** from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.
△ Less
Submitted 15 October, 2018;
originally announced October 2018.
-
Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation
Authors:
Brian Thompson,
Huda Khayrallah,
Antonios Anastasopoulos,
Arya D. McCarthy,
Kevin Duh,
Rebecca Marvin,
Paul McNamee,
Jeremy Gwinnup,
Tim Anderson,
Philipp Koehn
Abstract:
To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surpri…
▽ More
To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.
△ Less
Submitted 15 January, 2019; v1 submitted 13 September, 2018;
originally announced September 2018.