
Showing 1–23 of 23 results for author: McCarthy, A D

  1. arXiv:2404.02127  [pdf, other]

    cs.CL cs.AI cs.LG

    FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

    Authors: Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, Christopher Manning

    Abstract: Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictio…

    Submitted 2 April, 2024; originally announced April 2024.

    MSC Class: 68T50 ACM Class: I.2

  2. arXiv:2310.13678  [pdf, other]

    cs.CL cs.AI cs.LG

    Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models

    Authors: Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Ke Wu

    Abstract: One challenge in speech translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we adapt large language models (LLMs) to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We overcome the tendency of hallucination in LLMs…

    Submitted 23 October, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: accepted to the Findings of EMNLP 2023. arXiv admin note: text overlap with arXiv:2212.09895

  3. arXiv:2302.07912  [pdf, other]

    cs.CL

    Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

    Authors: Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

    Abstract: Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribu…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: EACL 2023

  4. arXiv:2212.09895  [pdf, other]

    cs.CL

    Improved Long-Form Spoken Language Translation with Large Language Models

    Authors: Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Axel H. Ng

    Abstract: A challenge in spoken language translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we fine-tune a general-purpose, large language model to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We compare to several segmen…

    Submitted 19 December, 2022; originally announced December 2022.

  5. arXiv:2211.16858  [pdf, other]

    cs.CL

    A Major Obstacle for NLP Research: Let's Talk about Time Allocation!

    Authors: Katharina Kann, Shiran Dudy, Arya D. McCarthy

    Abstract: The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible amount of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products. However, this paper argues that we have been less successful than we should have been and reflects on where and how the field fails to ta…

    Submitted 30 November, 2022; originally announced November 2022.

    Comments: To appear at EMNLP 2022

  6. arXiv:2205.03608  [pdf, other]

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  7. arXiv:2203.08909  [pdf, other]

    cs.CL

    Morphological Processing of Low-Resource Languages: Where We Are and What's Next

    Authors: Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

    Abstract: Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent…

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Findings of ACL 2022

  8. arXiv:2203.08850  [pdf, other]

    cs.CL

    Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

    Authors: En-Shiun Annie Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy

    Abstract: What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) langu…

    Submitted 30 April, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: Accepted to Findings of ACL 2022

  9. arXiv:2101.10245  [pdf, other]

    cs.HC

    AirWare: Utilizing Embedded Audio and Infrared Signals for In-Air Hand-Gesture Recognition

    Authors: Nibhrat Lohia, Raunak Mundada, Arya D. McCarthy, Eric C. Larson

    Abstract: We introduce AirWare, an in-air hand-gesture recognition system that uses the already embedded speaker and microphone in most electronic devices, together with embedded infrared proximity sensors. Gestures identified by AirWare are performed in the air above a touchscreen or a mobile phone. AirWare utilizes convolutional neural networks to classify a large vocabulary of hand gestures using multi-m…

    Submitted 25 January, 2021; originally announced January 2021.

  10. arXiv:2005.00970  [pdf, other]

    cs.CL

    Unsupervised Morphological Paradigm Completion

    Authors: Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, Katharina Kann

    Abstract: We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or…

    Submitted 20 May, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted by ACL 2020

  11. arXiv:2005.00626  [pdf, other]

    cs.CL

    Predicting Declension Class from Form and Meaning

    Authors: Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell

    Abstract: The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits…

    Submitted 28 May, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: 14 pages, 2 figures, this is the camera-ready version accepted at the 2020 Annual Conference of the Association for Computational Linguistics (ACL 2020)

  12. arXiv:2002.12231  [pdf, other]

    eess.AS cs.CL cs.SD

    SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

    Authors: Arya D. McCarthy, Liezl Puzon, Juan Pino

    Abstract: We propose autoencoding speaker conversion for training data augmentation in automatic speech translation. This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice. Our method compares favorably to SpecAugment on English$\to$French and English$\to$Romanian automatic speech translation (AST) tasks as well as on a low-resource English a…

    Submitted 27 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  13. The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

    Authors: Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden

    Abstract: The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low…

    Submitted 25 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Presented at SIGMORPHON 2019

    Journal ref: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (2019) 229-244

  14. arXiv:1910.01531  [pdf, other]

    cs.CL

    Modeling Color Terminology Across Thousands of Languages

    Authors: Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky

    Abstract: There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Co…

    Submitted 3 October, 2019; originally announced October 2019.

    Comments: Accepted for presentation at EMNLP-IJCNLP 2019

  15. arXiv:1909.09237  [pdf, other]

    cs.CL

    Improved Variational Neural Machine Translation by Promoting Mutual Information

    Authors: Arya D. McCarthy, Xian Li, Jiatao Gu, Ning Dong

    Abstract: Posterior collapse plagues VAEs for text, especially for conditional text generation with strong autoregressive decoders. In this work, we address this problem in variational neural machine translation by explicitly promoting mutual information between the latent variables and the data. Our model extends the conditional variational autoencoder (CVAE) with two new ingredients: first, we propose a m…

    Submitted 19 September, 2019; originally announced September 2019.

  16. arXiv:1909.06515  [pdf, other]

    cs.CL cs.SD eess.AS

    Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade

    Authors: Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

    Abstract: For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and…

    Submitted 22 October, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: IWSLT 2019

  17. arXiv:1906.05906  [pdf, other]

    cs.CL

    Meaning to Form: Measuring Systematicity as Information

    Authors: Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell

    Abstract: A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram \textit{gl} have any systematic relationship to the meaning of words like \textit{glisten}, \textit{gleam} and \textit{gl…

    Submitted 26 July, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Accepted for publication at ACL 2019

  18. An Exact No Free Lunch Theorem for Community Detection

    Authors: Arya D. McCarthy, Tongfei Chen, Seth Ebner

    Abstract: A precondition for a No Free Lunch theorem is evaluation with a loss function which does not assume a priori superiority of some outputs over others. A previous result for community detection by Peel et al. (2017) relies on a mismatch between the loss function and the problem domain. The loss function computes an expectation over only a subset of the universe of possible outputs; thus, it is only…

    Submitted 24 March, 2019; originally announced March 2019.

    Journal ref: Complex Networks and Their Applications VIII. COMPLEX NETWORKS 2019. Studies in Computational Intelligence, vol 881

  19. arXiv:1901.01354  [pdf, other]

    cs.SI physics.soc-ph

    Metrics matter in community detection

    Authors: Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, David W. Matula

    Abstract: We present a critical evaluation of normalized mutual information (NMI) as an evaluation metric for community detection. NMI exaggerates the leximin method's performance on weak communities: Does leximin, in finding the trivial singletons clustering, truly outperform eight other community detection methods? Three NMI improvements from the literature are AMI, rrNMI, and cNMI. We show equivalences u…

    Submitted 4 January, 2019; originally announced January 2019.

    Journal ref: Complex Networks and Their Applications VIII. COMPLEX NETWORKS 2019. Studies in Computational Intelligence, vol 881

  20. arXiv:1810.11101  [pdf, other]

    cs.CL

    UniMorph 2.0: Universal Morphology

    Authors: Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The Universal Morphology UniMorph project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema.…

    Submitted 25 February, 2020; v1 submitted 25 October, 2018; originally announced October 2018.

    Comments: LREC 2018

  21. arXiv:1810.07125  [pdf, other]

    cs.CL

    The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a…

    Submitted 25 February, 2020; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031

  22. Marrying Universal Dependencies and Universal Morphology

    Authors: Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky

    Abstract: The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. Wi…

    Submitted 15 October, 2018; originally announced October 2018.

    Comments: UDW18

    Journal ref: Proceedings of the Second Workshop on Universal Dependencies (2018) 91-101

  23. Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation

    Authors: Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, Philipp Koehn

    Abstract: To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surpri…

    Submitted 15 January, 2019; v1 submitted 13 September, 2018; originally announced September 2018.

    Comments: presented at WMT 2018. Please cite using the bib entry from here: http://www.statmt.org/wmt18/bib/WMT013.bib

    Journal ref: Proceedings of the Third Conference on Machine Translation: Research Papers (2018) 124-132