Showing 1–50 of 58 results for author: Koehn, P

  1. arXiv:2406.13748  [pdf, other]

    cs.CL cs.LG

    Every Language Counts: Learn and Unlearn in Multilingual LLMs

    Authors: Taiming Lu, Philipp Koehn

    Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated con…

    Submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2406.03869  [pdf, other]

    cs.CL

    Recovering document annotations for sentence-level bitext

    Authors: Rachel Wicks, Matt Post, Philipp Koehn

    Abstract: Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets hav…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

  3. arXiv:2405.20389  [pdf, other]

    astro-ph.IM cs.AI cs.HC cs.IR

    Designing an Evaluation Framework for Large Language Models in Astronomy Research

    Authors: John F. Wu, Alina Hyk, Kiera McCormick, Christine Ye, Simone Astarita, Elina Baral, Jo Ciuca, Jesse Cranney, Anjalie Field, Kartheik Iyer, Philipp Koehn, Jenn Kotler, Sandor Kruk, Michelle Ntampaka, Charles O'Neill, Joshua E. G. Peek, Sanjib Sharma, Mikaeel Yunus

    Abstract: Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy rese…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 7 pages, 3 figures. Code available at https://github.com/jsalt2024-evaluating-llms-for-astronomy/astro-arxiv-bot

  4. arXiv:2405.13274  [pdf, other]

    cs.CL

    DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

    Authors: Weiting Tan, Jingyu Zhang, Lingfeng Shen, Daniel Khashabi, Philipp Koehn

    Abstract: Non-autoregressive Transformers (NATs) are recently applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distribution (e.g., acoustic and lingu…

    Submitted 21 May, 2024; originally announced May 2024.

  5. arXiv:2403.10963  [pdf, other]

    cs.CL

    Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!

    Authors: Niyati Bafna, Philipp Koehn, David Yarowsky

    Abstract: While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such la…

    Submitted 17 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

    Comments: 5 pages, Accepted at Workshop on Insights from Negative Results in NLP (NAACL) 2024

  6. arXiv:2402.01172  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming Sequence Transduction through Dynamic Compression

    Authors: Weiting Tan, Yunmo Chen, Tongfei Chen, Guanghui Qin, Haoran Xu, Heidi C. Zhang, Benjamin Van Durme, Philipp Koehn

    Abstract: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrat…

    Submitted 2 February, 2024; originally announced February 2024.

  7. arXiv:2401.13136  [pdf, other]

    cs.CL cs.AI

    The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts

    Authors: Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi

    Abstract: As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious…

    Submitted 23 January, 2024; originally announced January 2024.

  8. arXiv:2311.03127  [pdf, other]

    cs.CL cs.AI

    Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

    Authors: Longyue Wang, Zhaopeng Tu, Yan Gu, Siyou Liu, Dian Yu, Qingsong Ma, Chenyang Lyu, Liting Zhou, Chao-Hong Liu, Yufeng Ma, Weiyu Chen, Yvette Graham, Bonnie Webber, Philipp Koehn, Andy Way, Yulin Yuan, Shuming Shi

    Abstract: Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel co…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: WMT2023 Discourse-Level Literary Translation Shared Task Overview Paper

  9. arXiv:2311.02310  [pdf, other]

    cs.CL

    Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles

    Authors: Weiting Tan, Haoran Xu, Lingfeng Shen, Shuyue Stella Li, Kenton Murray, Philipp Koehn, Benjamin Van Durme, Yunmo Chen

    Abstract: Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, there remains a discernible gap comparing their performance with the few-shot setting. In this paper, we investigate the factors contributing…

    Submitted 3 November, 2023; originally announced November 2023.

  10. arXiv:2310.00840  [pdf, other]

    cs.CL

    Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

    Authors: Tianjian Li, Haoran Xu, Philipp Koehn, Daniel Khashabi, Kenton Murray

    Abstract: Text generation models are notoriously vulnerable to errors in the training data. With the wide-spread availability of massive amounts of web-crawled data becoming more commonplace, how can we enhance the robustness of models trained on a massive amount of noisy web-crawled text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective tha…

    Submitted 18 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  11. arXiv:2305.14280  [pdf, other]

    cs.CL

    Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer

    Authors: Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post

    Abstract: We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved performance compared to subword embeddings. We explore various properties of pixel representations such as parameter sharing within and across scripts to better…

    Submitted 24 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  12. arXiv:2305.13993  [pdf, other]

    cs.CL

    Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

    Authors: Haoran Xu, Weiting Tan, Shuyue Stella Li, Yunmo Chen, Benjamin Van Durme, Philipp Koehn, Kenton Murray

    Abstract: Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fu…

    Submitted 22 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted at the main conference of EMNLP 2023

  13. arXiv:2210.14378  [pdf, other]

    cs.CL cs.LG

    Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport

    Authors: Kelly Marchisio, Ali Saad-Eldin, Kevin Duh, Carey Priebe, Philipp Koehn

    Abstract: Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 Camera-Ready

  14. arXiv:2210.05098  [pdf, other]

    cs.CL cs.LG

    IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

    Authors: Kelly Marchisio, Neha Verma, Kevin Duh, Philipp Koehn

    Abstract: The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces -- their degree of "isomorphism." We address the root-cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into t…

    Submitted 4 July, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Updated EMNLP2022 Camera Ready (citation correction, removed references to dimensionality reduction [was not used here].)

  15. arXiv:2210.05033  [pdf, other]

    cs.CL

    Multilingual Representation Distillation with Contrastive Learning

    Authors: Weiting Tan, Kevin Heffernan, Holger Schwenk, Philipp Koehn

    Abstract: Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., find semantically similar sentences that can…

    Submitted 30 April, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  16. arXiv:2208.11194  [pdf, other]

    cs.CL

    Bitext Mining for Low-Resource Languages via Contrastive Learning

    Authors: Weiting Tan, Philipp Koehn

    Abstract: Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representation of language models fine-tuned with multiple negatives ranking loss, a contrastive objective, helps retrieve clean bitexts. Experiments show that parallel data mined from our approach substantially outperform the previous state-of-the-art method on low resource languages Khmer and Pa…

    Submitted 23 August, 2022; originally announced August 2022.

  17. arXiv:2207.04672  [pdf]

    cs.CL cs.AI

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, et al. (14 additional authors not shown)

    Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality res…

    Submitted 25 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: 190 pages

    MSC Class: 68T50 ACM Class: I.2.7

  18. arXiv:2205.11416  [pdf, other]

    cs.CL

    The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains

    Authors: Haoran Xu, Philipp Koehn, Kenton Murray

    Abstract: Recent model pruning methods have demonstrated the ability to remove redundant parameters without sacrificing model performance. Common methods remove redundant parameters according to the parameter sensitivity, a gradient-based measure reflecting the contribution of the parameters. In this paper, however, we argue that redundant parameters can be trained to make beneficial contributions. We first…

    Submitted 22 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: Accepted at EMNLP 2022

  19. arXiv:2205.08533  [pdf, ps, other]

    cs.CL

    Consistent Human Evaluation of Machine Translation across Language Pairs

    Authors: Daniel Licht, Cynthia Gao, Janice Lam, Francisco Guzman, Mona Diab, Philipp Koehn

    Abstract: Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given the high variability between human evaluators, partly due to subjective expectations for translation quality for different language pairs. We propose a new metric called XSTS that is more focused on semantic equivalence and a cross-lingual calibration method that enables more cons…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: 10 pages

  20. Learn To Remember: Transformer with Recurrent Memory for Document-Level Machine Translation

    Authors: Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, Philipp Koehn

    Abstract: The Transformer architecture has led to significant gains in machine translation. However, most studies focus on only sentence-level translation without considering the context dependency within documents, leading to the inadequacy of document-level coherence. Some recent research tried to mitigate this issue by introducing an additional context encoder or translating with multiple sentences or ev…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: Accepted by NAACL-2022 Findings

    Journal ref: Findings of the Association for Computational Linguistics: NAACL 2022, 1409--1420

  21. arXiv:2203.13867  [pdf, other]

    cs.CL cs.LG

    Data Selection Curriculum for Neural Machine Translation

    Authors: Tasnim Mohiuddin, Philipp Koehn, Vishrav Chaudhary, James Cross, Shruti Bhosale, Shafiq Joty

    Abstract: Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to the NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT where we fine-tune a base NMT model o…

    Submitted 25 March, 2022; originally announced March 2022.

  22. arXiv:2110.08250  [pdf, other]

    cs.CL cs.SD eess.AS

    Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

    Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Philipp Koehn, Juan Pino

    Abstract: We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model. Furthermore, the generation of translation is independent from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations, instead of continuous spectrogram features, learned in an unsupervised m…

    Submitted 12 January, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  23. arXiv:2110.07804  [pdf, other]

    cs.CL

    Alternative Input Signals Ease Transfer in Multilingual Machine Translation

    Authors: Simeng Sun, Angela Fan, James Cross, Vishrav Chaudhary, Chau Tran, Philipp Koehn, Francisco Guzman

    Abstract: Recent work in multilingual machine translation (MMT) has focused on the potential of positive transfer between languages, particularly cases where higher-resourced languages can benefit lower-resourced ones. While training an MMT model, the supervision signals learned from one language pair can be transferred to the other via the tokens shared by multiple source languages. However, the transfer i…

    Submitted 14 October, 2021; originally announced October 2021.

  24. arXiv:2110.05691  [pdf, other]

    cs.CL

    Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation

    Authors: Weiting Tan, Shuoyang Ding, Huda Khayrallah, Philipp Koehn

    Abstract: Neural Machine Translation (NMT) models are known to suffer from noisy inputs. To make models robust, we generate adversarial augmentation samples that attack the model and preserve the source-side semantic meaning at the same time. To generate such samples, we propose a doubly-trained architecture that pairs two NMT models of opposite translation directions with a joint loss function, which combi…

    Submitted 11 October, 2021; originally announced October 2021.

  25. arXiv:2109.12640  [pdf, other]

    cs.CL

    An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces

    Authors: Kelly Marchisio, Youngser Park, Ali Saad-Eldin, Anton Alyakin, Kevin Duh, Carey Priebe, Philipp Koehn

    Abstract: Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node's graph neighborhood without assuming a linear transform, and exp…

    Submitted 26 September, 2021; originally announced September 2021.

    Comments: EMNLP Findings 2021 Camera-Ready

  26. arXiv:2109.08724  [pdf, other]

    cs.CL

    The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task

    Authors: Shuoyang Ding, Marcin Junczys-Dowmunt, Matt Post, Christian Federmann, Philipp Koehn

    Abstract: This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip transla…

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: 7 Pages, Accepted to WMT21 (System Description)

  27. arXiv:2109.05611  [pdf, other]

    cs.CL

    Levenshtein Training for Word-level Quality Estimation

    Authors: Shuoyang Ding, Marcin Junczys-Dowmunt, Matt Post, Philipp Koehn

    Abstract: We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we pro…

    Submitted 15 September, 2021; v1 submitted 12 September, 2021; originally announced September 2021.

    Comments: 10 pages, 1 figure, Accepted to EMNLP 2021. Fixed a minor typo in Table 2 (en-zh WMT20 best result)

  28. arXiv:2108.03265  [pdf, other]

    cs.CL

    Facebook AI WMT21 News Translation Task Submission

    Authors: Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, Angela Fan

    Abstract: We describe Facebook's multilingual model submission to the WMT2021 shared task on news translation. We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. To develop systems covering all these directions, we focus on multilingual models. We utilize data from all available sources --- WMT, large-scale data mining, and in-domai…

    Submitted 6 August, 2021; originally announced August 2021.

  29. arXiv:2107.09186  [pdf, other]

    cs.CL

    Cross-Lingual BERT Contextual Embedding Space Mapping with Isotropic and Isometric Conditions

    Authors: Haoran Xu, Philipp Koehn

    Abstract: Typically, a linearly orthogonal transformation mapping is learned by aligning static type-level embeddings to build a shared semantic space. In view of the analysis that contextual embeddings contain richer semantic features, we investigate a context-aware and dictionary-free mapping approach by leveraging parallel corpora. We illustrate that our contextual embedding space mapping significantly o…

    Submitted 19 July, 2021; originally announced July 2021.

  30. arXiv:2106.11891  [pdf, other]

    cs.CL

    On the Evaluation of Machine Translation for Terminology Consistency

    Authors: Md Mahfuz ibn Alam, Antonios Anastasopoulos, Laurent Besacier, James Cross, Matthias Gallé, Philipp Koehn, Vassilina Nikoulina

    Abstract: As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with…

    Submitted 24 June, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: preprint

  31. arXiv:2105.15071  [pdf, other]

    cs.CL

    Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

    Authors: Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, Mona Diab

    Abstract: The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a…

    Submitted 1 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: ACL 2021

  32. arXiv:2104.08721  [pdf, other]

    cs.CL

    Embedding-Enhanced Giza++: Improving Alignment in Low- and High- Resource Scenarios Using Embedding Space Geometry

    Authors: Kelly Marchisio, Conghao Xiong, Philipp Koehn

    Abstract: A popular natural language processing task decades ago, word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods that outperform GIZA++ primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments itself. We introduce Embedding-Enhanced GIZA++, and outperfor…

    Submitted 10 October, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: AMTA2022 Camera Ready

  33. arXiv:2104.08597  [pdf, other]

    cs.CL

    XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment

    Authors: Ahmed El-Kishky, Adithya Renduchintala, James Cross, Francisco Guzmán, Philipp Koehn

    Abstract: Cross-lingual named-entity lexica are an important resource to multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align)…

    Submitted 10 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

  34. arXiv:2104.05824  [pdf, other]

    cs.CL

    Evaluating Saliency Methods for Neural Language Models

    Authors: Shuoyang Ding, Philipp Koehn

    Abstract: Saliency methods are widely used to interpret neural network predictions, but different variants of saliency methods often disagree even on the interpretations of the same prediction made by the same model. In these cases, how do we identify when are these interpretations trustworthy enough to be used in analyses? To address this question, we conduct a comprehensive and quantitative evaluation of…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: 19 pages, 2 figures, Accepted for NAACL 2021

  35. arXiv:2103.06968  [pdf, other]

    cs.CL

    Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to dealing with this problem mainly focus on filtering using heuristics or single features such as language model scores or bi-li…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 10 pages, 2 figures

  36. arXiv:2103.06964  [pdf, other]

    cs.CL

    Learning Policies for Multilingual Training of Neural Machine Translation Systems

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Low-resource Multilingual Neural Machine Translation (MNMT) is typically tasked with improving the translation performance on one or more language pairs with the aid of high-resource language pairs. In this paper, we propose two simple search based curricula -- orderings of the multilingual training data -- which help improve translation performance in conjunction with existing techniques such as…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 7 pages, 2 figures

  37. arXiv:2103.02212  [pdf, other]

    cs.CL

    Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding Transformation

    Authors: Haoran Xu, Philipp Koehn

    Abstract: Linear embedding transformation has been shown to be effective for zero-shot cross-lingual transfer tasks and achieve surprisingly promising results. However, cross-lingual embedding space mapping is usually studied in static word-level embeddings, where a space transformation is derived by aligning representations of translation pairs that are referred from dictionaries. We move further from this…

    Submitted 3 March, 2021; originally announced March 2021.

    Journal ref: Adapt-NLP EACL 2021

  38. arXiv:2011.02048  [pdf, other]

    cs.CL

    SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation

    Authors: Xutai Ma, Juan Pino, Philipp Koehn

    Abstract: Simultaneous text translation and end-to-end speech translation have recently made great progress but little work has combined these tasks together. We investigate how to adapt simultaneous text translation methods such as wait-k and monotonic multihead attention to end-to-end simultaneous speech translation by introducing a pre-decision module. A detailed analysis is provided on the latency-quali…

    Submitted 3 November, 2020; originally announced November 2020.

  39. arXiv:2011.00033  [pdf, other]

    cs.CL

    Streaming Simultaneous Speech Translation with Augmented Memory Transformer

    Authors: Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

    Abstract: Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation…

    Submitted 30 October, 2020; originally announced November 2020.

  40. arXiv:2007.01788  [pdf, ps, other]

    cs.CL cs.DL cs.IR

    TICO-19: the Translation Initiative for Covid-19

    Authors: Antonios Anastasopoulos, Alessandro Cattelan, Zi-Yi Dou, Marcello Federico, Christian Federmann, Dmitriy Genzel, Francisco Guzmán, Junjie Hu, Macduff Hughes, Philipp Koehn, Rosie Lazar, Will Lewis, Graham Neubig, Mengmeng Niu, Alp Öktem, Eric Paquin, Grace Tang, Sylwia Tur

    Abstract: The COVID-19 pandemic is the worst pandemic to strike the world in over a century. Crucial to stemming the tide of the SARS-CoV-2 virus is communicating to vulnerable populations the means by which they can protect themselves. To this end, the collaborators forming the Translation Initiative for COvid-19 (TICO-19) have made test and development data available to AI and MT researchers in 35 differe…

    Submitted 6 July, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

  41. Simulated Multiple Reference Training Improves Low-Resource Machine Translation

    Authors: Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn

    Abstract: Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and…

    Submitted 13 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020 camera ready

  42. arXiv:2004.14523  [pdf, other]

    cs.CL

    Exploiting Sentence Order in Document Alignment

    Authors: Brian Thompson, Philipp Koehn

    Abstract: We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method results in 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. Our method improves downstream MT performance on web-scraped Sinhala--English documents from ParaCrawl, ou…

    Submitted 27 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020

  43. arXiv:2004.05516  [pdf, other]

    cs.CL

    When Does Unsupervised Machine Translation Work?

    Authors: Kelly Marchisio, Kevin Duh, Philipp Koehn

    Abstract: Despite the reported success of unsupervised machine translation (MT), the field has yet to examine the conditions under which these methods succeed, and where they fail. We conduct an extensive empirical evaluation of unsupervised MT using dissimilar language pairs, dissimilar domains, diverse datasets, and authentic low-resource languages. We find that performance rapidly deteriorates when sourc…

    Submitted 18 November, 2020; v1 submitted 11 April, 2020; originally announced April 2020.

    Comments: WMT20 Camera Ready

  44. arXiv:1911.06154  [pdf, other]

    cs.CL cs.LG stat.ML

    CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

    Authors: Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzman, Philipp Koehn

    Abstract: Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs…

    Submitted 11 October, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Comments: EMNLP 2020

  45. arXiv:1906.11943  [pdf, other]

    cs.CL

    Findings of the First Shared Task on Machine Translation Robustness

    Authors: Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, Hassan Sajjad

    Abstract: We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models' robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evalua…

    Submitted 3 July, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

  46. arXiv:1906.10282  [pdf, ps, other]

    cs.CL

    Saliency-driven Word Alignment Interpretation for Neural Machine Translation

    Authors: Shuoyang Ding, Hainan Xu, Philipp Koehn

    Abstract: Despite their original goal to jointly learn to align and translate, Neural Machine Translation (NMT) models, especially Transformer, are often perceived as not learning interpretable word alignments. In this paper, we show that NMT models do learn interpretable word alignments, which could only be revealed with proper interpretation methods. We propose a series of such methods that are model-agno…

    Submitted 27 June, 2019; v1 submitted 24 June, 2019; originally announced June 2019.

    Comments: Accepted to WMT 2019

  47. arXiv:1906.09833  [pdf, other]

    cs.CL cs.AI

    Translationese in Machine Translation Evaluation

    Authors: Yvette Graham, Barry Haddow, Philipp Koehn

    Abstract: The term translationese has been used to describe the presence of unusual features of translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis shows evidence to support differences in text originally written in a given language relative to translated text and this can potentially negatively impac…

    Submitted 24 June, 2019; originally announced June 2019.

    Comments: 17 pages, 8 figures, 9 tables

  48. arXiv:1906.08885  [pdf, other]

    cs.CL

    Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

    Authors: Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, Philipp Koehn

    Abstract: In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy paralle…

    Submitted 20 June, 2019; originally announced June 2019.

    Comments: 6 pages, WMT 2019

    Journal ref: Conference on Machine Translation (WMT) 2019

  49. arXiv:1904.03409  [pdf, other]

    cs.CL

    Parallelizable Stack Long Short-Term Memory

    Authors: Shuoyang Ding, Philipp Koehn

    Abstract: Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also known to be notoriously difficult to parallelize for GPU training due to the fact that the computations are dependent on discrete operations. In this paper, we tackle this problem by utilizing state access patterns of StackLSTM to homogenize comp…

    Submitted 6 April, 2019; originally announced April 2019.

    Comments: Accepted to NAACL 2019 Workshop on Structured Prediction for NLP

  50. arXiv:1902.01382  [pdf, other]

    cs.CL

    The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

    Authors: Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, Marc'Aurelio Ranzato

    Abstract: For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLoRes e…

    Submitted 14 September, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: EMNLP 2019