Showing 1–10 of 10 results for author: Botha, J A

  1. arXiv:2211.00142 [pdf, other]

    cs.CL cs.LG

    TaTa: A Multilingual Table-to-Text Dataset for African Languages

    Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera

    Abstract: Existing data-to-text generation datasets are mostly limited to English. To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages. We created TaTa by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional tra…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: 24 pages, 6 figures

  2. arXiv:2210.00193 [pdf, other]

    cs.CL

    FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

    Authors: Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, Noah Constant

    Abstract: We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distr…

    Submitted 3 October, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

    Comments: Published in TACL Vol. 11 (2023)

  3. arXiv:2106.07352 [pdf, other]

    cs.IR cs.CL cs.LG cs.SI

    MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network

    Authors: Nicholas FitzGerald, Jan A. Botha, Daniel Gillick, Daniel M. Bikel, Tom Kwiatkowski, Andrew McCallum

    Abstract: We present an instance-based nearest neighbor approach to entity linking. In contrast to most prior entity retrieval systems which represent each entity with a single vector, we build a contextualized mention-encoder that learns to place similar mentions of the same entity closer in vector space than mentions of different entities. This approach allows all mentions of an entity to serve as "class…

    Submitted 22 July, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL 2021, edit to add missing Turkish results in Tables 2 and 7

  4. arXiv:2011.02690 [pdf, other]

    cs.CL cs.IR

    Entity Linking in 100 Languages

    Authors: Jan A. Botha, Zifei Shan, Daniel Gillick

    Abstract: We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: 13 pages, 3 figures, 8 tables; published at EMNLP 2020

    ACM Class: I.2.7; H.3.3

  5. arXiv:2004.14513 [pdf, other]

    cs.CL

    Asking without Telling: Exploring Latent Ontologies in Contextual Representations

    Authors: Julian Michael, Jan A. Botha, Ian Tenney

    Abstract: The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods th…

    Submitted 8 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: 21 pages, 8 figures, 11 tables. Published in EMNLP 2020

    ACM Class: I.2.7

  6. arXiv:1808.09468 [pdf, ps, other]

    cs.CL

    Learning To Split and Rephrase From Wikipedia Edit History

    Authors: Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, Dipanjan Das

    Abstract: Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan…

    Submitted 28 August, 2018; originally announced August 2018.

    Journal ref: Proc. of EMNLP 2018

  7. arXiv:1708.00214 [pdf, other]

    cs.CL cs.NE

    Natural Language Processing with Small Feed-Forward Networks

    Authors: Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, Slav Petrov

    Abstract: We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural…

    Submitted 1 August, 2017; originally announced August 2017.

    Comments: EMNLP 2017 short paper

    MSC Class: 68T50

    ACM Class: I.2.7

  8. arXiv:1606.04279 [pdf, other]

    cs.CL

    Cross-Lingual Morphological Tagging for Low-Resource Languages

    Authors: Jan Buys, Jan A. Botha

    Abstract: Morphologically rich languages often lack the annotated linguistic resources required to develop accurate natural language processing tools. We propose models suitable for training morphological taggers with rich tagsets for low-resource languages without using direct supervision. Our approach extends existing approaches of projecting part-of-speech tags across languages, using bitext to infer con…

    Submitted 14 June, 2016; originally announced June 2016.

    Comments: 11 pages. ACL 2016

  9. arXiv:1508.04271 [pdf, other]

    cs.CL

    Probabilistic Modelling of Morphologically Rich Languages

    Authors: Jan A. Botha

    Abstract: This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich in…

    Submitted 18 August, 2015; originally announced August 2015.

    Comments: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c7

    ACM Class: I.2.7; I.2.6

  10. arXiv:1405.4273 [pdf, other]

    cs.CL

    Compositional Morphology for Word Representations and Language Modelling

    Authors: Jan A. Botha, Phil Blunsom

    Abstract: This paper presents a scalable method for integrating compositional morphological representations into a vector-based probabilistic language model. Our approach is evaluated in the context of log-bilinear language models, rendered suitably efficient for implementation inside a machine translation decoder by factoring the vocabulary. We perform both intrinsic and extrinsic evaluations, presenting r…

    Submitted 16 May, 2014; originally announced May 2014.

    Comments: Proceedings of the 31st International Conference on Machine Learning (ICML)

    MSC Class: 68T50

    ACM Class: I.2.7; I.2.6