Showing 1–11 of 11 results for author: Thulke, D

  1. arXiv:2401.09646  [pdf, other]

    cs.LG cs.AI cs.CL

    ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change

    Authors: David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Rricha Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldstein, Taylor Tragemann, Katie Nguyen, Ariana Fowler, Andrew Stanco, Jon Gabriel, Jordan Taylor, Dean Moro, Evgenii Tsymbalov, Juliette de Waal, Evgeny Matusov, Mudar Yaghi, Mohammad Shihadah, Hermann Ney, Christian Dugast, Jonathan Dotan, et al. (1 additional author not shown)

    Abstract: This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens. For the first model, the 4.2B domain-specific tokens were included during pre-training and the second was adapted to the climate domain after pre-training. Addition…

    Submitted 17 January, 2024; originally announced January 2024.

  2. arXiv:2307.01310  [pdf, other]

    cs.CL

    Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective

    Authors: Moncef Benaicha, David Thulke, M. A. Tuğtekin Turan

    Abstract: Recent advancements in Named Entity Recognition (NER) have significantly improved the identification of entities in textual data. However, spoken NER, a specialized field of spoken document retrieval, lags behind due to its limited research and scarce datasets. Moreover, cross-lingual transfer learning in spoken NER has remained unexplored. This paper utilizes transfer learning across Dutch, Engli…

    Submitted 3 July, 2023; originally announced July 2023.

  3. arXiv:2304.07101  [pdf, other]

    cs.CL cs.AI cs.LG

    Task-oriented Document-Grounded Dialog Systems by HLTPR@RWTH for DSTC9 and DSTC10

    Authors: David Thulke, Nico Daheim, Christian Dugast, Hermann Ney

    Abstract: This paper summarizes our contributions to the document-grounded dialog tasks at the 9th and 10th Dialog System Technology Challenges (DSTC9 and DSTC10). In both iterations the task consists of three subtasks: first detect whether the current turn is knowledge seeking, second select a relevant knowledge document, and third generate a response grounded on the selected document. For DSTC9 we propose…

    Submitted 14 April, 2023; originally announced April 2023.

    Comments: Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2112.08844

  4. arXiv:2211.04898  [pdf, other]

    cs.CL cs.AI

    Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

    Authors: Baohao Liao, David Thulke, Sanjika Hewavitharana, Hermann Ney, Christof Monz

    Abstract: The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a la…

    Submitted 15 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Code available at: https://github.com/BaohaoLiao/3ml

  5. arXiv:2210.17418  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model

    Authors: Nico Daheim, David Thulke, Christian Dugast, Hermann Ney

    Abstract: In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes theorem. One component is a traditional ungrounded response generation model and the other component models the reconstruction of the grounding document based on the dialog context and generated response. We propose different approximate decoding schemes an…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted to Findings of EMNLP 2022

  6. arXiv:2210.13700  [pdf, other]

    eess.AS cs.CL cs.LG

    Does Joint Training Really Help Cascaded Speech Translation?

    Authors: Viet Anh Khoa Tran, David Thulke, Yingbo Gao, Christian Herold, Hermann Ney

    Abstract: Currently, in speech translation, the straightforward approach - cascading a recognition system with a translation system - delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, recently, people turn their attention to direct data and propose various joint training methods.…

    Submitted 24 November, 2022; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022

  7. arXiv:2112.08844  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Adapting Document-Grounded Dialog Systems to Spoken Conversations using Data Augmentation and a Noisy Channel Model

    Authors: David Thulke, Nico Daheim, Christian Dugast, Hermann Ney

    Abstract: This paper summarizes our submission to Task 2 of the second track of the 10th Dialog System Technology Challenge (DSTC10) "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations". Similar to the previous year's iteration, the task consists of three subtasks: detecting whether a turn is knowledge seeking, selecting the relevant knowledge document and finally generating a ground…

    Submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted to the DSTC10 workshop at AAAI 2022

  8. Investigation on Data Adaptation Techniques for Neural Named Entity Recognition

    Authors: Evgeniia Tokarchuk, David Thulke, Weiyue Wang, Christian Dugast, Hermann Ney

    Abstract: Data processing is an important step in various natural language processing tasks. As the commonly used datasets in named entity recognition contain only a limited number of samples, it is important to obtain additional labeled data in an efficient and reliable manner. A common practice is to utilize large monolingual unlabeled corpora. Another popular technique is to create synthetic data from th…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: ACL SRW 2021 - camera ready

  9. arXiv:2106.07275  [pdf, other]

    cs.CL cs.AI cs.LG

    Cascaded Span Extraction and Response Generation for Document-Grounded Dialog

    Authors: Nico Daheim, David Thulke, Christian Dugast, Hermann Ney

    Abstract: This paper summarizes our entries to both subtasks of the first DialDoc shared task which focuses on the agent response prediction task in goal-oriented document-grounded dialogs. The task is split into two subtasks: predicting a span in a document that grounds an agent turn and generating an agent response based on a dialog and grounding document. In the first subtask, we restrict the set of vali…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted by 1st DialDoc Workshop at ACL-IJCNLP 2021

  10. arXiv:2104.10507  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    On Sampling-Based Training Criteria for Neural Language Modeling

    Authors: Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney

    Abstract: As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice about the current landscape of such sampling methods is the lack o…

    Submitted 17 June, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

    Comments: Accepted at INTERSPEECH 2021

  11. arXiv:2102.04643  [pdf, ps, other]

    cs.CL

    Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog

    Authors: David Thulke, Nico Daheim, Christian Dugast, Hermann Ney

    Abstract: This paper summarizes our work on the first track of the ninth Dialog System Technology Challenge (DSTC 9), "Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access". The goal of the task is to generate responses to user turns in a task-oriented dialog that require knowledge from unstructured documents. The task is divided into three subtasks: detection, select…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted by DSTC9 Workshop at AAAI-2021