Search | arXiv e-print repository

German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data

Authors: Lars Klöser, Mika Beele, Jan-Niklas Schagen, Bodo Kraft

Abstract: This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the challenge of data scarcity in language simplification, we crawled professionally simplified German texts and synthesized a corpus using GPT-4. We finetune Large… ▽ More This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the challenge of data scarcity in language simplification, we crawled professionally simplified German texts and synthesized a corpus using GPT-4. We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance. This paper employs various methodologies for evaluation and demonstrates the limitations of currently used rule-based metrics. Both automatic and manual evaluations reveal that our models can significantly simplify real-world online texts, indicating the potential of synthetic data in improving text simplification. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted at Fourth Workshop on Language Technology for Equality, Diversity, Inclusion - EACL 2024

ACM Class: I.2.7

arXiv:2308.02537 [pdf, other]

doi 10.1007/978-3-031-39059-3_16

ALE: A Simulation-Based Active Learning Evaluation Framework for the Parameter-Driven Comparison of Query Strategies for NLP

Authors: Philipp Kohl, Nils Freyer, Yoka Krämer, Henri Werth, Steffen Wolf, Bodo Kraft, Matthias Meinecke, Albert Zündorf

Abstract: Supervised machine learning and deep learning require a large amount of labeled data, which data scientists obtain in a manual, and time-consuming annotation process. To mitigate this challenge, Active Learning (AL) proposes promising data points to annotators they annotate next instead of a subsequent or random sample. This method is supposed to save annotation effort while maintaining model perf… ▽ More Supervised machine learning and deep learning require a large amount of labeled data, which data scientists obtain in a manual, and time-consuming annotation process. To mitigate this challenge, Active Learning (AL) proposes promising data points to annotators they annotate next instead of a subsequent or random sample. This method is supposed to save annotation effort while maintaining model performance. However, practitioners face many AL strategies for different tasks and need an empirical basis to choose between them. Surveys categorize AL strategies into taxonomies without performance indications. Presentations of novel AL strategies compare the performance to a small subset of strategies. Our contribution addresses the empirical basis by introducing a reproducible active learning evaluation (ALE) framework for the comparative evaluation of AL strategies in NLP. The framework allows the implementation of AL strategies with low effort and a fair data-driven comparison through defining and tracking experiment parameters (e.g., initial dataset size, number of data points per query step, and the budget). ALE helps practitioners to make more informed decisions, and researchers can focus on develo** new, effective AL strategies and deriving best practices for specific use cases. With best practices, practitioners can lower their annotation costs. We present a case study to illustrate how to use the framework. △ Less

Submitted 1 August, 2023; originally announced August 2023.

Comments: The Version of Record of this contribution is published in Deep Learning Theory and Applications 4th International Conference, DeLTA 2023 Proceedings, and is available online at https://doi.org/10.1007/978-3-031-39059-3_16

Journal ref: Conte, D., Fred, A., Gusikhin, O., Sansone, C. (eds) Deep Learning Theory and Applications. DeLTA 2023. Communications in Computer and Information Science, vol 1875. Springer, Cham

arXiv:2308.02193 [pdf, other]

doi 10.1007/978-3-031-39059-3_13

Explaining Relation Classification Models with Semantic Extents

Authors: Lars Klöser, Andre Büsgen, Philipp Kohl, Bodo Kraft, Albert Zündorf

Abstract: In recent years, the development of large pretrained language models, such as BERT and GPT, significantly improved information extraction systems on various tasks, including relation classification. State-of-the-art systems are highly accurate on scientific benchmarks. A lack of explainability is currently a complicating factor in many real-world applications. Comprehensible systems are necessary… ▽ More In recent years, the development of large pretrained language models, such as BERT and GPT, significantly improved information extraction systems on various tasks, including relation classification. State-of-the-art systems are highly accurate on scientific benchmarks. A lack of explainability is currently a complicating factor in many real-world applications. Comprehensible systems are necessary to prevent biased, counterintuitive, or harmful decisions. We introduce semantic extents, a concept to analyze decision patterns for the relation classification task. Semantic extents are the most influential parts of texts concerning classification decisions. Our definition allows similar procedures to determine semantic extents for humans and models. We provide an annotation tool and a software framework to determine semantic extents for humans and models conveniently and reproducibly. Comparing both reveals that models tend to learn shortcut patterns from data. These patterns are hard to detect with current interpretability methods, such as input reductions. Our approach can help detect and eliminate spurious decision patterns during model development. Semantic extents can increase the reliability and security of natural language processing systems. Semantic extents are an essential step in enabling applications in critical areas like healthcare or finance. Moreover, our work opens new research directions for develo** methods to explain deep learning models. △ Less

Submitted 4 August, 2023; originally announced August 2023.

Comments: Accepted at DeLTA 2023: Deep Learning Theory and Applications conference

ACM Class: I.2.7

arXiv:2111.09035 [pdf, other]

doi 10.5220/0010559201480156

Multi-Attribute Relation Extraction (MARE) -- Simplifying the Application of Relation Extraction

Authors: Lars Klöser, Philipp Kohl, Bodo Kraft, Albert Zündorf

Abstract: Natural language understanding's relation extraction makes innovative and encouraging novel business concepts possible and facilitates new digitilized decision-making processes. Current approaches allow the extraction of relations with a fixed number of entities as attributes. Extracting relations with an arbitrary amount of attributes requires complex systems and costly relation-trigger annotatio… ▽ More Natural language understanding's relation extraction makes innovative and encouraging novel business concepts possible and facilitates new digitilized decision-making processes. Current approaches allow the extraction of relations with a fixed number of entities as attributes. Extracting relations with an arbitrary amount of attributes requires complex systems and costly relation-trigger annotations to assist these systems. We introduce multi-attribute relation extraction (MARE) as an assumption-less problem formulation with two approaches, facilitating an explicit map** from business use cases to the data annotations. Avoiding elaborated annotation constraints simplifies the application of relation extraction approaches. The evaluation compares our models to current state-of-the-art event extraction and binary relation extraction methods. Our approaches show improvement compared to these on the extraction of general multi-attribute relations. △ Less

Submitted 17 November, 2021; originally announced November 2021.

Comments: Preprint of short paper for the 2nd International Conference on Deep Learning Theory and Applications (2021)

Journal ref: Proceedings of the 2nd International Conference on Deep Learning Theory and Applications, Vol. 1, (2021), P. 148 - 156

arXiv:2111.08408 [pdf, other]

doi 10.1007/978-3-030-85347-1_12

STAMP 4 NLP -- An Agile Framework for Rapid Quality-Driven NLP Applications Development

Authors: Philipp Kohl, Oliver Schmidts, Lars Klöser, Henri Werth, Bodo Kraft, Albert Zündorf

Abstract: The progress in natural language processing (NLP) research over the last years, offers novel business opportunities for companies, as automated user interaction or improved data analysis. Building sophisticated NLP applications requires dealing with modern machine learning (ML) technologies, which impedes enterprises from establishing successful NLP projects. Our experience in applied NLP research… ▽ More The progress in natural language processing (NLP) research over the last years, offers novel business opportunities for companies, as automated user interaction or improved data analysis. Building sophisticated NLP applications requires dealing with modern machine learning (ML) technologies, which impedes enterprises from establishing successful NLP projects. Our experience in applied NLP research projects shows that the continuous integration of research prototypes in production-like environments with quality assurance builds trust in the software and shows convenience and usefulness regarding the business goal. We introduce STAMP 4 NLP as an iterative and incremental process model for develo** NLP applications. With STAMP 4 NLP, we merge software engineering principles with best practices from data science. Instantiating our process model allows efficiently creating prototypes by utilizing templates, conventions, and implementations, enabling developers and data scientists to focus on the business goals. Due to our iterative-incremental approach, businesses can deploy an enhanced version of the prototype to their software environment after every iteration, maximizing potential business value and trust early and avoiding the cost of successful yet never deployed experiments. △ Less

Submitted 16 November, 2021; originally announced November 2021.

Comments: Preprint of short paper for QUATIC 2021 conference

Journal ref: Quality of Information and Communications Technology, 2021, p. 156-166

arXiv:1909.10296 [pdf, other]

Predicting Landscapes from Environmental Conditions Using Generative Networks

Authors: Christian Requena-Mesa, Markus Reichstein, Miguel Mahecha, Basil Kraft, Joachim Denzler

Abstract: Landscapes are meaningful ecological units that strongly depend on the environmental conditions. Such dependencies between landscapes and the environment have been noted since the beginning of Earth sciences and cast into conceptual models describing the interdependencies of climate, geology, vegetation and geomorphology. Here, we ask whether landscapes, as seen from space, can be statistically pr… ▽ More Landscapes are meaningful ecological units that strongly depend on the environmental conditions. Such dependencies between landscapes and the environment have been noted since the beginning of Earth sciences and cast into conceptual models describing the interdependencies of climate, geology, vegetation and geomorphology. Here, we ask whether landscapes, as seen from space, can be statistically predicted from pertinent environmental conditions. To this end we adapted a deep learning generative model in order to establish the relationship between the environmental conditions and the view of landscapes from the Sentinel-2 satellite. We trained a conditional generative adversarial network to generate multispectral imagery given a set of climatic, terrain and anthropogenic predictors. The generated imagery of the landscapes share many characteristics with the real one. Results based on landscape patch metrics, indicative of landscape composition and structure, show that the proposed generative model creates landscapes that are more similar to the targets than the baseline models while overall reflectance and vegetation cover are predicted better. We demonstrate that for many purposes the generated landscapes behave as real with immediate application for global change studies. We envision the application of machine learning as a tool to forecast the effects of climate change on the spatial features of landscapes, while we assess its limitations and breaking points. △ Less

Submitted 23 September, 2019; originally announced September 2019.

Comments: Accepted conference paper at GCPR2019

Showing 1–6 of 6 results for author: Kraft, B