Search | arXiv e-print repository

doi 10.1145/3544558

A Survey on Data Augmentation for Text Classification

Authors: Markus Bayer, Marc-André Kaufhold, Christian Reuter

Abstract: Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy for existing works, this survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Derived from the taxonomy, we divide more than 100 methods into 12 different groups and give state-of-the-art references expounding which methods are highly promising by relating them to each other. Finally, research perspectives that may constitute a building block for future work are provided.

Submitted 8 September, 2022; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: 44 pages, 5 figures, 9 tables

Journal ref: ACM Computing Surveys (2022)

arXiv:2103.14453 [pdf, other]

doi 10.1007/s13042-022-01553-3

Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers

Authors: Markus Bayer, Marc-André Kaufhold, Björn Buchhold, Marcel Keller, Jörg Dallmeyer, Christian Reuter

Abstract: In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.

Submitted 22 July, 2022; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: 17 pages, 3 figures, 5 tables

Journal ref: International Journal of Machine Learning and Cybernetics (2022)

arXiv:2005.04910 [pdf]

doi 10.1016/j.ijdrr.2020.101598

Empirical Insights for Designing Information and Communication Technology for International Disaster Response

Authors: Milan Stute, Max Maass, Tom Schons, Marc-André Kaufhold, Christian Reuter, Matthias Hollick

Abstract: Due to the increase in natural disasters in the past years, Disaster Response Organizations (DROs) are faced with the challenge of coping with more and larger operations. Currently appointed Information and Communications Technology (ICT) used for coordination and communication is sometimes outdated and does not scale, while novel technologies have the potential to greatly improve disaster response efficiency. To allow adoption of these novel technologies, ICT system designers have to take into account the particular needs of DROs and characteristics of International Disaster Response (IDR). This work attempts to bring the humanitarian and ICT communities closer together. In this work, we analyze IDR-related documents and conduct expert interviews. Using open coding, we extract empirical insights and translate the peculiarities of DRO coordination and operation into tangible ICT design requirements. This information is based on interviews with active IDR staff as well as DRO guidelines and reports. Ultimately, the goal of this paper is to serve as a reference for future ICT research endeavors to support and increase the efficiency of IDR operations.

Submitted 11 May, 2020; originally announced May 2020.

Journal ref: International Journal of Disaster Risk Reduction, Volume 47, August 2020, 101598

arXiv:1907.07725 [pdf]

Cross-Media Usage of Social Big Data for Emergency Services and Volunteer Communities: Approaches, Development and Challenges of Multi-Platform Social Media Services

Authors: Marc-André Kaufhold, Christian Reuter, Thomas Ludwig

Abstract: The use of social media is ubiquitous and nowadays well-established in our everyday life, but increasingly also before, during or after emergencies. The produced data is spread across several types of social media and can be used by different actors, such as emergency services or volunteer communities. There are already systems available that support the process of gathering, analysing and distributing information through social media. However, depending on the goal of analysis, the analysis methods and available systems are limited based on technical or business-oriented restrictions. This paper presents the design of a cross-platform Social Media API, which was integrated and evaluated within multiple emergency scenarios. Based on the lessons learned, we outline the core challenges from the practical development and theoretical findings, focusing on (1) cross-platform gathering and data management, (2) trustability and information quality, (3) tailorability and adjustable data operations, and (4) queries, performance, and technical development.

Submitted 17 July, 2019; originally announced July 2019.

Showing 1–4 of 4 results for author: Kaufhold, M