Search | arXiv e-print repository

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Authors: Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

Abstract: Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training d… ▽ More Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training datasets, and unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise coefficient dot-product attention. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half. △ Less

Submitted 19 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: to be published in Findings of the Association for Computational Linguistics: ACL 2024

arXiv:2404.11726 [pdf, other]

Investigating Gender Bias in Turkish Language Models

Authors: Orhun Caglidil, Malte Ostendorff, Georg Rehm

Abstract: Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applications. However, prior research has primarily focused on the English language, especially in the context of gender bias. In particular, grammatically gender-neutral… ▽ More Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applications. However, prior research has primarily focused on the English language, especially in the context of gender bias. In particular, grammatically gender-neutral languages such as Turkish are underexplored despite representing different linguistic properties to language models with possibly different effects on biases. In this paper, we fill this research gap and investigate the significance of gender bias in Turkish language models. We build upon existing bias evaluation frameworks and extend them to the Turkish language by translating existing English tests and creating new ones designed to measure gender bias in the context of Türkiye. Specifically, we also evaluate Turkish language models for their embedded ethnic bias toward Kurdish people. Based on the experimental results, we attribute possible biases to different model characteristics such as the model size, their multilingualism, and the training corpora. We make the Turkish gender bias dataset publicly available. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:1903.10561 by other authors

arXiv:2404.08443 [pdf, other]

Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph

Authors: Raia Abu Ahmad, Jennifer D'Souza, Matthäus Zloch, Wolfgang Otto, Georg Rehm, Allard Oelen, Stefan Dietze, Sören Auer

Abstract: Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research… ▽ More Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, still need to be made discoverable and, therefore, largely remain unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graoh (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: 8 pages, 1 figure, published in the Joint Proceedings of the Onto4FAIR 2023 Workshops

Journal ref: In Joint Proceedings of the Onto4FAIR 2023 Workshops: Collocated with FOIS 2023 and SEMANTICS 2023. pp.23-31. https://hal.science/hal-04312604

arXiv:2301.09626 [pdf, other]

Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Authors: Malte Ostendorff, Georg Rehm

Abstract: Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As the model sizes grow, the performance gap between English and other languages with fewer compute and data resources increases even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address t… ▽ More Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As the model sizes grow, the performance gap between English and other languages with fewer compute and data resources increases even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language, for which pretrained models are publicly available, like English, to a new target language. As opposed to prior work, which focused on the cross-lingual transfer between two languages, we extend the transfer to the model size. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training a model from scratch, we exploit a smaller model that is in the target language but requires much fewer resources. Both small and source models are then used to initialize the token embeddings of the larger model based on the overlap** vocabulary of the source and target language. All remaining weights are reused from the model in the source language. This approach outperforms the sole cross-lingual transfer and can save up to 80% of the training steps compared to the random initialization. △ Less

Submitted 23 January, 2023; originally announced January 2023.

arXiv:2204.13943 [pdf, other]

User Experience Design for Automatic Credibility Assessment of News Content About COVID-19

Authors: Konstantin Schulz, Jens Rauenbusch, Jan Fillies, Lisa Rutenburg, Dimitrios Karvelas, Georg Rehm

Abstract: The increasingly rapid spread of information about COVID-19 on the web calls for automatic measures of quality assurance. In that context, we check the credibility of news content using selected linguistic features. We present two empirical studies to evaluate the usability of graphical interfaces that offer such credibility assessment. In a moderated qualitative interview with six participants, w… ▽ More The increasingly rapid spread of information about COVID-19 on the web calls for automatic measures of quality assurance. In that context, we check the credibility of news content using selected linguistic features. We present two empirical studies to evaluate the usability of graphical interfaces that offer such credibility assessment. In a moderated qualitative interview with six participants, we identify rating scale, sub-criteria and algorithm authorship as important predictors of the usability. A subsequent quantitative online survey with 50 participants reveals a conflict between transparency and conciseness in the interface design, as well as a perceived hierarchy of metadata: the authorship of a news text is more important than the authorship of the credibility algorithm used to assess the content quality. Finally, we make suggestions for future research, such as proactively documenting credibility-related metadata for Natural Language Processing and Language Technology services and establishing an explicit hierarchical taxonomy of usability predictors for automatic credibility assessment. △ Less

Submitted 29 April, 2022; originally announced April 2022.

Comments: 25 pages, 7 figures, to be published in HCI International 2022 - Late Breaking Papers

MSC Class: 68-04 ACM Class: H.5.2; I.2.7

arXiv:2203.14541 [pdf, other]

Specialized Document Embeddings for Aspect-based Similarity of Research Papers

Authors: Malte Ostendorff, Till Blume, Terry Ruas, Bela Gipp, Georg Rehm

Abstract: Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been… ▽ More Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t.the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit. △ Less

Submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted for publication at JCDL 2022

arXiv:2203.09629 [pdf, other]

HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information

Authors: Qian Ruan, Malte Ostendorff, Georg Rehm

Abstract: Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical stru… ▽ More Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+ model), which improves SOTA ROUGEs for extractive summarization on PubMed and arXiv substantially. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively, which differs from our model only in that the hierarchical structure information is not injected. It is also observed that the more conspicuous hierarchical structure the dataset has, the larger improvements our method gains. The ablation study demonstrates that the hierarchical position information is the main contributor to our model's SOTA performance. △ Less

Submitted 17 March, 2022; originally announced March 2022.

Comments: 17 pages, 3 figures, to be published in Findings ACL 2022

arXiv:2202.06671 [pdf, other]

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

Authors: Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, Georg Rehm

Abstract: Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-i… ▽ More Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain. △ Less

Submitted 19 October, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: Accepted to EMNLP 2022

arXiv:2109.12323 [pdf]

Deep Learning-Based Detection of the Acute Respiratory Distress Syndrome: What Are the Models Learning?

Authors: Gregory B. Rehm, Chao Wang, Irene Cortes-Puch, Chen-Nee Chuah, Jason Adams

Abstract: The acute respiratory distress syndrome (ARDS) is a severe form of hypoxemic respiratory failure with in-hospital mortality of 35-46%. High mortality is thought to be related in part to challenges in making a prompt diagnosis, which may in turn delay implementation of evidence-based therapies. A deep neural network (DNN) algorithm utilizing unbiased ventilator waveform data (VWD) may help to impro… ▽ More The acute respiratory distress syndrome (ARDS) is a severe form of hypoxemic respiratory failure with in-hospital mortality of 35-46%. High mortality is thought to be related in part to challenges in making a prompt diagnosis, which may in turn delay implementation of evidence-based therapies. A deep neural network (DNN) algorithm utilizing unbiased ventilator waveform data (VWD) may help to improve screening for ARDS. We first show that a convolutional neural network-based ARDS detection model can outperform prior work with random forest models in AUC (0.95+/-0.019 vs. 0.88+/-0.064), accuracy (0.84+/-0.026 vs 0.80+/-0.078), and specificity (0.81+/-0.06 vs 0.71+/-0.089). Frequency ablation studies imply that our model can learn features from low frequency domains typically used for expert feature engineering, and high-frequency information that may be difficult to manually featurize. Further experiments suggest that subtle, high-frequency components of physiologic signals may explain the superior performance of DL models over traditional ML when using physiologic waveform data. Our observations may enable improved interpretability of DL-based physiologic models and may improve the understanding of how high-frequency information in physiologic data impacts the performance our DL model. △ Less

Submitted 25 September, 2021; originally announced September 2021.

arXiv:2109.10224 [pdf]

Clinical Validation of Single-Chamber Model-Based Algorithms Used to Estimate Respiratory Compliance

Authors: Gregory Rehm, Jimmy Nguyen, Chelsea Gilbeau, Marc T Bomactao, Chen-Nee Chuah, Jason Adams

Abstract: Non-invasive estimation of respiratory physiology using computational algorithms promises to be a valuable technique for future clinicians to detect detrimental changes in patient pathophysiology. However, few clinical algorithms used to non-invasively analyze lung physiology have undergone rigorous validation in a clinical setting, and are often validated either using mechanical devices, or with… ▽ More Non-invasive estimation of respiratory physiology using computational algorithms promises to be a valuable technique for future clinicians to detect detrimental changes in patient pathophysiology. However, few clinical algorithms used to non-invasively analyze lung physiology have undergone rigorous validation in a clinical setting, and are often validated either using mechanical devices, or with small clinical validation datasets using 2-8 patients. This work aims to improve this situation by first, establishing an open, and clinically validated dataset comprising data from both mechanical lungs and nearly 40,000 breaths from 18 intubated patients. Next, we use this data to evaluate 15 different algorithms that use the "single chamber" model of estimating respiratory compliance. We evaluate these algorithms under varying clinical scenarios patients typically experience during hospitalization. In particular, we explore algorithm performance under four different types of patient ventilator asynchrony. We also analyze algorithms under varying ventilation modes to benchmark algorithm performance and to determine if ventilation mode has any impact on the algorithm. Our approach yields several advances by 1) showing which specific algorithms work best clinically under varying mode and asynchrony scenarios, 2) develo** a simple mathematical method to reduce variance in algorithmic results, and 3) presenting additional insights about single-chamber model algorithms. We hope that our paper, approach, dataset, and software framework can thus be used by future researchers to improve their work and allow future integration of "single chamber" algorithms into clinical practice. △ Less

Submitted 19 September, 2021; originally announced September 2021.

arXiv:2104.13841 [pdf, other]

Evaluating Document Representations for Content-based Legal Literature Recommendations

Authors: Malte Ostendorff, Elliott Ash, Terry Ruas, Bela Gipp, Julian Moreno-Schneider, Georg Rehm

Abstract: Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user study without any public available benchmark datase… ▽ More Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user study without any public available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare in total 27 methods using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations. We make our source code, models, and datasets publicly available at https://github.com/malteos/legal-document-similarity/. △ Less

Submitted 28 April, 2021; originally announced April 2021.

Comments: Accepted for publication at ICAIL 2021

arXiv:2010.06395 [pdf, other]

Aspect-based Document Similarity for Research Papers

Authors: Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm

Abstract: Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classifi… ▽ More Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity for research papers. Paper citations indicate the aspect-based similarity, i.e., the section title in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show SciBERT as the best performing system. A qualitative examination validates our quantitative results. Our findings motivate future research of aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: Accepted for publication at COLING 2020

arXiv:2009.00345 [pdf, other]

Multi-Array Electron Beam Stabilization using Block-Circulant Transformation and Generalized Singular Value Decomposition

Authors: Idris Kempf, Stephen R. Duncan, Paul J. Goulart, Guenther Rehm

Abstract: We introduce a novel structured controller design for the electron beam stabilization problem of the UK's national synchrotron light source. Because changes to the synchrotron will not allow the application of existing control approaches, we develop a novel method to diagonalize the multi-input multi-output (MIMO) system. A generalized singular value decomposition (GSVD) is used to simultaneously… ▽ More We introduce a novel structured controller design for the electron beam stabilization problem of the UK's national synchrotron light source. Because changes to the synchrotron will not allow the application of existing control approaches, we develop a novel method to diagonalize the multi-input multi-output (MIMO) system. A generalized singular value decomposition (GSVD) is used to simultaneously diagonalize the actuator response matrices, which is applicable to an arbitrary number of actuator dynamics in a cross-directional setting. The resulting decoupled systems are regulated using mid-ranged control and the controller gains derived as a function of the generalized singular values. In addition, we exploit the inherent block-circulant symmetry of the system. The performance of our controller is demonstrated using simulations that involve machine data. △ Less

Submitted 1 September, 2020; originally announced September 2020.

arXiv:2008.13428 [pdf, ps, other]

doi 10.1109/TNS.2021.3052553

Symmetry Exploitation in Orbit Feedback Systems of Synchrotron Storage Rings

Authors: Idris Kempf, Paul J. Goulart, Stephen R. Duncan, Guenther Rehm

Abstract: Structural symmetries in the storage ring of synchrotrons are intentionally created during the design phase of the magnetic lattices, but they are not considered in the design of control algorithms that stabilize the beam of accelerated particles. The choice of control algorithm, however, is limited by the speed requirements of the synchrotron. Standard control algorithms for synchrotrons are base… ▽ More Structural symmetries in the storage ring of synchrotrons are intentionally created during the design phase of the magnetic lattices, but they are not considered in the design of control algorithms that stabilize the beam of accelerated particles. The choice of control algorithm, however, is limited by the speed requirements of the synchrotron. Standard control algorithms for synchrotrons are based on a singular value decomposition (SVD) of the orbit response matrix. SVD controllers neither exploit the structural symmetries nor exhibit any speed advantages. Based on the periodicity and the reflection properties of the betatron function, we show that these structural symmetries are inherited by the orbit response matrix. We show that the resulting block-circulant and centrosymmetric properties of the matrix can be used for different computationally efficient decompositions of the controller. We also address the case of broken symmetry due to odd placements of magnets and monitors. Our efficient decomposition could enable the use of more advanced control techniques for synchrotrons, such as control algorithms that require real-time optimization. These advanced control techniques could in turn increase the quality of research in synchrotron light sources. △ Less

Submitted 31 August, 2020; originally announced August 2020.

arXiv:2004.14130 [pdf, other]

A Workflow Manager for Complex NLP and Content Curation Pipelines

Authors: Julián Moreno-Schneider, Peter Bourgonje, Florian Kintzel, Georg Rehm

Abstract: We present a workflow manager for the flexible creation and customisation of NLP processing pipelines. The workflow manager addresses challenges in interoperability across various different NLP tasks and hardware-based resource usage. Based on the four key principles of generality, flexibility, scalability and efficiency, we present the first version of the workflow manager by providing details on… ▽ More We present a workflow manager for the flexible creation and customisation of NLP processing pipelines. The workflow manager addresses challenges in interoperability across various different NLP tasks and hardware-based resource usage. Based on the four key principles of generality, flexibility, scalability and efficiency, we present the first version of the workflow manager by providing details on its custom definition language, explaining the communication components and the general system architecture and setup. We currently implement the system, which is grounded and motivated by real-world industry use cases in several innovation and transfer projects. △ Less

Submitted 16 April, 2020; originally announced April 2020.

Comments: Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020). To appear

arXiv:2004.12195 [pdf, other]

QURATOR: Innovative Technologies for Content and Data Curation

Authors: Georg Rehm, Peter Bourgonje, Stefanie Hegele, Florian Kintzel, Julián Moreno Schneider, Malte Ostendorff, Karolina Zaczynska, Armin Berger, Stefan Grill, Sören Räuchle, Jens Rauenbusch, Lisa Rutenburg, André Schmidt, Mikka Wild, Henry Hoffmann, Julian Fink, Sarah Schulz, Jurica Seva, Joachim Quantz, Joachim Böttger, Josefine Matthey, Rolf Fricke, Jan Thomsen, Adrian Paschke, Jamal Al Qundus , et al. (15 additional authors not shown)

Abstract: In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. The availability of vast amounts of content and the pressure to publish new content quickly and in rapid succession requires faster, more efficient and smarter processing and generation methods. With a consortium of ten partners from research and industr… ▽ More In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. The availability of vast amounts of content and the pressure to publish new content quickly and in rapid succession requires faster, more efficient and smarter processing and generation methods. With a consortium of ten partners from research and industry and a broad range of expertise in AI, Machine Learning and Language Technologies, the QURATOR project, funded by the German Federal Ministry of Education and Research, develops a sustainable and innovative technology platform that provides services to support knowledge workers in various industries to address the challenges they face when curating digital content. The project's vision and ambition is to establish an ecosystem for content curation technologies that significantly pushes the current state of the art and transforms its region, the metropolitan area Berlin-Brandenburg, into a global centre of excellence for curation technologies. △ Less

Submitted 25 April, 2020; originally announced April 2020.

Comments: Proceedings of QURATOR 2020: The conference for intelligent content solutions, Berlin, Germany, February 2020

arXiv:2004.12190 [pdf, other]

Towards Discourse Parsing-inspired Semantic Storytelling

Authors: Georg Rehm, Karolina Zaczynska, Julián Moreno-Schneider, Malte Ostendorff, Peter Bourgonje, Maria Berger, Jens Rauenbusch, André Schmidt, Mikka Wild

Abstract: Previous work of ours on Semantic Storytelling uses text analytics procedures including Named Entity Recognition and Event Detection. In this paper, we outline our longer-term vision on Semantic Storytelling and describe the current conceptual and technical approach. In the project that drives our research we develop AI-based technologies that are verified by partners from industry. One long-term… ▽ More Previous work of ours on Semantic Storytelling uses text analytics procedures including Named Entity Recognition and Event Detection. In this paper, we outline our longer-term vision on Semantic Storytelling and describe the current conceptual and technical approach. In the project that drives our research we develop AI-based technologies that are verified by partners from industry. One long-term goal is the development of an approach for Semantic Storytelling that has broad coverage and that is, furthermore, robust. We provide first results on experiments that involve discourse parsing, applied to a concrete use case, "Explore the Neighbourhood!", which is based on a semi-automatically collected data set with documents about noteworthy people in one of Berlin's districts. Though automatically obtaining annotations for coherence relations from plain text is a non-trivial challenge, our preliminary results are promising. We envision our approach to be combined with additional features (NER, coreference resolution, knowledge graphs △ Less

Submitted 25 April, 2020; originally announced April 2020.

Comments: Proceedings of QURATOR 2020: The conference for intelligent content solutions, Berlin, Germany, February 2020

arXiv:2004.10283 [pdf, other]

Observations on Annotations

Authors: Georg Rehm

Abstract: The annotation of textual information is a fundamental activity in Linguistics and Computational Linguistics. This article presents various observations on annotations. It approaches the topic from several angles including Hypertext, Computational Linguistics and Language Technology, Artificial Intelligence and Open Science. Annotations can be examined along different dimensions. In terms of compl… ▽ More The annotation of textual information is a fundamental activity in Linguistics and Computational Linguistics. This article presents various observations on annotations. It approaches the topic from several angles including Hypertext, Computational Linguistics and Language Technology, Artificial Intelligence and Open Science. Annotations can be examined along different dimensions. In terms of complexity, they can range from trivial to highly sophisticated, in terms of maturity from experimental to standardised. Annotations can be annotated themselves using more abstract annotations. Primary research data such as, e.g., text documents can be annotated on different layers concurrently, which are independent but can be exploited using multi-layer querying. Standards guarantee interoperability and reusability of data sets. The chapter concludes with four final observations, formulated as research questions or rather provocative remarks on the current state of annotation research. △ Less

Submitted 21 April, 2020; originally announced April 2020.

Comments: To be published in: Annotations in Scholarly Editions and Research: Functions, Differentiation, Systematization (2020), Julia Nantke and Frederik Schlupkothen (editors). De Gruyter. In print

arXiv:2004.08355 [pdf]

Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability

Authors: Georg Rehm, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Rodríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina Bontcheva, Ian Roberts, Andrejs Vasiljevs, Andis Lagzdiņš

Abstract: With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the a… ▽ More With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER. △ Less

Submitted 17 April, 2020; originally announced April 2020.

Comments: Proceedings of the 1st International Workshop on Language Technology Platforms (IWLTP 2020). To appear

arXiv:2003.13833 [pdf]

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

Authors: Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin , et al. (22 additional authors not shown)

Abstract: Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitu… ▽ More Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions. △ Less

Submitted 30 March, 2020; originally announced March 2020.