-
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Authors:
Martin Courtois,
Malte Ostendorff,
Leonhard Hennig,
Georg Rehm
Abstract:
Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training datasets, and unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise coefficient dot-product attention. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half.
Submitted 19 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
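The abstract above does not spell out the proposed compatibility function, so the following is only a minimal sketch of one way to make dot-product attention scores symmetric: sharing a single projection for queries and keys. The shared matrix `w_shared`, the helper names, and the toy dimensions are all assumptions made for illustration; the pairwise-coefficient variant mentioned in the abstract is not reproduced here.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def symmetric_scores(x, w_shared):
    """Scaled dot-product scores with one projection used for both queries and keys.

    Assumption: sharing the projection is an illustrative choice, not necessarily
    the formulation used in the paper. It makes the score matrix symmetric.
    """
    projected = x @ w_shared
    return projected @ projected.T / np.sqrt(w_shared.shape[1])


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, hidden size 8 (toy values)
w_shared = rng.normal(size=(8, 4))   # hypothetical shared query/key projection
w_value = rng.normal(size=(8, 4))    # separate value projection, as in standard attention

scores = symmetric_scores(x, w_shared)
output = softmax(scores) @ (x @ w_value)
```

Dropping the separate key projection removes one weight matrix per attention head, which is the kind of parameter saving the abstract refers to; note that the attention weights after the row-wise softmax are generally no longer symmetric.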
-
Investigating Gender Bias in Turkish Language Models
Authors:
Orhun Caglidil,
Malte Ostendorff,
Georg Rehm
Abstract:
Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applications. However, prior research has primarily focused on the English language, especially in the context of gender bias. In particular, grammatically gender-neutral languages such as Turkish are underexplored despite representing different linguistic properties to language models with possibly different effects on biases. In this paper, we fill this research gap and investigate the significance of gender bias in Turkish language models. We build upon existing bias evaluation frameworks and extend them to the Turkish language by translating existing English tests and creating new ones designed to measure gender bias in the context of Türkiye. Specifically, we also evaluate Turkish language models for their embedded ethnic bias toward Kurdish people. Based on the experimental results, we attribute possible biases to different model characteristics such as the model size, their multilingualism, and the training corpora. We make the Turkish gender bias dataset publicly available.
Submitted 17 April, 2024;
originally announced April 2024.
-
Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph
Authors:
Raia Abu Ahmad,
Jennifer D'Souza,
Matthäus Zloch,
Wolfgang Otto,
Georg Rehm,
Allard Oelen,
Stefan Dietze,
Sören Auer
Abstract:
Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, still need to be made discoverable and, therefore, largely remain unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graph (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction.
Submitted 12 April, 2024;
originally announced April 2024.
-
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Authors:
Malte Ostendorff,
Georg Rehm
Abstract:
Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As the model sizes grow, the performance gap between English and other languages with fewer compute and data resources increases even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language, for which pretrained models are publicly available, like English, to a new target language. As opposed to prior work, which focused on the cross-lingual transfer between two languages, we extend the transfer to the model size. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training a model from scratch, we exploit a smaller model that is in the target language but requires much fewer resources. Both small and source models are then used to initialize the token embeddings of the larger model based on the overlapping vocabulary of the source and target language. All remaining weights are reused from the model in the source language. This approach outperforms the sole cross-lingual transfer and can save up to 80% of the training steps compared to the random initialization.
Submitted 23 January, 2023;
originally announced January 2023.
-
User Experience Design for Automatic Credibility Assessment of News Content About COVID-19
Authors:
Konstantin Schulz,
Jens Rauenbusch,
Jan Fillies,
Lisa Rutenburg,
Dimitrios Karvelas,
Georg Rehm
Abstract:
The increasingly rapid spread of information about COVID-19 on the web calls for automatic measures of quality assurance. In that context, we check the credibility of news content using selected linguistic features. We present two empirical studies to evaluate the usability of graphical interfaces that offer such credibility assessment. In a moderated qualitative interview with six participants, we identify rating scale, sub-criteria and algorithm authorship as important predictors of the usability. A subsequent quantitative online survey with 50 participants reveals a conflict between transparency and conciseness in the interface design, as well as a perceived hierarchy of metadata: the authorship of a news text is more important than the authorship of the credibility algorithm used to assess the content quality. Finally, we make suggestions for future research, such as proactively documenting credibility-related metadata for Natural Language Processing and Language Technology services and establishing an explicit hierarchical taxonomy of usability predictors for automatic credibility assessment.
Submitted 29 April, 2022;
originally announced April 2022.
-
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
Authors:
Malte Ostendorff,
Till Blume,
Terry Ruas,
Bela Gipp,
Georg Rehm
Abstract:
Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms the document coherence, the pairwise classification approach scales poorly to large-scale corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit.
Submitted 28 March, 2022;
originally announced March 2022.
-
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Authors:
Qian Ruan,
Malte Ostendorff,
Georg Rehm
Abstract:
Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+ model), which improves SOTA ROUGEs for extractive summarization on PubMed and arXiv substantially. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively, which differs from our model only in that the hierarchical structure information is not injected. It is also observed that the more conspicuous hierarchical structure the dataset has, the larger improvements our method gains. The ablation study demonstrates that the hierarchical position information is the main contributor to our model's SOTA performance.
Submitted 17 March, 2022;
originally announced March 2022.
-
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
Authors:
Malte Ostendorff,
Nils Rethmeier,
Isabelle Augenstein,
Bela Gipp,
Georg Rehm
Abstract:
Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
Submitted 19 October, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Deep Learning-Based Detection of the Acute Respiratory Distress Syndrome: What Are the Models Learning?
Authors:
Gregory B. Rehm,
Chao Wang,
Irene Cortes-Puch,
Chen-Nee Chuah,
Jason Adams
Abstract:
The acute respiratory distress syndrome (ARDS) is a severe form of hypoxemic respiratory failure with in-hospital mortality of 35-46%. High mortality is thought to be related in part to challenges in making a prompt diagnosis, which may in turn delay implementation of evidence-based therapies. A deep neural network (DNN) algorithm utilizing unbiased ventilator waveform data (VWD) may help to improve screening for ARDS. We first show that a convolutional neural network-based ARDS detection model can outperform prior work with random forest models in AUC (0.95+/-0.019 vs. 0.88+/-0.064), accuracy (0.84+/-0.026 vs. 0.80+/-0.078), and specificity (0.81+/-0.06 vs. 0.71+/-0.089). Frequency ablation studies imply that our model can learn features from low frequency domains typically used for expert feature engineering, and high-frequency information that may be difficult to manually featurize. Further experiments suggest that subtle, high-frequency components of physiologic signals may explain the superior performance of DL models over traditional ML when using physiologic waveform data. Our observations may enable improved interpretability of DL-based physiologic models and may improve the understanding of how high-frequency information in physiologic data impacts the performance of our DL model.
Submitted 25 September, 2021;
originally announced September 2021.
-
Clinical Validation of Single-Chamber Model-Based Algorithms Used to Estimate Respiratory Compliance
Authors:
Gregory Rehm,
Jimmy Nguyen,
Chelsea Gilbeau,
Marc T Bomactao,
Chen-Nee Chuah,
Jason Adams
Abstract:
Non-invasive estimation of respiratory physiology using computational algorithms promises to be a valuable technique for future clinicians to detect detrimental changes in patient pathophysiology. However, few clinical algorithms used to non-invasively analyze lung physiology have undergone rigorous validation in a clinical setting, and are often validated either using mechanical devices, or with small clinical validation datasets using 2-8 patients. This work aims to improve this situation by first establishing an open and clinically validated dataset comprising data from both mechanical lungs and nearly 40,000 breaths from 18 intubated patients. Next, we use this data to evaluate 15 different algorithms that use the "single chamber" model of estimating respiratory compliance. We evaluate these algorithms under varying clinical scenarios patients typically experience during hospitalization. In particular, we explore algorithm performance under four different types of patient ventilator asynchrony. We also analyze algorithms under varying ventilation modes to benchmark algorithm performance and to determine if ventilation mode has any impact on the algorithm. Our approach yields several advances by 1) showing which specific algorithms work best clinically under varying mode and asynchrony scenarios, 2) developing a simple mathematical method to reduce variance in algorithmic results, and 3) presenting additional insights about single-chamber model algorithms. We hope that our paper, approach, dataset, and software framework can thus be used by future researchers to improve their work and allow future integration of "single chamber" algorithms into clinical practice.
Submitted 19 September, 2021;
originally announced September 2021.
-
Evaluating Document Representations for Content-based Legal Literature Recommendations
Authors:
Malte Ostendorff,
Elliott Ash,
Terry Ruas,
Bela Gipp,
Julian Moreno-Schneider,
Georg Rehm
Abstract:
Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user studies without any publicly available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare in total 27 methods using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations. We make our source code, models, and datasets publicly available at https://github.com/malteos/legal-document-similarity/.
Submitted 28 April, 2021;
originally announced April 2021.
-
Aspect-based Document Similarity for Research Papers
Authors:
Malte Ostendorff,
Terry Ruas,
Till Blume,
Bela Gipp,
Georg Rehm
Abstract:
Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity for research papers. Paper citations indicate the aspect-based similarity, i.e., the section title in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show SciBERT as the best performing system. A qualitative examination validates our quantitative results. Our findings motivate future research of aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
Submitted 13 October, 2020;
originally announced October 2020.
-
Multi-Array Electron Beam Stabilization using Block-Circulant Transformation and Generalized Singular Value Decomposition
Authors:
Idris Kempf,
Stephen R. Duncan,
Paul J. Goulart,
Guenther Rehm
Abstract:
We introduce a novel structured controller design for the electron beam stabilization problem of the UK's national synchrotron light source. Because changes to the synchrotron will not allow the application of existing control approaches, we develop a novel method to diagonalize the multi-input multi-output (MIMO) system. A generalized singular value decomposition (GSVD) is used to simultaneously diagonalize the actuator response matrices, which is applicable to an arbitrary number of actuator dynamics in a cross-directional setting. The resulting decoupled systems are regulated using mid-ranged control and the controller gains derived as a function of the generalized singular values. In addition, we exploit the inherent block-circulant symmetry of the system. The performance of our controller is demonstrated using simulations that involve machine data.
Submitted 1 September, 2020;
originally announced September 2020.
-
Symmetry Exploitation in Orbit Feedback Systems of Synchrotron Storage Rings
Authors:
Idris Kempf,
Paul J. Goulart,
Stephen R. Duncan,
Guenther Rehm
Abstract:
Structural symmetries in the storage ring of synchrotrons are intentionally created during the design phase of the magnetic lattices, but they are not considered in the design of control algorithms that stabilize the beam of accelerated particles. The choice of control algorithm, however, is limited by the speed requirements of the synchrotron. Standard control algorithms for synchrotrons are based on a singular value decomposition (SVD) of the orbit response matrix. SVD controllers neither exploit the structural symmetries nor exhibit any speed advantages. Based on the periodicity and the reflection properties of the betatron function, we show that these structural symmetries are inherited by the orbit response matrix. We show that the resulting block-circulant and centrosymmetric properties of the matrix can be used for different computationally efficient decompositions of the controller. We also address the case of broken symmetry due to odd placements of magnets and monitors. Our efficient decomposition could enable the use of more advanced control techniques for synchrotrons, such as control algorithms that require real-time optimization. These advanced control techniques could in turn increase the quality of research in synchrotron light sources.
Submitted 31 August, 2020;
originally announced August 2020.
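As a small illustration of why circulant structure admits computationally efficient decompositions, the sketch below uses the standard fact that a circulant matrix is diagonalized by the discrete Fourier transform, so a linear solve reduces to FFTs. It treats the scalar circulant case only; the block-circulant and centrosymmetric decompositions described in the abstract are not reproduced, and the matrix values and function names are made up for the example.

```python
import numpy as np


def circulant(first_column):
    """Build the dense circulant matrix C with C[i, j] = first_column[(i - j) mod n]."""
    n = len(first_column)
    return np.array([[first_column[(i - j) % n] for j in range(n)] for i in range(n)])


def solve_circulant(first_column, rhs):
    """Solve C x = rhs via FFT, using that the DFT diagonalizes a circulant matrix.

    The eigenvalues of C are the DFT of its first column; they must all be nonzero.
    """
    eigenvalues = np.fft.fft(first_column)
    return np.real(np.fft.ifft(np.fft.fft(rhs) / eigenvalues))


# Toy data (hypothetical): a symmetric circulant matrix and a right-hand side.
c = np.array([4.0, 1.0, 0.0, 1.0])
b = np.array([1.0, 2.0, 3.0, 4.0])

x_fft = solve_circulant(c, b)                 # O(n log n) path
x_dense = np.linalg.solve(circulant(c), b)    # dense reference solve
```

The FFT path never forms or factorizes the dense matrix, which is the general reason structure-exploiting controllers can be faster than a generic SVD-based one.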
-
A Workflow Manager for Complex NLP and Content Curation Pipelines
Authors:
Julián Moreno-Schneider,
Peter Bourgonje,
Florian Kintzel,
Georg Rehm
Abstract:
We present a workflow manager for the flexible creation and customisation of NLP processing pipelines. The workflow manager addresses challenges in interoperability across various different NLP tasks and hardware-based resource usage. Based on the four key principles of generality, flexibility, scalability and efficiency, we present the first version of the workflow manager by providing details on its custom definition language, explaining the communication components and the general system architecture and setup. We currently implement the system, which is grounded and motivated by real-world industry use cases in several innovation and transfer projects.
Submitted 16 April, 2020;
originally announced April 2020.
-
QURATOR: Innovative Technologies for Content and Data Curation
Authors:
Georg Rehm,
Peter Bourgonje,
Stefanie Hegele,
Florian Kintzel,
Julián Moreno Schneider,
Malte Ostendorff,
Karolina Zaczynska,
Armin Berger,
Stefan Grill,
Sören Räuchle,
Jens Rauenbusch,
Lisa Rutenburg,
André Schmidt,
Mikka Wild,
Henry Hoffmann,
Julian Fink,
Sarah Schulz,
Jurica Seva,
Joachim Quantz,
Joachim Böttger,
Josefine Matthey,
Rolf Fricke,
Jan Thomsen,
Adrian Paschke,
Jamal Al Qundus
, et al. (15 additional authors not shown)
Abstract:
In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. The availability of vast amounts of content and the pressure to publish new content quickly and in rapid succession requires faster, more efficient and smarter processing and generation methods. With a consortium of ten partners from research and industry and a broad range of expertise in AI, Machine Learning and Language Technologies, the QURATOR project, funded by the German Federal Ministry of Education and Research, develops a sustainable and innovative technology platform that provides services to support knowledge workers in various industries to address the challenges they face when curating digital content. The project's vision and ambition is to establish an ecosystem for content curation technologies that significantly pushes the current state of the art and transforms its region, the metropolitan area Berlin-Brandenburg, into a global centre of excellence for curation technologies.
Submitted 25 April, 2020;
originally announced April 2020.
-
Towards Discourse Parsing-inspired Semantic Storytelling
Authors:
Georg Rehm,
Karolina Zaczynska,
Julián Moreno-Schneider,
Malte Ostendorff,
Peter Bourgonje,
Maria Berger,
Jens Rauenbusch,
André Schmidt,
Mikka Wild
Abstract:
Previous work of ours on Semantic Storytelling uses text analytics procedures including Named Entity Recognition and Event Detection. In this paper, we outline our longer-term vision on Semantic Storytelling and describe the current conceptual and technical approach. In the project that drives our research we develop AI-based technologies that are verified by partners from industry. One long-term goal is the development of an approach for Semantic Storytelling that has broad coverage and that is, furthermore, robust. We provide first results on experiments that involve discourse parsing, applied to a concrete use case, "Explore the Neighbourhood!", which is based on a semi-automatically collected data set with documents about noteworthy people in one of Berlin's districts. Though automatically obtaining annotations for coherence relations from plain text is a non-trivial challenge, our preliminary results are promising. We envision our approach to be combined with additional features (NER, coreference resolution, knowledge graphs
Submitted 25 April, 2020;
originally announced April 2020.
-
Observations on Annotations
Authors:
Georg Rehm
Abstract:
The annotation of textual information is a fundamental activity in Linguistics and Computational Linguistics. This article presents various observations on annotations. It approaches the topic from several angles including Hypertext, Computational Linguistics and Language Technology, Artificial Intelligence and Open Science. Annotations can be examined along different dimensions. In terms of complexity, they can range from trivial to highly sophisticated, in terms of maturity from experimental to standardised. Annotations can be annotated themselves using more abstract annotations. Primary research data such as, e.g., text documents can be annotated on different layers concurrently, which are independent but can be exploited using multi-layer querying. Standards guarantee interoperability and reusability of data sets. The chapter concludes with four final observations, formulated as research questions or rather provocative remarks on the current state of annotation research.
Submitted 21 April, 2020;
originally announced April 2020.
-
Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability
Authors:
Georg Rehm,
Dimitrios Galanis,
Penny Labropoulou,
Stelios Piperidis,
Martin Welß,
Ricardo Usbeck,
Joachim Köhler,
Miltos Deligiannis,
Katerina Gkirtzou,
Johannes Fischer,
Christian Chiarcos,
Nils Feldhus,
Julián Moreno-Schneider,
Florian Kintzel,
Elena Montiel,
Víctor Rodríguez Doncel,
John P. McCrae,
David Laqua,
Irina Patricia Theile,
Christian Dittmar,
Kalina Bontcheva,
Ian Roberts,
Andrejs Vasiljevs,
Andis Lagzdiņš
Abstract:
With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.
Submitted 17 April, 2020;
originally announced April 2020.
-
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Authors:
Georg Rehm,
Katrin Marheinecke,
Stefanie Hegele,
Stelios Piperidis,
Kalina Bontcheva,
Jan Hajič,
Khalid Choukri,
Andrejs Vasiļjevs,
Gerhard Backfried,
Christoph Prinz,
José Manuel Gómez Pérez,
Luc Meertens,
Paul Lukowicz,
Josef van Genabith,
Andrea Lösch,
Philipp Slusallek,
Morten Irgens,
Patrick Gatellier,
Joachim Köhler,
Laure Le Bars,
Dimitra Anastasiou,
Albina Auksoriūtė,
Núria Bel,
António Branco,
Gerhard Budin
, et al. (22 additional authors not shown)
Abstract:
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
Submitted 30 March, 2020;
originally announced March 2020.
-
European Language Grid: An Overview
Authors:
Georg Rehm,
Maria Berger,
Ela Elsholz,
Stefanie Hegele,
Florian Kintzel,
Katrin Marheinecke,
Stelios Piperidis,
Miltos Deligiannis,
Dimitris Galanis,
Katerina Gkirtzou,
Penny Labropoulou,
Kalina Bontcheva,
David Jones,
Ian Roberts,
Jan Hajic,
Jana Hamrlová,
Lukáš Kačena,
Khalid Choukri,
Victoria Arranz,
Andrejs Vasiļjevs,
Orians Anvari,
Andis Lagzdiņš,
Jūlija Meļņika,
Gerhard Backfried,
Erinç Dikici
, et al. (11 additional authors not shown)
Abstract:
With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented, by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 National Competence Centres (NCCs) and the European LT Council (LTC) for outreach and coordination purposes.
Submitted 30 March, 2020;
originally announced March 2020.
-
Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid
Authors:
Penny Labropoulou,
Katerina Gkirtzou,
Maria Gavriilidou,
Miltos Deligiannis,
Dimitrios Galanis,
Stelios Piperidis,
Georg Rehm,
Maria Berger,
Valérie Mapelli,
Mickaël Rigault,
Victoria Arranz,
Khalid Choukri,
Gerhard Backfried,
José Manuel Gómez Pérez,
Andres Garcia Silva
Abstract:
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the description of Language Resources and Technologies (processing and generation services and tools, models, corpora, term lists, etc.), as well as related entities (e.g., organizations, projects, supporting documents, etc.). The schema powers the European Language Grid platform that aims to be the primary hub and marketplace for industry-relevant Language Technology in Europe. ELG-SHARE has been based on various metadata schemas, vocabularies, and ontologies, as well as related recommendations and guidelines.
Submitted 30 March, 2020;
originally announced March 2020.
-
Named Entities in Medical Case Reports: Corpus and Experiments
Authors:
Sarah Schulz,
Jurica Ševa,
Samuel Rodriguez,
Malte Ostendorff,
Georg Rehm
Abstract:
We present a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library. In the case reports, we annotate cases, conditions, findings, factors and negation modifiers. Moreover, where applicable, we annotate relations between these entities. As such, this is the first corpus of this kind made available to the scientific community in English. It enables the initial investigation of automatic information extraction from case reports through tasks like Named Entity Recognition, Relation Extraction and (sentence/paragraph) relevance detection. Additionally, we present four strong baseline systems for the detection of medical entities made available through the annotated dataset.
Submitted 29 March, 2020;
originally announced March 2020.
-
Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling
Authors:
Dmitrii Aksenov,
Julián Moreno-Schneider,
Peter Bourgonje,
Robert Schwarzenberg,
Leonhard Hennig,
Georg Rehm
Abstract:
We explore to what extent knowledge about the pre-trained language model that is used is beneficial for the task of abstractive summarization. To this end, we experiment with conditioning the encoder and decoder of a Transformer-based neural model on the BERT language model. In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size. We also explore how locality modelling, i.e., the explicit restriction of calculations to the local context, can affect the summarization ability of the Transformer. This is done by introducing 2-dimensional convolutional self-attention into the first layers of the encoder. The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset. We additionally train our model on the SwissText dataset to demonstrate usability on German. Both models outperform the baseline in ROUGE scores on two datasets and show their superiority in a manual qualitative analysis.
Submitted 29 March, 2020;
originally announced March 2020.
-
A Dataset of German Legal Documents for Named Entity Recognition
Authors:
Elena Leitner,
Georg Rehm,
Julián Moreno-Schneider
Abstract:
We describe a dataset developed for Named Entity Recognition in German federal court decisions. It consists of approx. 67,000 sentences with over 2 million tokens. The resource contains 54,000 manually annotated entities, mapped to 19 fine-grained semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. The legal documents were, furthermore, automatically annotated with more than 35,000 TimeML-based time expressions. The dataset, which is available under a CC-BY 4.0 license in the CoNLL-2002 format, was developed for training an NER service for German legal documents in the EU project Lynx.
Submitted 29 March, 2020;
originally announced March 2020.
-
Orchestrating NLP Services for the Legal Domain
Authors:
Julián Moreno-Schneider,
Georg Rehm,
Elena Montiel-Ponsoda,
Víctor Rodriguez-Doncel,
Artem Revenko,
Sotirios Karampatakis,
Maria Khvalchik,
Christian Sageder,
Jorge Gracia,
Filippo Maganza
Abstract:
Legal technology is currently receiving a lot of attention from various angles. In this contribution we describe the main technical components of a system that is currently under development in the European innovation project Lynx, which includes partners from industry and research. The key contribution of this paper is a workflow manager that enables the flexible orchestration of workflows based on a portfolio of Natural Language Processing and Content Curation services as well as a Multilingual Legal Knowledge Graph that contains semantic information and meaningful references to legal documents. We also describe different use cases with which we experiment and develop prototypical solutions.
Submitted 28 March, 2020;
originally announced March 2020.
-
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles
Authors:
Malte Ostendorff,
Terry Ruas,
Moritz Schubotz,
Georg Rehm,
Bela Gipp
Abstract:
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Submitted 22 March, 2020;
originally announced March 2020.
-
Enriching BERT with Knowledge Graph Embeddings for Document Classification
Authors:
Malte Ostendorff,
Peter Bourgonje,
Maria Berger,
Julian Moreno-Schneider,
Georg Rehm,
Bela Gipp
Abstract:
In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.
Submitted 18 September, 2019;
originally announced September 2019.
-
Improving Mechanical Ventilator Clinical Decision Support Systems with A Machine Learning Classifier for Determining Ventilator Mode
Authors:
Gregory B. Rehm,
Brooks T. Kuhn,
Jimmy Nguyen,
Nicholas R. Anderson,
Chen-Nee Chuah,
Jason Y. Adams
Abstract:
Clinical decision support systems (CDSS) will play an increasing role in improving the quality of medical care for critically ill patients. However, due to limitations in current informatics infrastructure, CDSS do not always have complete information on state of supporting physiologic monitoring devices, which can limit the input data available to CDSS. This is especially true in the use case of mechanical ventilation (MV), where current CDSS have no knowledge of critical ventilation settings, such as ventilation mode. To enable MV CDSS to make accurate recommendations related to ventilator mode, we developed a highly performant machine learning model that is able to perform per-breath classification of 5 of the most widely used ventilation modes in the USA with an average F1-score of 97.52%. We also show how our approach makes methodologic improvements over previous work and that it is highly robust to missing data caused by software/sensor error.
Submitted 29 April, 2019;
originally announced April 2019.
-
Mobile Encryption Gateway (MEG) for Email Encryption
Authors:
Gregory B Rehm,
Michael Thompson,
Brad Busenius,
Jennifer Fowler
Abstract:
Email cryptography applications often suffer from major problems that prevent their widespread implementation. MEG, or the Mobile Encryption Gateway, aims to fix the issues associated with email encryption by ensuring that encryption is easy to perform while still maintaining data security. MEG performs automatic decryption and encryption of all emails using PGP. Users do not need to understand the internal workings of the encryption process to use the application. MEG is meant to be email-client-agnostic, enabling users to employ virtually any email service to send messages. Encryption actions are performed on the user's mobile device, which means their keys and data remain personal. MEG can also tackle network effect problems by inviting non-users to join. Most importantly, MEG uses end-to-end encryption, which ensures that all aspects of the encrypted information remain private. As a result, we are hopeful that MEG will finally solve the problem of practical email encryption.
Submitted 6 November, 2017;
originally announced November 2017.