
Showing 1–29 of 29 results for author: Leser, U

Searching in archive cs.
  1. arXiv:2402.12372

    cs.CL

    HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

    Authors: Mario Sänger, Samuele Garda, Xing David Wang, Leon Weber-Genzel, Pia Droop, Benedikt Fuchs, Alan Akbik, Ulf Leser

    Abstract: With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in texts and their linkage to reference knowledge bases are crucial steps in BTM pipelines to enable information aggregation from different documents. H…

    Submitted 20 February, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

  2. arXiv:2401.05125

    cs.CL

    BELHD: Improving Biomedical Entity Linking with Homonoym Disambiguation

    Authors: Samuele Garda, Ulf Leser

    Abstract: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). A popular approach to the task are name-based methods, i.e. those identifying the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, as these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities…

    Submitted 10 January, 2024; originally announced January 2024.

  3. The Common Workflow Scheduler Interface: Status Quo and Future Plans

    Authors: Fabian Lehmann, Jonathan Bader, Lauritz Thamsen, Ulf Leser

    Abstract: Nowadays, many scientific workflows from different domains, such as Remote Sensing, Astronomy, and Bioinformatics, are executed on large computing infrastructures managed by resource managers. Scientific workflow management systems (SWMS) support the workflow execution and communicate with the infrastructures' resource managers. However, the communication between SWMS and resource managers is comp…

    Submitted 27 November, 2023; originally announced November 2023.

    Journal ref: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023)

  4. Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT

    Authors: Mario Sänger, Ninon De Mecquenem, Katarzyna Ewa Lewińska, Vasilis Bountris, Fabian Lehmann, Ulf Leser, Thomas Kosch

    Abstract: Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary…

    Submitted 6 November, 2023; v1 submitted 3 November, 2023; originally announced November 2023.

    Journal ref: Sänger et al.: A qualitative assessment of using ChatGPT as large language model for scientific workflow development, GigaScience, Volume 13, 2024, giae030

  5. arXiv:2310.20431

    cs.LG cs.AI cs.DB

    Raising the ClaSS of Streaming Time Series Segmentation

    Authors: Arik Ermshaus, Patrick Schäfer, Ulf Leser

    Abstract: Ubiquitous sensors today emit high frequency streams of numerical measurements that reflect properties of human, animal, industrial, commercial, and natural processes. Shifts in such processes, e.g. caused by external events or internal state changes, manifest as changes in the recorded signals. The task of streaming time series segmentation (STSS) is to partition the stream into consecutive varia…

    Submitted 26 April, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

  6. Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

    Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Ulf Leser, Odej Kao

    Abstract: Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific ma…

    Submitted 13 September, 2023; originally announced September 2023.

    Journal ref: Future Generation Computer Systems, Volume 150, January 2024, Pages 171-185

  7. arXiv:2308.11537

    cs.CL

    BELB: a Biomedical Entity Linking Benchmark

    Authors: Samuele Garda, Leon Weber-Genzel, Robert Martin, Ulf Leser

    Abstract: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on publish…

    Submitted 22 August, 2023; originally announced August 2023.

  8. arXiv:2305.08409

    cs.DC

    Validity Constraints for Data Analysis Workflows

    Authors: Florian Schintke, Ninon De Mecquenem, David Frantz, Vanessa Emanuela Guarino, Marcus Hilbrich, Fabian Lehmann, Rebecca Sattler, Jan Arne Sparka, Daniel Speckhard, Hermann Stolte, Anh Duc Vu, Ulf Leser

    Abstract: Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even only a new dataset with some notably different properties is often challenging. Despite the structured definition of the steps (tasks) and their interdependencies during a complex data analysis in the DAW specification, relevant assumptions may remain unspecified and implicit. Such hidden a…

    Submitted 15 May, 2023; originally announced May 2023.

  9. How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface

    Authors: Fabian Lehmann, Jonathan Bader, Friedrich Tschirpke, Lauritz Thamsen, Ulf Leser

    Abstract: Scientific workflow management systems (SWMSs) and resource managers together ensure that tasks are scheduled on provisioned resources so that all dependencies are obeyed, and some optimization goal, such as makespan minimization, is achieved. In practice, however, there is no clear separation of scheduling responsibilities between an SWMS and a resource manager because there exists no agreed-upon…

    Submitted 13 July, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Journal ref: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

  10. arXiv:2301.10194

    cs.LG

    WEASEL 2.0 -- A Random Dilated Dictionary Transform for Fast, Accurate and Memory Constrained Time Series Classification

    Authors: Patrick Schäfer, Ulf Leser

    Abstract: A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important components of the currently most accurate TSC ensem…

    Submitted 1 February, 2023; v1 submitted 24 January, 2023; originally announced January 2023.

  11. Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

    Authors: Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in…

    Submitted 17 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: Paper accepted in 41st IEEE International Performance Computing and Communications Conference (IPCCC 2022)

  12. ClaSP -- Parameter-free Time Series Segmentation

    Authors: Arik Ermshaus, Patrick Schäfer, Ulf Leser

    Abstract: The study of natural and human-made processes often results in long sequences of temporally-ordered values, aka time series (TS). Such processes often consist of multiple states, e.g. operating modes of a machine, such that state changes in the observed processes result in changes in the distribution of shape of the measured values. Time series segmentation (TSS) tries to find such changes in TS p…

    Submitted 18 November, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

  13. arXiv:2206.03735

    cs.LG cs.AI cs.DB cs.DS

    Motiflets -- Simple and Accurate Detection of Motifs in Time Series

    Authors: Patrick Schäfer, Ulf Leser

    Abstract: A time series motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, the riff in a pop song, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying definitions of what…

    Submitted 16 April, 2024; v1 submitted 8 June, 2022; originally announced June 2022.

  14. Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

    Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure chan…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: paper accepted in 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022)

  15. A Community Roadmap for Scientific Workflows Research and Development

    Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Ilkay Altintas, Rosa M Badia, Bartosz Balis, Tainã Coleman, Frederik Coppens, Frank Di Natale, Bjoern Enders, Thomas Fahringer, Rosa Filgueira, Grigori Fursin, Daniel Garijo, Carole Goble, Dorran Howell, Shantenu Jha, Daniel S. Katz, Daniel Laney, Ulf Leser, Maciej Malawski, Kshitij Mehta, Loïc Pottier, Jonathan Ozik, J. Luc Peterson, et al. (4 additional authors not shown)

    Abstract: The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows…

    Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2103.09181

  16. Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development

    Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Tainã Coleman, Dan Laney, Dong Ahn, Shantenu Jha, Dorran Howell, Stian Soiland-Reys, Ilkay Altintas, Douglas Thain, Rosa Filgueira, Yadu Babuji, Rosa M. Badia, Bartosz Balis, Silvina Caino-Lores, Scott Callaghan, Frederik Coppens, Michael R. Crusoe, Kaushik De, Frank Di Natale, Tu M. A. Do, Bjoern Enders, Thomas Fahringer, Anne Fouilloux, et al. (33 additional authors not shown)

    Abstract: Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role i…

    Submitted 9 June, 2021; originally announced June 2021.

  17. arXiv:2008.10856

    cs.CL

    TabSim: A Siamese Neural Network for Accurate Estimation of Table Similarity

    Authors: Maryam Habibi, Johannes Starlinger, Ulf Leser

    Abstract: Tables are a popular and efficient means of presenting structured information. They are used extensively in various kinds of documents including web pages. Tables display information as a two-dimensional matrix, the semantics of which is conveyed by a mixture of structure (rows, columns), headers, caption, and content. Recent research has started to consider tables as first class objects, not just…

    Submitted 25 August, 2020; originally announced August 2020.

  18. arXiv:2008.07347

    cs.CL

    HunFlair: An Easy-to-Use Tool for State-of-the-Art Biomedical Named Entity Recognition

    Authors: Leon Weber, Mario Sänger, Jannes Münchmeyer, Maryam Habibi, Ulf Leser, Alan Akbik

    Abstract: Summary: Named Entity Recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, highly accurate, and robust towards variations in text genre and style. To this end, we propose HunFlair, an NER tagger covering multiple entity types integrated into the widely used NLP framework Flair. HunFlair outperforms…

    Submitted 18 August, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

    Comments: - Corrected author list - Updated project link

  19. arXiv:2006.03104

    cs.DC

    Portability of Scientific Workflows in NGS Data Analysis: A Case Study

    Authors: Christopher Schiefer, Marc Bux, Joergen Brandt, Clemens Messerschmidt, Knut Reinert, Dieter Beule, Ulf Leser

    Abstract: The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtime. Porting a workflow developed for a particular system on a partic…

    Submitted 4 June, 2020; originally announced June 2020.

  20. arXiv:1908.03405

    cs.LG stat.ML

    TEASER: Early and Accurate Time Series Classification

    Authors: P. Schäfer, U. Leser

    Abstract: Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification problem easier but delays the time in which a classifica…

    Submitted 16 August, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

  21. arXiv:1906.06187

    cs.CL cs.LO

    NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language

    Authors: Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser, Tim Rocktäschel

    Abstract: Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and…

    Submitted 14 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  22. Finding k-Dissimilar Paths with Minimum Collective Length

    Authors: Theodoros Chondrogiannis, Panagiotis Bouros, Johann Gamper, Ulf Leser, David B. Blumenthal

    Abstract: Shortest path computation is a fundamental problem in road networks. However, in many real-world scenarios, determining solely the shortest path is not enough. In this paper, we study the problem of finding k-Dissimilar Paths with Minimum Collective Length (kDPwML), which aims at computing a set of paths from a source s to a target t such that all paths are pairwise dissimilar by at least θ and the…

    Submitted 24 October, 2018; v1 submitted 18 September, 2018; originally announced September 2018.

    Comments: Extended version of the SIGSPATIAL'18 paper under the same title

    Journal ref: 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS 2018), Seattle, Washington, USA, November 6-9, 2018

  23. Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning

    Authors: Carl Witt, Marc Bux, Wladislaw Gusew, Ulf Leser

    Abstract: In many domains, the previous decade was characterized by increasing data volumes and growing complexity of computational workloads, creating new demands for highly data-parallel computing in distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g., for scheduling and resou…

    Submitted 30 May, 2018; originally announced May 2018.

    Comments: 19 pages, 3 figures, 5 tables

    Journal ref: Information Systems 2019

  24. arXiv:1805.01646

    cs.CL

    Cross-lingual Candidate Search for Biomedical Concept Normalization

    Authors: Roland Roller, Madeleine Kittner, Dirk Weissenborn, Ulf Leser

    Abstract: Biomedical concept normalization links concept mentions in texts to a semantically equivalent concept in a biomedical knowledge base. This task is challenging as concepts can have different expressions in natural languages, e.g. paraphrases, which are not necessarily all present in the knowledge base. Concept normalization of non-English biomedical text is even more challenging as non-English reso…

    Submitted 4 May, 2018; originally announced May 2018.

  25. arXiv:1801.03644

    cs.DB

    Multidimensional Range Queries on Modern Hardware

    Authors: Stefan Sprenger, Patrick Schäfer, Ulf Leser

    Abstract: Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As for most index structures, the usefulness of this approach depends on the selectivity of the queries, and common wisdom told that a simple scan beats MDIS for queries accessin…

    Submitted 14 May, 2018; v1 submitted 11 January, 2018; originally announced January 2018.

  26. arXiv:1711.11343

    cs.LG

    Multivariate Time Series Classification with WEASEL+MUSE

    Authors: Patrick Schäfer, Ulf Leser

    Abstract: Multivariate time series (MTS) arise when multiple interconnected sensors record data over time. Dealing with this high-dimensional data is challenging for every classifier for at least two aspects: First, an MTS is not only characterized by individual feature values, but also by the interplay of features in different dimensions. Second, this typically adds large amounts of irrelevant data and noi…

    Submitted 17 August, 2018; v1 submitted 30 November, 2017; originally announced November 2017.

  27. arXiv:1701.07681

    cs.DS cs.LG stat.ML

    Fast and Accurate Time Series Classification with WEASEL

    Authors: Patrick Schäfer, Ulf Leser

    Abstract: Time series (TS) occur in many scientific and commercial applications, ranging from earth surveillance to industry automation to the smart grids. An important type of TS analysis is classification, which can, for instance, improve energy load forecasting in smart grids by detecting the types of electronic devices based on their energy consumption profiles recorded by automatic sensors. Such sensor…

    Submitted 26 January, 2017; originally announced January 2017.

    Journal ref: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17). ACM, 637-646

  28. arXiv:1311.6335

    cs.DB

    SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows

    Authors: Astrid Rheinländer, Arvid Heise, Fabian Hueske, Ulf Leser, Felix Naumann

    Abstract: Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for dataflow o…

    Submitted 25 November, 2013; originally announced November 2013.

  29. arXiv:1303.7195

    cs.DC

    Parallelization in Scientific Workflow Management Systems

    Authors: Marc Bux, Ulf Leser

    Abstract: Over the last two decades, scientific workflow management systems (SWfMS) have emerged as a means to facilitate the design, execution, and monitoring of reusable scientific data processing pipelines. At the same time, the amounts of data generated in various areas of science outpaced enhancements in computational power and storage capabilities. This is especially true for the life sciences, where…

    Submitted 28 March, 2013; originally announced March 2013.

    Comments: 24 pages, 17 figures (13 PDF, 4 PNG)

    MSC Class: 68N19

    ACM Class: C.1.4; D.1.3; D.3.2; J.3