ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering

Authors: Raphael Gruber, Abdelrahman Abdallah, Michael Färber, Adam Jatowt

Abstract: We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatched breadth of topics. We introduce a unique taxonomy that categorizes questions as attributes, comparisons, and counting questions, each revolving around events, entities, and time periods. One standout feature of ComplexTempQA is the high complexity of its questions, which demand effective capabilities for answering such as across-time comparison, temporal aggregation, and multi-hop reasoning involving temporal event ordering and entity recognition. Additionally, each question is accompanied by detailed metadata, including specific time scopes, allowing for comprehensive evaluation and enhancement of the temporal reasoning abilities of large language models. ComplexTempQA serves both as a testing ground for developing sophisticated AI models and as a foundation for advancing research in question answering, information retrieval, and language understanding. Dataset and code are freely available at: https://github.com/DataScienceUIBK/ComplexTempQA.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.09557 [pdf, other]

Machine Learning in Short-Reach Optical Systems: A Comprehensive Survey

Authors: Chen Shao, Elias Giacoumidis, Syed Moktacim Billah, Shi Li, Jialei Li, Prashasti Sahu, Andre Richter, Tobias Kaefer, Michael Faerber

Abstract: In recent years, extensive research has been conducted to explore the utilization of machine learning algorithms in various direct-detected and self-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic prediction, and digital signal processing (DSP)-based equalization. As a versatile approach, machine learning demonstrates the ability to address stochastic phenomena in optical systems networks where deterministic methods may fall short. However, when it comes to DSP equalization algorithms, their performance improvements are often marginal, and their complexity is prohibitively high, especially in cost-sensitive short-reach communications scenarios such as passive optical networks (PONs). They excel in capturing temporal dependencies, handling irregular or nonlinear patterns effectively, and accommodating variable time intervals. Within this extensive survey, we outline the application of machine learning techniques in short-reach communications, specifically emphasizing their utilization in high-bandwidth demanding PONs. Notably, we introduce a novel taxonomy for time-series methods employed in machine learning signal processing, providing a structured classification framework. Our taxonomy categorizes current time series methods into four distinct groups: traditional methods, Fourier convolution-based methods, transformer-based models, and time-series convolutional networks.
Finally, we highlight prospective research directions within this rapidly evolving field and outline specific solutions to mitigate the complexity associated with hardware implementations. We aim to pave the way for more practical and efficient deployment of machine learning approaches in short-reach optical communication systems by addressing complexity concerns.

Submitted 29 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: 23 pages, 2 figures, 3 tables, Accepted as MDPI Photonics Journal Special Issue Machine Learning Applied to Optical Communication Systems

arXiv:2405.02609 [pdf, other]

Advanced Equalization in 112 Gb/s Upstream PON Using a Novel Fourier Convolution-based Network

Authors: Chen Shao, Elias Giacoumidis, Patrick Matalla, Jialei Li, Shi Li, Sebastian Randel, Andre Richter, Michael Faerber, Tobias Kaefer

Abstract: We experimentally demonstrate a novel, low-complexity Fourier Convolution-based Network (FConvNet) based equalizer for 112 Gb/s upstream PAM4-PON. At a BER of 0.005, FConvNet enhances the receiver sensitivity by 2 and 1 dB compared to a 51-tap Sato equalizer and benchmark machine learning algorithms respectively.

Submitted 4 May, 2024; originally announced May 2024.

Comments: 4 pages, 5 figures

arXiv:2405.00720 [pdf, other]

A Novel Machine Learning-based Equalizer for a Downstream 100G PAM-4 PON

Authors: Chen Shao, Elias Giacoumidis, Shi Li, Jialei Li, Michael Faerber, Tobias Kaefer, Andre Richter

Abstract: A frequency-calibrated SCINet (FC-SCINet) equalizer is proposed for down-stream 100G PON with 28.7 dB path loss. At 5 km, FC-SCINet improves the BER by 88.87% compared to FFE and a 3-layer DNN with 10.57% lower complexity.

Submitted 25 April, 2024; originally announced May 2024.

Comments: 3 pages, 6 figures, accepted by Optical Fiber Communications Conference and Exhibition 2024

arXiv:2404.06911 [pdf, other]

GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism

Authors: Shuzhou Yuan, Michael Färber

Abstract: Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) require cumbersome processes for integration into PLMs. In this work, we propose a novel graph-guided self-attention mechanism, GraSAME. GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts. As an end-to-end, lightweight multimodal module, GraSAME follows a multi-task learning strategy and effectively bridges the gap between graph and textual modalities, facilitating dynamic interactions between GNNs and PLMs. Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets. Furthermore, compared to SOTA models, GraSAME eliminates the need for extra pre-training tasks to adjust graph inputs and reduces the number of trainable parameters by over 100 million.

Submitted 10 April, 2024; originally announced April 2024.

Comments: NAACL 2024 Findings

arXiv:2403.20132 [pdf]

A formal specification of the jq language

Authors: Michael Färber

Abstract: jq is a widely used tool that provides a programming language to manipulate JSON data. However, the jq language is currently only specified by its implementation, making it difficult to reason about its behaviour. To this end, we provide a formal syntax and denotational semantics for a large subset of the jq language. Our most significant contribution is to provide a new way to interpret updates that allows for more predictable and performant execution.

Submitted 29 March, 2024; originally announced March 2024.

ACM Class: D.3.1

arXiv:2403.16846 [pdf, other]

GreeDy and CoDy: Counterfactual Explainers for Dynamic Graphs

Authors: Zhan Qu, Daniel Gomm, Michael Färber

Abstract: Temporal Graph Neural Networks (TGNNs), crucial for modeling dynamic graphs with time-varying interactions, face a significant challenge in explainability due to their complex model structure. Counterfactual explanations, crucial for understanding model decisions, examine how input graph changes affect outcomes. This paper introduces two novel counterfactual explanation methods for TGNNs: GreeDy (Greedy Explainer for Dynamic Graphs) and CoDy (Counterfactual Explainer for Dynamic Graphs). They treat explanations as a search problem, seeking input graph alterations that alter model predictions. GreeDy uses a simple, greedy approach, while CoDy employs a sophisticated Monte Carlo Tree Search algorithm. Experiments show both methods effectively generate clear explanations. Notably, CoDy outperforms GreeDy and existing factual methods, with up to 59% higher success rate in finding significant counterfactual inputs. This highlights CoDy's potential in clarifying TGNN decision-making, increasing their transparency and trustworthiness in practice.

Submitted 25 March, 2024; originally announced March 2024.
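
Illustration (not from the paper): the abstract above frames counterfactual explanation as a search for input alterations that change the model's prediction. A minimal Python sketch of a greedy search of that kind follows. Everything in it is an assumption made for illustration: the `score` callable stands in for a TGNN, events are plain hashable labels, the only alteration considered is removing an event, and a prediction counts as "flipped" when the score changes sign. It should not be read as the GreeDy algorithm itself.

```python
from typing import Callable, FrozenSet, Hashable, Optional, Set

Event = Hashable


def greedy_counterfactual(
    events: Set[Event],
    score: Callable[[FrozenSet[Event]], float],
    max_removals: int = 10,
) -> Optional[Set[Event]]:
    """Greedily remove events until the sign of `score` flips.

    Returns the set of removed events, or None if no flip was found.
    """
    remaining = set(events)
    removed: Set[Event] = set()
    original_positive = score(frozenset(remaining)) > 0

    for _ in range(max_removals):
        if not remaining:
            break

        def signed(ev: Event) -> float:
            # Lower value = closer to flipping the original decision.
            s = score(frozenset(remaining - {ev}))
            return s if original_positive else -s

        # Remove the single event that moves the score furthest toward a flip.
        best = min(remaining, key=signed)
        remaining.discard(best)
        removed.add(best)

        if (score(frozenset(remaining)) > 0) != original_positive:
            return removed
    return None


# Toy stand-in for a model: a weighted sum of the events that are present.
WEIGHTS = {"a": 2.0, "b": 1.0, "c": -0.5}


def toy_score(present: FrozenSet[Event]) -> float:
    return sum(WEIGHTS[e] for e in present) - 1.5


explanation = greedy_counterfactual(set(WEIGHTS), toy_score)
```

In the toy setup, removing event "a" alone is enough to change the sign of the score, so the search stops after one step. A tree search such as the one the abstract attributes to CoDy would explore alternative removal orders instead of committing to the locally best one.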

arXiv:2403.11747 [pdf, other]

Embedded Named Entity Recognition using Probing Classifiers

Authors: Nicholas Popovič, Michael Färber

Abstract: Extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose directly embedding information extraction capabilities into pre-trained language models using probing classifiers, enabling efficient simultaneous text generation and information extraction. For this, we introduce an approach called EMBER and show that it enables named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments using GPT-2 show that EMBER maintains high token generation rates during streaming text generation, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline using a separate NER model. Code and data are available at https://github.com/nicpopovic/EMBER.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.18397 [pdf, other]

Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models

Authors: Ercong Nie, Shuzhou Yuan, Bolei Ma, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

Abstract: Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

Submitted 28 February, 2024; originally announced February 2024.

Comments: 18 pages, 7 figures

arXiv:2402.11709 [pdf, other]

GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network

Authors: Shuzhou Yuan, Ercong Nie, Michael Färber, Helmut Schmid, Hinrich Schütze

Abstract: Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are used. However, fine-tuning still remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducing a prompt-based parameter-efficient fine-tuning (PEFT) approach. GNNavi leverages insights into ICL's information flow dynamics, which indicates that label words act in prompts as anchors for information propagation. GNNavi employs a Graph Neural Network (GNN) layer to precisely guide the aggregation and distribution of information flow during the processing of prompts by hardwiring the desired information flow into the GNN. Our experiments on text classification tasks with GPT-2 and Llama2 show GNNavi surpasses standard prompt-based fine-tuning methods in few-shot settings by updating just 0.2% to 0.5% of parameters. We compare GNNavi with prevalent PEFT approaches, such as prefix tuning, LoRA and Adapter in terms of performance and efficiency. Our analysis reveals that GNNavi enhances information flow and ensures a clear aggregation process.

Submitted 7 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

Comments: ACL2024 Findings

arXiv:2402.11700 [pdf, other]

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Authors: Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber

Abstract: Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. While traditional approaches such as model pruning or distillation offer ways for reducing model size, they often come at the expense of performance retention. In our investigation, we systematically explore the approach of reducing the number of layers in LLMs. Surprisingly, we observe that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work aimed at mitigating the size constraints of LLMs while preserving their performance, thereby opening avenues for significantly more efficient use of LLMs.

Submitted 18 February, 2024; originally announced February 2024.

Comments: 6 pages, 2 figures

arXiv:2401.16589 [pdf, other]

ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks

Authors: Bolei Ma, Ercong Nie, Shuzhou Yuan, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

Abstract: Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.

Submitted 13 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: EACL 2024
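
Illustration (not from the paper): the abstract above says ToPro decomposes an input sentence into single tokens and applies one prompt template to each token. A minimal Python sketch of that decomposition step is shown here. The template wording, the `<mask>` placeholder, the whitespace tokenization, and the name `decompose_prompts` are all hypothetical choices for illustration, since the listing does not give the authors' actual template or tokenizer.

```python
from typing import List

# Hypothetical template; the actual ToPro wording is not given in the abstract.
TEMPLATE = 'Sentence: {sentence}\nThe word "{token}" is a {mask}.'


def decompose_prompts(sentence: str, mask: str = "<mask>") -> List[str]:
    """Build one prompt per whitespace-separated token of `sentence`."""
    tokens = sentence.split()
    return [
        TEMPLATE.format(sentence=sentence, token=tok, mask=mask)
        for tok in tokens
    ]


prompts = decompose_prompts("Anna visited Paris")
for prompt in prompts:
    print(prompt)
    print("---")
```

Each prompt would then be passed to a language model that fills the masked slot with a label word, which yields one label per token.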

arXiv:2312.10638 [pdf, other]

doi 10.1007/978-3-031-56060-6_17

HyperPIE: Hyperparameter Information Extraction from Scientific Publications

Authors: Tarek Saier, Mayumi Ohta, Takuto Asakura, Michael Färber

Abstract: Automatic extraction of information from publications is key to making scientific knowledge machine readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% F1 over a state-of-the-art baseline. For large language models, we develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% F1 in entity recognition over using JSON. With our best performing model we extract hyperparameter information from a large number of unannotated papers, and analyze patterns across disciplines. All our data and source code is publicly available at https://github.com/IllDepence/hyperpie

Submitted 10 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: accepted at ECIR2024

arXiv:2312.01124 [pdf, ps, other]

Sequential topological complexity of aspherical spaces and sectional categories of subgroup inclusions

Authors: Arturo Espinosa Baro, Michael Farber, Stephan Mescher, John Oprea

Abstract: We generalize results from topological robotics on the topological complexity (TC) of aspherical spaces to sectional categories of fibrations inducing subgroup inclusions on the level of fundamental groups. In doing so, we establish new lower bounds on sequential TCs of aspherical spaces as well as the parametrized TC of epimorphisms. Moreover, we generalize the Costa-Farber canonical class for TC to classes for sequential TCs and explore their properties. We combine them with the results on sequential TCs of aspherical spaces to obtain results on spaces that are not necessarily aspherical.

Submitted 8 December, 2023; v1 submitted 2 December, 2023; originally announced December 2023.

Comments: 40 pages

MSC Class: 55M30 (68T40; 20J05)

arXiv:2310.20475 [pdf, other]

Linked Papers With Code: The Latest in Machine Learning as an RDF Knowledge Graph

Authors: Michael Färber, David Lamprecht

Abstract: In this paper, we introduce Linked Papers With Code (LPWC), an RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible at https://linkedpaperswithcode.com and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications.

Submitted 31 October, 2023; originally announced October 2023.

Comments: Published at ISWC'23

arXiv:2310.20444 [pdf, other]

Analyzing the Impact of Companies on AI Research Based on Publications

Authors: Michael Färber, Lazaros Tampakis

Abstract: Artificial Intelligence (AI) is one of the most momentous technologies of our time. Thus, it is of major importance to know which stakeholders influence AI research. Besides researchers at universities and colleges, researchers in companies have hardly been considered in this context. In this article, we consider how the influence of companies on AI research can be made measurable on the basis of scientific publishing activities. We compare academic- and company-authored AI publications published in the last decade and use scientometric data from multiple scholarly databases to look for differences across these groups and to disclose the top contributing organizations. While the vast majority of publications is still produced by academia, we find that the citation count an individual publication receives is significantly higher when it is (co-)authored by a company. Furthermore, using a variety of altmetric indicators, we notice that publications with company participation receive considerably more attention online. Finally, we place our analysis results in a broader context and present targeted recommendations to safeguard a harmonious balance between academia and industry in the realm of AI research.

Submitted 31 October, 2023; originally announced October 2023.

Comments: Published in Scientometrics

arXiv:2309.04797 [pdf, other]

A Full-fledged Commit Message Quality Checker Based on Machine Learning

Authors: David Faragó, Michael Färber, Christian Petrov

Abstract: Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM is written, including its meaning and context. Since this task is challenging, we ask the research question: how well can the CM quality, including semantics and context, be measured with machine learning methods? By considering all rules from the most popular CM quality guideline, creating datasets for those rules, and training and evaluating state-of-the-art machine learning models to check those rules, we can answer the research question with: sufficiently well for practice, with the lowest F$_1$ score of 82.9% for the most challenging task. We develop a full-fledged open-source framework that checks all these CM quality rules. It is useful for research, e.g., automatic CM generation, but most importantly for software practitioners to raise the quality of CMs and thus the maintainability and evolution speed of their software.

Submitted 9 September, 2023; originally announced September 2023.

Comments: published at COMPSAC'23

arXiv:2308.10595 [pdf, other]

Sequential parametrized topological complexity of sphere bundles

Authors: Michael Farber, Amit Kumar Paul

Abstract: Autonomous motion of a system (robot) is controlled by a motion planning algorithm. A sequential parametrized motion planning algorithm [FP22] works under variable external conditions and generates continuous motions of the system to attain the prescribed sequence of states at prescribed moments of time. Topological complexity of such algorithms characterises their structure and discontinuities. Information about states of the system consistent with states of the external conditions is described by a fibration $p: E\to B$ where the base $B$ parametrises the external conditions and each fibre $p^{-1}(b)$ is the configuration space of the system constrained by external conditions $b\in B$; more detail on this approach is given below. Our main goal in this paper is to study the sequential topological complexity of sphere bundles $\dot\xi: \dot E\to B$; in other words we study "parametrized families of spheres" and sequential parametrized motion planning algorithms for such bundles. We use the Euler and Stiefel-Whitney characteristic classes to obtain lower bounds on the topological complexity. We illustrate our results by many explicit examples. Some related results for the special case $r=2$ were described earlier in [FW23].

Submitted 21 August, 2023; originally announced August 2023.

MSC Class: 55M30

arXiv:2308.03671 [pdf, other]

SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples

Authors: Michael Färber, David Lamprecht, Johan Krause, Linn Aung, Peter Haase

Abstract: We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF dump files, a SPARQL endpoint, and as a data source in the Linked Open Data cloud, complete with resolvable URIs and links to other data sources. Moreover, we provide embeddings for knowledge graph entities using high-performance computing. SemOpenAlex enables a broad range of use-case scenarios, such as exploratory semantic search via our website, large-scale scientific impact quantification, and other forms of scholarly big data analytics within and across scientific disciplines. Additionally, it enables academic recommender systems, such as recommending collaborators, publications, and venues, including explainability capabilities. Finally, SemOpenAlex can serve for RDF query optimization benchmarks, creating scholarly knowledge-guided language models, and as a hub for semantic scientific publishing.

Submitted 7 August, 2023; originally announced August 2023.

Comments: accepted at ISWC'23

arXiv:2308.03531 [pdf, other]

Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election

Authors: Michael Färber, Jannik Schwade, Adam Jatowt

Abstract: Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news articles related to US elections. However, the question of how diversity in news articles can be measured holistically, i.e., with respect to (1) variety, (2) balance, and (3) disparity, considering individuals, parties, and topics, has not been addressed. In this paper, we present a framework for determining diversity in news articles according to these dimensions. Furthermore, we create and provide a dataset of Google Top Stories, encompassing more than 26,000 unique headlines from more than 900 news outlets collected within two weeks before and after the 2021 German federal election. While we observe high diversity for more general search terms (e.g., "election"), a range of search terms ("education," "Europe," "climate protection," "government") resulted in news articles with high diversity in two out of three dimensions. This reflects a more subjective, dedicated discussion on rather future-oriented topics.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2308.03519 [pdf, other]

Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings

Authors: Michael Färber, Nicholas Popovic

Abstract: In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.

Submitted 7 August, 2023; originally announced August 2023.

Comments: accepted at RANLP'23

arXiv:2307.14712 [pdf, other]

Evaluating Generative Models for Graph-to-Text Generation

Authors: Shuzhou Yuan, Michael Färber

Abstract: Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate GPT-3 and ChatGPT on two graph-to-text datasets and compare their performance with that of finetuned LLM models such as T5 and BART. Our results demonstrate that generative models are capable of generating fluent and coherent text, achieving BLEU scores of 10.57 and 11.08 for the AGENDA and WebNLG datasets, respectively. However, our error analysis reveals that generative models still struggle with understanding the semantic relations between entities, and they also tend to generate text with hallucinations or irrelevant information. As a part of error analysis, we utilize BERT to detect machine-generated text and achieve high macro-F1 scores. We have made the text generated by generative models publicly available.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted as short paper in RANLP2023

arXiv:2303.15193 [pdf, other]

doi 10.1109/JCDL57899.2023.00016

CoCon: A Data Set on Combined Contextualized Research Artifact Use

Authors: Tarek Saier, Youxiang Dong, Michael Färber

Abstract: In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets. To enable more holistic analyses and systems dealing with academic publications and their content, we propose CoCon, a large scholarly data set reflecting the combined use of research artifacts, contextualized in academic publications' full-text. Our data set comprises 35 k artifacts (data sets, methods, models, and tasks) and 340 k publications. We additionally formalize a link prediction task for "combined research artifact use prediction" and provide code to utilize analyses of and the development of ML applications on our data. All data and code is publicly available at https://github.com/IllDepence/contextgraph.

Submitted 27 March, 2023; originally announced March 2023.

Comments: submitted to JCDL2023

arXiv:2303.14957 [pdf, other]

doi 10.1109/JCDL57899.2023.00020

unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network

Authors: Tarek Saier, Johan Krause, Michael Färber

Abstract: Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text in particular have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code is publicly available at https://github.com/IllDepence/unarXive.

Submitted 27 March, 2023; originally announced March 2023.

Comments: submitted to JCDL2023

arXiv:2302.10576 [pdf, other]

Denotational Semantics and a Fast Interpreter for jq

Authors: Michael Färber

Abstract: jq is a widely used tool that provides a programming language to manipulate JSON data. However, its semantics are currently only specified by its implementation, making it difficult to reason about its behaviour. To this end, I provide a syntax and denotational semantics for a subset of the jq language. In particular, the semantics provide a new way to interpret updates. I implement an extended version of the semantics in a novel interpreter for the jq language called jaq. Although jaq uses a significantly simpler approach to execute jq programs than jq, jaq is faster than jq on ten out of thirteen benchmarks.

Submitted 21 February, 2023; originally announced February 2023.

Comments: Submitted to OOPSLA 2023

arXiv:2301.07809 [pdf, other]

doi 10.1112/blms.12957

A Random Graph Growth Model

Authors: Michael Farber, Alexander Gnedin, Wajid Mannan

Abstract: A growing random graph is constructed by successively sampling without replacement an element from the pool of virtual vertices and edges. At the start of the process the pool contains $N$ virtual vertices and no edges. Each time a vertex is sampled and occupied, the edges linking the vertex to previously occupied vertices are added to the pool of virtual elements. We focus on the edge count at times when the graph has $n\leq N$ occupied vertices. Two different Poisson limits are identified for $n\asymp N^{1/3}$ and $N-n\asymp 1$. For the bulk of the process, when $n\asymp N$, the scaled number of edges is shown to fluctuate about a deterministic curve, with fluctuations being of the order of $N^{3/2}$ and approximable by a Gaussian bridge.

Submitted 18 January, 2023; originally announced January 2023.

Comments: 21 pages, 1 figure

MSC Class: 05C80; 60B20

Journal ref: Bulletin of the London Mathematical Society 56, Issue 2 (2024) pp. 662-680
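
The sampling process described in the abstract above can be simulated in a few lines. The following is only a minimal sketch, not the authors' code: it assumes that each element is drawn uniformly at random from the current pool (the abstract says only "sampling without replacement"), and the function name simulate_edge_counts is hypothetical. Because only the numbers of vertices and edges matter for the edge count, the sketch tracks counts rather than individual elements.

```python
import random


def simulate_edge_counts(N, seed=0):
    """Return a list whose entry n-1 is the number of graph edges at the
    moment the n-th vertex becomes occupied (n = 1, ..., N).

    Assumption: each draw is uniform over the current pool."""
    rng = random.Random(seed)
    virtual_vertices = N   # vertices still in the pool
    pool_edges = 0         # virtual edges still in the pool
    occupied = 0           # vertices already sampled
    edges = 0              # edges already sampled (edges of the graph)
    counts = []
    while virtual_vertices > 0:
        pool_size = virtual_vertices + pool_edges
        if rng.randrange(pool_size) < virtual_vertices:
            # A vertex is drawn: its links to the previously occupied
            # vertices join the pool as virtual edges.
            pool_edges += occupied
            occupied += 1
            virtual_vertices -= 1
            counts.append(edges)
        else:
            # An edge is drawn: it leaves the pool and joins the graph.
            pool_edges -= 1
            edges += 1
    return counts


if __name__ == "__main__":
    counts = simulate_edge_counts(1000, seed=1)
    print(counts[9], counts[499], counts[999])
```

Averaging such runs over many seeds gives an empirical edge-count curve against which the limits stated in the abstract could be compared.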

arXiv:2301.07483 [pdf, other]

Biases in Scholarly Recommender Systems: Impact, Prevalence, and Mitigation

Authors: Michael Färber, Melissa Coutinho, Shuzhou Yuan

Abstract: With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are subject to various biases. In this article, we first break down the biases of academic recommender systems and characterize them according to their impact and prevalence. In doing so, we distinguish between biases originally caused by humans and biases induced by the recommender system. Second, we provide an overview of methods that have been used to mitigate these biases in the scholarly domain. Based on this, third, we present a framework that can be used by researchers and developers to mitigate biases in scholarly recommender systems and to evaluate recommender systems fairly. Finally, we discuss open challenges and possible research directions related to scholarly biases.

Submitted 13 February, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: 44 pages, 6 figures. To be published in Scientometrics

arXiv:2301.07404 [pdf, other]

Large simplicial complexes: Universality, Randomness, and Ampleness

Authors: Michael Farber

Abstract: The paper surveys recent progress in understanding geometric, topological and combinatorial properties of large simplicial complexes, focusing mainly on ampleness, connectivity and universality. In the first part of the paper we concentrate on $r$-ample simplicial complexes which are high dimensional analogues of the $r$-e.c. graphs introduced originally by Erdős and Rényi. The class of $r$-ample complexes is useful for applications since these complexes allow extensions of subcomplexes of a certain type in all possible ways; besides, $r$-ample complexes exhibit remarkable robustness properties. We discuss results about the existence of $r$-ample complexes and describe their probabilistic and deterministic constructions. The properties of random simplicial complexes in the medial regime are important for this discussion since these complexes are ample, in a certain range. We prove that the topological complexity of a random simplicial complex in the medial regime satisfies ${\sf TC}(X)\le 4$, with probability tending to $1$ as $n\to\infty$. There exists a unique (up to isomorphism) $\infty$-ample complex on a countable set of vertices (the Rado complex), and the second part of the paper surveys the results about universality, homogeneity, indestructibility and other important properties of this complex. The Appendix written by J.A. Barmak discusses connectivity of conic and ample complexes.

Submitted 18 January, 2023; originally announced January 2023.

Comments: The Appendix was written by J.A. Barmak. arXiv admin note: text overlap with arXiv:2012.01483

arXiv:2301.02200 [pdf, other]

doi 10.1109/ICCRE57112.2023.10155607

Impact, Attention, Influence: Early Assessment of Autonomous Driving Datasets

Authors: Daniel Bogdoll, Jonas Hendl, Felix Schreyer, Nishanth Gowda, Michael Färber, J. Marius Zöllner

Abstract: Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has been conducted in other fields, it rarely revolves around datasets. Thus, the impact, attention, and influence of datasets on autonomous driving remain a rarely investigated field. In this work, we provide a scientometric analysis for over 200 datasets in AD. We perform a rigorous evaluation of relations between available metadata and citation counts based on linear regression. Subsequently, we propose an Influence Score to assess a dataset already early on without the need for a track-record of citations, which is only available with a certain delay.

Submitted 31 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: Daniel Bogdoll and Jonas Hendl contributed equally. Accepted for publication at ICCRE 2023

arXiv:2212.11765 [pdf, other]

Predicting Companies' ESG Ratings from News Articles Using Multivariate Timeseries Analysis

Authors: Tanja Aue, Adam Jatowt, Michael Färber

Abstract: Environmental, social and governance (ESG) engagement of companies moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic approaches for forecasting ESG ratings have been quite scarce despite the increasing importance of the topic. In this paper, we build a model to predict ESG ratings from news articles using the combination of multivariate timeseries construction and deep learning techniques. A news dataset for about 3,000 US companies together with their ratings is also created and released for training. Through the experimental evaluation we find out that our approach provides accurate results outperforming the state-of-the-art, and can be used in practice to support a manual determination or analysis of ESG ratings.

Submitted 13 November, 2022; originally announced December 2022.

arXiv:2212.01091 [pdf, other]

Sequential parametrized motion planning and its complexity, II

Authors: Michael Farber, Amit Kumar Paul

Abstract: This is a continuation of our recent paper in which we developed the theory of sequential parametrized motion planning. A sequential parametrized motion planning algorithm produces a motion of the system which is required to visit a prescribed sequence of states, in a certain order, at specified moments of time. In the previous publication we analysed the sequential parametrized topological complexity of the Fadell-Neuwirth fibration which is relevant to the problem of moving multiple robots avoiding collisions with other robots and with obstacles in the Euclidean space. Besides, in the preceding paper we found the sequential parametrised topological complexity of the Fadell-Neuwirth bundle for the case of the Euclidean space $\Bbb R^d$ of odd dimension as well as the case $d=2$. In the present paper we give the complete answer for arbitrary even $d\ge 2$. Moreover, we present an explicit motion planning algorithm for controlling multiple robots in $\Bbb R^d$ having the minimal possible topological complexity; this algorithm is applicable to any number $n$ of robots and any number $m\ge 2$ of obstacles.

Submitted 2 December, 2022; originally announced December 2022.

MSC Class: 55M30

arXiv:2209.05418 [pdf, ps, other]

The homology of random simplicial complexes in the multi-parameter upper model

Authors: Michael Farber, Tahl Nowik

Abstract: We study random simplicial complexes in the multi-parameter upper model. In this model simplices of various dimensions are taken randomly and independently, and our random simplicial complex $Y$ is then taken to be the minimal simplicial complex containing this collection of simplices. We study the asymptotic behavior of the homology of $Y$ as the number of vertices goes to $\infty$. We observe the following phenomenon asymptotically almost surely. The given probabilities with which the simplices are taken determine a range of dimensions $\ell \leq k \leq \ell'$ with $\ell' \leq 2\ell +1$, outside of which the homology of $Y$ vanishes. Within this range, the homologies diminish drastically from dimension to dimension. In particular, the homology in the critical dimension $\ell$ is significantly the largest.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2209.01990 [pdf, ps, other]

Sequential parametrized topological complexity and related invariants

Authors: Michael Farber, John Oprea

Abstract: Parametrized motion planning algorithms \cite{CFW} have a high degree of universality and flexibility; they generate the motion of a robotic system under a variety of external conditions. The latter are viewed as parameters and constitute part of the input of the algorithm. The concept of sequential parametrized topological complexity ${\sf TC}_r[p:E\to B]$ is a measure of the complexity of such algorithms. It was studied in \cite{CFW, CFW2} for $r=2$ and in \cite{FP} for $r\ge 2$. In this paper we analyse the dependence of the complexity ${\sf TC}_r[p:E\to B]$ on an initial bundle with structure group $G$ and on its fibre $X$ viewed as a $G$-space. Our main results estimate ${\sf TC}_r[p:E\to B]$ in terms of certain invariants of the bundle and the action on the fibre. Moreover, we also obtain estimates depending on the base and the fibre. Finally, we develop a calculus of sectional categories featuring a new invariant ${\sf secat}_f[p:E\to B]$ which plays an important role in the study of sectional category of towers of fibrations.

Submitted 26 February, 2023; v1 submitted 5 September, 2022; originally announced September 2022.

MSC Class: 55M30

arXiv:2206.07688 [pdf, other]

Spectra of infinite graphs with summable weight functions

Authors: Michael Farber, Lewin Strauss

Abstract: In this paper we study spectra of Laplacians of infinite weighted graphs. Instead of the assumption of local finiteness we impose the condition of summability of the weight function. Such graphs correspond to reversible Markov chains with countable state spaces. We adapt the concept of the Cheeger constant to this setting and prove an analogue of the Cheeger inequality characterising the spectral gap. We also analyse the concept of the dual Cheeger constant originally introduced in \cite{B14}, which allows estimating the top of the spectrum. In this paper we also introduce a new combinatorial invariant, k$(G,m)$, which allows a complete characterisation of bipartite graphs and measures the asymmetry of the spectrum (the Hausdorff distance between the spectrum and its reflection at point $1\in \Bbb R$). We compare k$(G, m)$ to the Cheeger and the dual Cheeger constants. Finally, we analyse in full detail a class of infinite complete graphs and their spectra.

Submitted 25 August, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

MSC Class: 05C50; 05C63; 05C48; 05C81

arXiv:2205.08453 [pdf, ps, other]

Sequential Parametrized Motion Planning and its Complexity

Authors: Michael Farber, Amit Kumar Paul

Abstract: In this paper we develop the theory of sequential parametrized motion planning which generalises the approach of parametrized motion planning, which was introduced recently in [3]. A sequential parametrized motion planning algorithm produces a motion of the system which is required to visit a prescribed sequence of states, in a certain order, at specified moments of time. The sequential parametrized algorithms are universal as the external conditions are not fixed in advance but rather constitute part of the input of the algorithm. The second part of this article consists of a detailed analysis of the sequential parametrized topological complexity of the Fadell-Neuwirth fibration. In the language of robotics, sections of the Fadell-Neuwirth fibration are algorithms for moving multiple robots avoiding collisions with other robots and with obstacles in Euclidean space. In the last section of the paper we introduce the new notion of TC-generating function of a fibration, examine examples and raise some general questions about its analytic properties.

Submitted 17 September, 2022; v1 submitted 17 May, 2022; originally announced May 2022.

MSC Class: 55M30

arXiv:2205.02048 [pdf, other]

Few-Shot Document-Level Relation Extraction

Authors: Nicholas Popovic, Michael Färber

Abstract: We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and construct a benchmark based on two existing supervised learning data sets, DocRED and sciERC. We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation. We find FSDLRE to be a challenging setting with interesting new characteristics such as the ability to sample NOTA instances from the support set. The data, code, and trained models are available online (https://github.com/nicpopovic/FREDo).

Submitted 1 July, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: Published at NAACL 2022

arXiv:2205.02033 [pdf, other]

doi 10.1145/3529372.3530953

How Does Author Affiliation Affect Preprint Citation Count? Analyzing Citation Bias at the Institution and Country Level

Authors: Chifumi Nishioka, Michael Färber, Tarek Saier

Abstract: Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to the author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and country level, using the Lorenz curve and Gini coefficient. This allows us to mitigate the effects of confounding factors and see whether or not citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than for publisher versions. Thus, we can confirm that citation bias exists and that it is more severe in the case of preprints. As preprints are on the rise, affiliation-based citation bias is, thus, an important topic not only for authors (e.g., when deciding what to cite), but also for people and institutions that use citations for scientific impact quantification (e.g., funding agencies deciding about funding based on citation counts).

Submitted 4 May, 2022; originally announced May 2022.

Comments: Accepted at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022

arXiv:2203.05325 [pdf, other]

AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities

Authors: Nicholas Popovic, Walter Laurito, Michael Färber

Abstract: In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction. This means that the system can be trained even on data sets where only a subset of all valid entity spans is annotated. We provide an extensive evaluation of the proposed system and its strengths and weaknesses. Our approach, which can be scaled dynamically in computational complexity at inference time, produces predictions with high precision and reaches 3rd place in the leaderboard of SemEval-2022 Task 12. For inputs in the domain of physics and math, it achieves high relation extraction macro F1 scores of 95.43% and 79.17%, respectively. The code used for training and evaluating our models is available at: https://github.com/nicpopovic/RE1st

Submitted 4 May, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

Comments: Camera ready version

arXiv:2202.05801 [pdf, other]

Parametrized motion planning and topological complexity

Authors: Michael Farber, Shmuel Weinberger

Abstract: In this paper we study parametrized motion planning algorithms which provide universal and flexible solutions to diverse motion planning problems. Such algorithms are intended to function under a variety of external conditions which are viewed as parameters and serve as part of the input of the algorithm. Continuing a recent paper, we study further the concept of parametrized topological complexity. We analyse in full detail the problem of controlling a swarm of robots in the presence of multiple obstacles in Euclidean space, which served for us as a natural motivating example. We present an explicit parametrized motion planning algorithm solving the motion planning problem for any number of robots and obstacles. This algorithm is optimal: it has the minimal possible topological complexity for any d odd. Besides, we describe a modification of this algorithm which is optimal for d even. We also analyse the parametrized topological complexity of sphere bundles using the Stiefel-Whitney characteristic classes.

Submitted 23 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

MSC Class: 58E05

arXiv:2202.05796 [pdf, ps, other]

Parametrized topological complexity of sphere bundles

Authors: Michael Farber, Shmuel Weinberger

Abstract: Parametrized motion planning algorithms have a high degree of flexibility and universality: they can work under a variety of external conditions, which are viewed as parameters and form part of the input of the algorithm. In this paper we analyse the parametrized motion planning problem in the case of sphere bundles. Our main results provide upper and lower bounds for the parametrized topological complexity; the upper bounds typically involve sectional categories of the associated fibrations and the lower bounds are given in terms of characteristic classes and their properties. We explicitly compute the parametrized topological complexity in many examples and show that it may assume arbitrarily large values.

Submitted 12 May, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

MSC Class: 58E05

arXiv:2112.00859 [pdf, other]

Are Investors Biased Against Women? Analyzing How Gender Affects Startup Funding in Europe

Authors: Michael Färber, Alexander Klein

Abstract: One of the main challenges of startups is to raise capital from investors. For startup founders, it is therefore crucial to know whether investors have a bias against women as startup founders and in which way startups face disadvantages due to gender bias. Existing works on gender studies have mainly analyzed the US market. In this paper, we aim to give a more comprehensive picture of gender bias in early-stage startup funding. We examine European startups listed on Crunchbase using Semantic Web technologies and analyze how the share of female founders in a founding team affects the funding amount. We find that the relative share of female founders has a negative impact on the funding raised. Furthermore, we observe that founder characteristics have an effect on the funding raised based on the founders' gender. Moreover, we find that gender bias in early-stage funding is less prevalent for serial founders with entrepreneurial experience, as female founders benefit three times more than male founders from already having founded a startup. Overall, our study suggests that gender bias exists and is worth considering in the context of startup funding.

Submitted 1 December, 2021; originally announced December 2021.

Comments: 35 pages

arXiv:2112.00160 [pdf, other]

Towards Full-Fledged Argument Search: A Framework for Extracting and Clustering Arguments from Unstructured Text

Authors: Michael Färber, Anna Steyer

Abstract: Argument search aims at identifying arguments in natural language texts. In the past, this task has been addressed by a combination of keyword search and argument identification on the sentence- or document-level. However, existing frameworks often address only specific components of argument search and do not address the following aspects: (1) argument-query matching: identifying arguments that frame the topic slightly differently than the actual search query; (2) argument identification: identifying arguments that consist of multiple sentences; (3) argument clustering: selecting retrieved arguments by topical aspects. In this paper, we propose a framework for addressing these shortcomings. We suggest (1) to combine the keyword search with precomputed topic clusters for argument-query matching, (2) to apply a novel approach based on sentence-level sequence-labeling for argument identification, and (3) to present aggregated arguments to users based on topic-aware argument clustering. Our experiments on several real-world debate data sets demonstrate that density-based clustering algorithms, such as HDBSCAN, are particularly suitable for argument-query matching. With our sentence-level, BiLSTM-based sequence-labeling approach we achieve a macro F1 score of 0.71. Finally, evaluating our argument clustering method indicates that a fine-grained clustering of arguments by subtopics remains challenging but is worth exploring.

Submitted 30 November, 2021; originally announced December 2021.

arXiv:2111.05097 [pdf, other]

doi 10.1007/s00799-021-00312-z

Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Usage, and Impact

Authors: Tarek Saier, Michael Färber, Tornike Tsereteli

Abstract: Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.

Submitted 10 November, 2021; v1 submitted 7 November, 2021; originally announced November 2021.

Comments: to be published in the International Journal on Digital Libraries

ACM Class: H.3.3; H.3.7; I.2.7

arXiv:2109.09389 [pdf, other]

Explaining Convolutional Neural Networks by Tagging Filters

Authors: Anna Nguyen, Daniel Hagenmayer, Tobias Weller, Michael Färber

Abstract: Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which are not very intuitive for non-experts in analyzing a CNN classification. In this paper, we propose FilTag, an approach to effectively explain CNNs even to non-experts. The idea is that when images of a class frequently activate a convolutional filter, then that filter is tagged with that class. These tags provide an explanation to a reference of a class-specific feature detected by the filter. Based on the tagging, individual image classifications can then be intuitively explained in terms of the tags of the filters that the input image activates. Finally, we show that the tags are helpful in analyzing classification errors caused by noisy input images and that the tags can be further processed by machines.

Submitted 20 September, 2021; originally announced September 2021.

arXiv:2106.13722 [pdf, other]

A Curiously Effective Backtracking Strategy for Connection Tableaux

Authors: Michael Färber

Abstract: Automated proof search with connection tableaux, such as implemented by Otten's leanCoP prover, depends on backtracking for completeness. Otten's restricted backtracking strategy loses completeness, yet for many problems, it significantly reduces the time required to find a proof. I introduce a new, less restricted backtracking strategy based on the notion of exclusive cuts. I implement the strategy in a new prover called meanCoP and show that it greatly improves upon the previous best strategy in leanCoP.

Submitted 16 January, 2024; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Accepted at AReCCa 2023

ACM Class: F.4.1

arXiv:2102.08766 [pdf, other]

doi 10.1145/3497775.3503683

Safe, Fast, Concurrent Proof Checking for the lambda-Pi Calculus Modulo Rewriting

Authors: Michael Färber

Abstract: Several proof assistants, such as Isabelle or Coq, can concurrently check multiple proofs. In contrast, the vast majority of today's small proof checkers either do not support concurrency at all or support only limited forms thereof, restricting the efficiency of proof checking on multi-core processors. This work shows the design of a small, memory- and thread-safe kernel that efficiently checks proofs both concurrently and non-concurrently. This design is implemented in a new proof checker called Kontroli for the lambda-Pi calculus modulo rewriting, which is an established framework to uniformly express a multitude of logical systems. Kontroli is faster than the reference proof checker for this calculus, Dedukti, on all of five evaluated datasets obtained from proof assistants and interactive theorem provers. Furthermore, Kontroli reduces the time of the most time-consuming part of proof checking using eight threads by up to 6.6x.

Submitted 3 March, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: 11th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP '22), Jan 2022, Philadelphia, PA, United States

arXiv:2012.01483 [pdf, other]

doi 10.1007/s40879-021-00521-5

Ample simplicial complexes

Authors: Chaim Even-Zohar, Michael Farber, Lewis Mead

Abstract: Motivated by potential applications in network theory, engineering and computer science, we study $r$-ample simplicial complexes. These complexes can be viewed as finite approximations to the Rado complex which has a remarkable property of indestructibility, in the sense that removing any finite number of its simplexes leaves a complex isomorphic to itself. We prove that an $r$-ample simplicial complex is simply connected and $2$-connected for $r$ large. The number $n$ of vertexes of an $r$-ample simplicial complex satisfies $\exp(\Omega(\frac{2^r}{\sqrt{r}}))$. We use the probabilistic method to establish the existence of $r$-ample simplicial complexes with $n$ vertexes for any $n>r 2^r 2^{2^r}$. Finally, we introduce the iterated Paley simplicial complexes, which are explicitly constructed $r$-ample simplicial complexes with nearly optimal number of vertexes.

Submitted 2 December, 2020; originally announced December 2020.

Journal ref: European Journal of Mathematics, 2022, 8, 1-32

arXiv:2010.09809 [pdf, ps, other]

Parametrized topological complexity of collision-free motion planning in the plane

Authors: Daniel C. Cohen, Michael Farber, Shmuel Weinberger

Abstract: Parametrized motion planning algorithms have high degrees of universality and flexibility, as they are designed to work under a variety of external conditions, which are viewed as parameters and form part of the input of the underlying motion planning problem. In this paper, we analyze the parameterized motion planning problem for the motion of many distinct points in the plane, moving without collision and avoiding multiple distinct obstacles with a priori unknown positions. This complements our prior work [arXiv:2009.06023], where parameterized motion planning algorithms were introduced, and the obstacle-avoiding collision-free motion planning problem in three-dimensional space was fully investigated. The planar case requires different algebraic and topological tools than its spatial analog.

Submitted 14 October, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: revision includes an appendix on fibrations of certain mapping spaces

MSC Class: 55S40; 55M30; 55R80; 70Q05

arXiv:2009.06023 [pdf, ps, other]

doi 10.1137/20M1358505

Topology of parametrised motion planning algorithms

Authors: Daniel C. Cohen, Michael Farber, Shmuel Weinberger

Abstract: In this paper we introduce and study a new concept of parametrised topological complexity, a topological invariant motivated by the motion planning problem of robotics. In the parametrised setting, a motion planning algorithm has a high degree of universality and flexibility: it can function under a variety of external conditions (such as positions of the obstacles, etc.). We explicitly compute the parametrised topological complexity of obstacle-avoiding collision-free motion of many particles (robots) in 3-dimensional space. Our results show that the parametrised topological complexity can be significantly higher than the standard (nonparametrised) invariant.

Submitted 21 May, 2021; v1 submitted 13 September, 2020; originally announced September 2020.

Comments: To appear in SIAM Journal on Applied Algebra and Geometry

MSC Class: 55S40; 55M30; 55R80; 70Q05

Journal ref: SIAM Journal on Applied Algebra and Geometry, 5 (2021), 229-249

arXiv:2007.11924 [pdf, other]

Right for the Right Reason: Making Image Classification Robust

Authors: Anna Nguyen, Adrian Oberföll, Michael Färber

Abstract: The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reasons, i.e., based on incidental evidence. Of course, it is desirable that images are classified correctly for the right reasons, i.e., based on the actual evidence. To this end, we propose a new explanation quality metric to measure object-aligned explanation in image classification, which we refer to as the ObAlEx metric. Using object detection approaches, explanation approaches, and ObAlEx, we quantify the focus of CNNs on the actual evidence. Moreover, we show that additional training of the CNNs can improve the focus of CNNs without decreasing their accuracy.

Submitted 12 January, 2021; v1 submitted 23 July, 2020; originally announced July 2020.
