-
WikiTexVC: MediaWiki's native LaTeX to MathML converter for Wikipedia
Authors:
Johannes Stegmüller,
Moritz Schubotz
Abstract:
MediaWiki and Wikipedia authors usually use LaTeX to define mathematical formulas in the wiki text markup. In the Wikimedia ecosystem, these formulas were processed by a long cascade of web services and finally delivered to users' browsers as rendered SVG images.
With the recent support for MathML Core in Chromium-based browsers, MathML continues on its path to becoming a de facto standard markup language for mathematical notation on the web. Conveying formulas in MathML enables semantic annotation and machine readability for extended interpretation of mathematical content, for example by accessibility technologies.
With this work, we present WikiTexVC, a novel method for validating LaTeX formulas from wiki texts and converting them to MathML, which is directly integrated into MediaWiki. This mitigates the shortcomings of previously used rendering methods in MediaWiki in terms of robustness, maintainability and performance. In addition, there is no need for a multitude of web services running in the background; instead, processing takes place directly within MediaWiki instances. We validated this method with an extended dataset of over 300k formulas, which have been incorporated as automated tests into the MediaWiki continuous integration instances. Furthermore, we conducted an evaluation with 423 formulas, comparing the tree edit distance of the produced parse trees to those of other MathML renderers. Our method has been made available as open source, can be used on German Wikipedia, and is delivered with recent MediaWiki versions. As a practical example of enabling semantic annotations within our method, we present a new macro that adds content for formula disambiguation to facilitate accessibility for visually impaired people.
Submitted 30 January, 2024;
originally announced January 2024.
-
Making Mathematical Research Data FAIR: A Technology Overview
Authors:
Tim Conrad,
Eloi Ferrer,
Daniel Mietchen,
Larissa Pusch,
Johannes Stegmuller,
Moritz Schubotz
Abstract:
The sharing and citation of research data is increasingly recognized as an essential building block of scientific research across various fields and disciplines. Sharing research data allows other researchers to reproduce results, replicate findings, and build on them. Ultimately, this will foster faster cycles in knowledge generation. Some disciplines, such as astronomy or bioinformatics, already have a long history of sharing data; many others do not. The current landscape of so-called research data repositories is diverse. This paper presents a technology review of existing data repositories and portals, with a focus on mathematical research data.
Submitted 21 September, 2023;
originally announced September 2023.
-
Bravo MaRDI: A Wikibase Powered Knowledge Graph on Mathematics
Authors:
Moritz Schubotz,
Eloi Ferrer,
Johannes Stegmüller,
Daniel Mietchen,
Olaf Teschke,
Larissa Pusch,
Tim OF Conrad
Abstract:
Mathematical world knowledge is a fundamental component of Wikidata. However, to date, no expertly curated knowledge graph has focused specifically on contemporary mathematics. Addressing this gap, the Mathematical Research Data Initiative (MaRDI) has developed a comprehensive knowledge graph that links multimodal research data in mathematics. It encompasses traditional research data items such as datasets, software, and publications, as well as semantically advanced objects such as mathematical formulas and hypotheses. This paper details the capabilities of the MaRDI knowledge graph, which is based on Wikibase, leading up to its inaugural public release, codenamed Bravo, available at https://portal.mardi4nfdi.de.
Submitted 20 September, 2023;
originally announced September 2023.
-
Detecting Cross-Language Plagiarism using Open Knowledge Graphs
Authors:
Johannes Stegmüller,
Fabian Bauer-Marquart,
Norman Meuschke,
Terry Ruas,
Moritz Schubotz,
Bela Gipp
Abstract:
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
Submitted 16 December, 2021; v1 submitted 18 November, 2021;
originally announced November 2021.