AutoShard -- Declaratively Managing Hot Spot Data Objects in NoSQL Document Stores

Authors: Stefanie Scherzinger, Andreas Thor

Abstract: NoSQL document stores are becoming increasingly popular as backends in web development. Not only do they scale out to large volumes of data, many systems are even custom-tailored for this domain: NoSQL document stores like Google Cloud Datastore have been designed to support massively parallel reads, and even guarantee strong consistency in updating single data objects. However, strongly consistent updates cannot be implemented arbitrarily fast in large-scale distributed systems. Consequently, data objects that experience high-frequent writes can turn into severe performance bottlenecks. In this paper, we present AutoShard, a ready-to-use object mapper for Java applications running against NoSQL document stores. AutoShard's unique feature is its capability to gracefully shard hot spot data objects to avoid write contention. Using AutoShard, developers can easily handle hot spot data objects by adding minimally intrusive annotations to their application code. Our experiments show the significant impact of sharding on both the write throughput and the execution time.

Submitted 1 November, 2021; originally announced November 2021.

Comments: Published at WebDB 2014

Journal ref: WebDB 2014

arXiv:2104.12557 [pdf, other]

Flexible Educational Software Architecture

Authors: Roy Meissner, Andreas Thor

Abstract: EAs.LiT is an e-assessment management and analysis software for which contextual requirements and usage scenarios changed over time. Based on these factors and further development activities, the decision was made to adopt a microservice architecture for EAs.LiT version 2 in order to increase its flexibility to adapt to new and changed circumstances. This architectural style and a few adopted technologies, like RDF as a data format, enabled an eased implementation of various use cases. Thus we consider the microservice architecture productive and recommend it for usage in other educational projects. The specific architecture of EAs.LiT version 2 is presented within this article, targeting to enable other educational projects to adopt it by using our work as a foundation or template.

Submitted 16 April, 2021; originally announced April 2021.

Comments: 8 pages, 4 figures, presented at the Workshop Intelligence Support for Mentoring Processes in Higher Education (IMHE) at ITS 2020, to be published in CEUR-WS Proceedings

arXiv:1901.08775 [pdf]

Which are the influential publications in the Web of Science subject categories over a long period of time? CRExplorer software used for big-data analyses in bibliometrics

Authors: Andreas Thor, Lutz Bornmann, Robin Haunschild, Loet Leydesdorff

Abstract: What are the landmark papers in scientific disciplines? On whose shoulders does research in these fields stand? Which papers are indispensable for scientific progress? These are typical questions which are not only of interest for researchers (who frequently know the answers - or guess to know them), but also for the interested general public. Citation counts can be used to identify very useful papers, since they reflect the wisdom of the crowd; in this case, the scientists using the published results for their own research. In this study, we identified with recently developed methods for the program CRExplorer landmark publications in nearly all Web of Science subject categories (WoSSCs). These are publications which belong more frequently than other publications across the citing years to the top-per mill in their subject category. The results for three subject categories "Information Science and Library Science", "Computer Science, Information Systems", and "Computer Science, Software Engineering" are exemplarily discussed in more detail. The results for the other WoSSCs can be found online at http://crexplorer.net.

Submitted 25 January, 2019; originally announced January 2019.

arXiv:1807.04673 [pdf]

doi 10.1177/0165551519837175

How to identify the roots of broad research topics and fields? The introduction of RPYS sampling using the example of climate change research

Authors: Robin Haunschild, Werner Marx, Andreas Thor, Lutz Bornmann

Abstract: Since the introduction of the reference publication year spectroscopy (RPYS) method and the corresponding program CRExplorer, many studies have been published revealing the historical roots of topics, fields, and researchers. The application of the method was restricted up to now by the available memory of the computer used for running the CRExplorer. Thus, many users could not perform RPYS for broader research fields or topics. In this study, we present various sampling methods to solve this problem: random, systematic, and cluster sampling. We introduce the script language of the CRExplorer which can be used to draw many samples from the population dataset. Based on a large dataset of publications from climate change research, we compare RPYS results using population data with RPYS results using different sampling techniques. From our comparison with the full RPYS (population spectrogram), we conclude that the cluster sampling performs worst and the systematic sampling performs best. The random sampling also performs very well but not as well as the systematic sampling. The study therefore demonstrates the fruitfulness of the sampling approach for applying RPYS.

Submitted 12 July, 2018; originally announced July 2018.

Comments: 30 pages, 6 figures, 3 script listings, and 4 tables

Journal ref: Journal of Information Science, in print (2019), DOI 10.1177/0165551519837175

arXiv:1801.08720 [pdf]

Identifying single influential publications in a research field: New analysis opportunities of the CRExplorer

Authors: Andreas Thor, Lutz Bornmann, Werner Marx, Rüdiger Mutz

Abstract: Reference Publication Year Spectroscopy (RPYS) has been developed for identifying the cited references (CRs) with the greatest influence in a given paper set (mostly sets of papers on certain topics or fields). The program CRExplorer (see www.crexplorer.net) was specifically developed by Thor, Marx, Leydesdorff, and Bornmann (2016a, 2016b) for applying RPYS to publication sets downloaded from Scopus or Web of Science. In this study, we present some advanced methods which have been newly developed for CRExplorer. These methods are able to identify and characterize the CRs which have been influential across a longer period (many citing years). The new methods are demonstrated in this study using all the papers published in Scientometrics between 1978 and 2016. The indicators N_TOP50, N_TOP25, and N_TOP10 can be used to identify those CRs which belong to the 50%, 25%, or 10% most frequently cited publications (CRs) over many citing publication years. In the Scientometrics dataset, for example, Lotka's (1926) paper on the distribution of scientific productivity belongs to the top 10% publications (CRs) in 36 citing years. Furthermore, the new version of CRExplorer analyzes the impact sequence of CRs across citing years. CRs can have below average (-), average (0), or above average (+) impact in citing years (whereby average is meant in the sense of expected values). The sequence (e.g. 00++---0--00) is used by the program to identify papers with typical impact distributions. For example, CRs can have early, but not late impact ("hot papers", e.g. +++---) or vice versa ("sleeping beauties", e.g. ---0000---++).

Submitted 21 March, 2018; v1 submitted 26 January, 2018; originally announced January 2018.

arXiv:1608.07960 [pdf]

doi 10.1007/s11192-016-2115-y

Which early works are cited most frequently in climate change research literature? A bibliometric approach based on Reference Publication Year Spectroscopy

Authors: Werner Marx, Robin Haunschild, Andreas Thor, Lutz Bornmann

Abstract: This bibliometric analysis focuses on the general history of climate change research and, more specifically, on the discovery of the greenhouse effect. First, the Reference Publication Year Spectroscopy (RPYS) is applied to a large publication set on climate change of 222,060 papers published between 1980 and 2014. The references cited therein were extracted and analyzed with regard to publications, which are cited most frequently. Second, a new method for establishing a more subject-specific publication set for applying RPYS (based on the co-citations of a marker reference) is proposed (RPYS-CO). The RPYS of the climate change literature focuses on the history of climate change research in total. We identified 35 highly-cited publications across all disciplines, which include fundamental early scientific works of the 19th century (with a weak connection to climate change) and some cornerstones of science with a stronger connection to climate change. By using the Arrhenius (1896) paper as a RPYS-CO marker paper, we selected only publications specifically discussing the discovery of the greenhouse effect and the role of carbon dioxide. Also, we focused on the time period 1800-1850 to reveal the contributions of J.B.J Fourier in terms of cited references. Using different RPYS approaches in this study, we were able to identify the complete range of works of the celebrated icons as well as many less known works relevant for the history of climate change research. The analyses confirmed the potential of the RPYS method for historical studies: Seminal papers are detected on the basis of the references cited by the overall community without any further assumptions.

Submitted 28 October, 2016; v1 submitted 29 August, 2016; originally announced August 2016.

Comments: in press at Scientometrics

arXiv:1607.01266 [pdf]

New features of CitedReferencesExplorer (CRExplorer)

Authors: Andreas Thor, Werner Marx, Loet Leydesdorff, Lutz Bornmann

Abstract: CRExplorer version 1.6.7 was released on July 5, 2016. This version includes the following new features and improvements: Scopus: Using "File" - "Import" - "Scopus", CRExplorer reads files from Scopus. The file format "CSV" (including citations, abstracts and references) should be chosen in Scopus for downloading records. Export facilities: Using "File" - "Export" - "Scopus", CRExplorer exports files in the Scopus format. Using "File" - "Export" - "Web of Science", CRExplorer exports files in the Web of Science format. These files can be imported in other bibliometric programs (e.g. VOSviewer). Space bar: Select a specific cited reference in the cited references table, press the space bar, and all bibliographic details of the CR are shown. Internal file format: Using "File" - "Save", working files are saved in the internal file format "*.cre". The files include all data including matching results and manual matching corrections. The files can be opened by using "File" - "Open".

Submitted 20 July, 2016; v1 submitted 5 July, 2016; originally announced July 2016.

Comments: Accepted for publication in Scientometrics

arXiv:1604.04705 [pdf]

Referenced Publication Year Spectroscopy (RPYS) and Algorithmic Historiography: The Bibliometric Reconstruction of András Schubert's Œuvre

Authors: Loet Leydesdorff, Lutz Bornmann, Jordan Comins, Werner Marx, Andreas Thor

Abstract: Referenced Publication Year Spectroscopy (RPYS) was recently introduced as a method to analyze the historical roots of research fields and groups or institutions. RPYS maps the distribution of the publication years of the cited references in a document set. In this study, we apply this methodology to the œuvre of an individual researcher on the occasion of a Festschrift for András Schubert's 70th birthday. We discuss the different options of RPYS in relation to one another (e.g. Multi-RPYS), and in relation to the longer-term research program of algorithmic historiography (e.g., HistCite) based on Schubert's publications (n=172) and cited references therein as a bibliographic domain in scientometrics. Main path analysis and Multi-RPYS of the citation network are used to show the changes and continuities in Schubert's intellectual career. Diachronic and static decomposition of a document set can lead to different results, while the analytically distinguishable lines of research may overlap and interact over time, and intermittent.

Submitted 16 April, 2016; originally announced April 2016.

Comments: Leydesdorff, L., Bornmann, L., Comins, J. A., Marx, W., & Thor, A. (2016). Referenced Publication Year Spectrography (RPYS) and Algorithmic Historiography: A Bibliometric Reconstruction of András Schubert's Œuvre. In W. Glänzel & B. Schlemmer (Eds.), András Schubert--A World of Models and Metrics (pp. 79-96). Louvain: ISSI

arXiv:1601.01199 [pdf]

Introducing CitedReferencesExplorer (CRExplorer): A program for Reference Publication Year Spectroscopy with Cited References Standardization

Authors: Andreas Thor, Werner Marx, Loet Leydesdorff, Lutz Bornmann

Abstract: We introduce a new tool - the CitedReferencesExplorer (CRExplorer, www.crexplorer.net) - which can be used to disambiguate and analyze the cited references (CRs) of a publication set downloaded from the Web of Science (WoS). The tool is especially suitable to identify those publications which have been frequently cited by the researchers in a field and thereby to study for example the historical roots of a research field or topic. CRExplorer simplifies the identification of key publications by enabling the user to work with both a graph for identifying most frequently cited reference publication years (RPYs) and the list of references for the RPYs which have been most frequently cited. A further focus of the program is on the standardization of CRs. It is a serious problem in bibliometrics that there are several variants of the same CR in the WoS. In this study, CRExplorer is used to study the CRs of all papers published in the Journal of Informetrics. The analyses focus on the most important papers published between 1980 and 1990.

Submitted 16 February, 2016; v1 submitted 6 January, 2016; originally announced January 2016.

Comments: Accepted for publication in the Journal of Informetrics

arXiv:1204.2731 [pdf, other]

How do Ontology Mappings Change in the Life Sciences?

Authors: Anika Gross, Michael Hartung, Andreas Thor, Erhard Rahm

Abstract: Mappings between related ontologies are increasingly used to support data integration and analysis tasks. Changes in the ontologies also require the adaptation of ontology mappings. So far the evolution of ontology mappings has received little attention albeit ontologies change continuously especially in the life sciences. We therefore analyze how mappings between popular life science ontologies evolve for different match algorithms. We also evaluate which semantic ontology changes primarily affect the mappings. We further investigate alternatives to predict or estimate the degree of future mapping changes based on previous ontology and mapping transitions.

Submitted 12 April, 2012; originally announced April 2012.

Comments: Keywords: mapping evolution, ontology matching, ontology evolution

arXiv:1108.1631 [pdf, other]

Load Balancing for MapReduce-based Entity Resolution

Authors: Lars Kolb, Andreas Thor, Erhard Rahm

Abstract: The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.

Submitted 8 August, 2011; originally announced August 2011.

ACM Class: H.2.4

arXiv:1010.3053 [pdf, other]

Parallel Sorted Neighborhood Blocking with MapReduce

Authors: Lars Kolb, Andreas Thor, Erhard Rahm

Abstract: Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.

Submitted 14 October, 2010; originally announced October 2010.

arXiv:1003.4418 [pdf]

Evaluation of Query Generators for Entity Search Engines

Authors: Stefan Endrullis, Andreas Thor, Erhard Rahm

Abstract: Dynamic web applications such as mashups need efficient access to web data that is only accessible via entity search engines (e.g. product or publication search engines). However, most current mashup systems and applications only support simple keyword searches for retrieving data from search engines. We propose the use of more powerful search strategies building on so-called query generators. For a given set of entities query generators are able to automatically determine a set of search queries to retrieve these entities from an entity search engine. We demonstrate the usefulness of query generators for on-demand web data integration and evaluate the effectiveness and efficiency of query generators for a challenging real-world integration scenario.

Submitted 23 March, 2010; originally announced March 2010.

ACM Class: H.3.3; H.3.4

Showing 1–13 of 13 results for author: Thor, A