Search | arXiv e-print repository

Individual context-free online community health indicators fail to identify open source software sustainability

Authors: Yo Yehudi, Carole Goble, Caroline Jay

Abstract: The global value of open source software is estimated to be in the billions or trillions worldwide1, but despite this, it is often under-resourced and subject to high-impact security vulnerabilities and stability failures2,3. In order to investigate factors contributing to open source community longevity, we monitored thirty-eight open source projects over the period of a year, focusing primarily, but not exclusively, on open science-related online code-oriented communities. We measured performance indicators, using both subjective and qualitative measures (participant surveys), as well as using computational scripts to retrieve and analyse indicators associated with these projects' online source control codebases. None of the projects were abandoned during this period, and only one project entered a planned shutdown. Project ages spanned from under one year to over forty years old at the start of the study, and results were highly heterogeneous, showing little commonality across documentation, mean response times for issues and code contributions, and available funding/staffing resources. Whilst source code-based indicators were able to offer some insights into project activity, we observed that similar indicators across different projects often had very different meanings when context was taken into account. We conclude that the individual context-free metrics we studied were not sufficient or essential for project longevity and sustainability, and might even become detrimental if used to support high-stakes decision making.
When attempting to understand an online open community's longer-term sustainability, we recommend that researchers avoid cross-project quantitative comparisons, and advise instead that they use single-project-level assessments which combine quantitative measures with contextualising qualitative data.

Submitted 9 May, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: 99 pages, 34 tables, 19 figures

arXiv:2208.12346 [pdf]

doi 10.1038/s41597-023-02627-9

Subjective data models in bioinformatics: Do wet-lab and computational biologists comprehend data differently?

Authors: Yo Yehudi, Lukas Hughes-Noehrer, Carole Goble, Caroline Jay

Abstract: Biological science produces large amounts of data in a variety of formats, which necessitates the use of computational tools to process, integrate, analyse, and glean insights from the data. Researchers who use computational biology tools range from those who use computers primarily for communication and data lookup, to those who write complex software programs in order to analyse data or make it easier for others to do so. This research examines how people differ in how they conceptualise the same data, for which we coin the term "subjective data models". We interviewed 22 people with biological experience and varied levels of computational experience to elicit their perceptions of the same subset of biological data entities. The results suggest that many people had fluid subjective data models that would change depending on the circumstance or tool they were using. Surprisingly, results generally did not seem to cluster around a participant's computational experience/education levels, or the lack thereof. We further found that people did not consistently map entities from an abstract data model to the same identifiers in real-world files, and found that certain data identifier formats were easier for participants to infer meaning from than others.
Real-world implications of these findings suggest that 1) software engineers should design interfaces for task performance and emulate other related popular user interfaces, rather than targeting a person's professional background; 2) when insufficient context is provided, people may guess what data means, whether or not their guesses are correct, emphasising the importance of providing contextual metadata when preparing data for re-use by others, to remove the need for potentially erroneous guesswork.

Submitted 25 August, 2022; originally announced August 2022.

Comments: 18 pages, 1 figure, 3 tables

arXiv:2205.12098 [pdf, other]

COVID-19: An exploration of consecutive systemic barriers to pathogen-related data sharing during a pandemic

Authors: Yo Yehudi, Lukas Hughes-Noehrer, Carole Goble, Caroline Jay

Abstract: In 2020, the COVID-19 pandemic resulted in a rapid response from governments and researchers worldwide. As of late 2023, millions have died as a result of COVID-19, with many COVID-19 survivors going on to experience long-term effects weeks, months, or years after their illness. Despite this staggering toll, those who work with pandemic-relevant data often face significant systemic barriers to accessing, sharing or re-using this data. In this paper we report the results of a study in which we interviewed data professionals working with COVID-19-relevant data types including social media, mobility, viral genome, testing, infection, hospital admission, and deaths. These data types are variously used for pandemic spread modelling, healthcare system strain awareness, and devising therapeutic treatments for COVID-19. Barriers to data access, sharing and re-use include the cost of access to data (primarily certain healthcare sources and mobility data from mobile phone carriers), human throughput bottlenecks, unclear pathways to request access to data, unnecessarily strict access controls and data re-use policies, unclear data provenance, inability to link separate data sources that could collectively create a more complete picture, poor adherence to metadata standards, and a lack of computer-suitable data formats.

Submitted 22 December, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: 35 pages including references, three figures. To be submitted to Data and Policy

Showing 1–3 of 3 results for author: Yehudi, Y