Search | arXiv e-print repository

Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models

Authors: Sowmya S. Sundaram, Benjamin Solomon, Avani Khatri, Anisha Laumas, Purvesh Khatri, Mark A. Musen

Abstract: Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.

Submitted 17 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2312.09107 [pdf]

A Comprehensive Approach to Ensuring Quality in Spreadsheet-Based Metadata

Authors: Martin J. O'Connor, Marcos Martínez-Romero, Mete Ugur Akdogan, Josef Hardi, Mark A. Musen

Abstract: While scientists increasingly recognize the importance of metadata in describing their data, spreadsheets remain the preferred tool for supplying this information despite their limitations in ensuring compliance and quality. Various tools have been developed to address these limitations, but they suffer from their own shortcomings, such as steep learning curves and limited customization. In this paper, we describe an end-to-end approach that supports spreadsheet-based entry of metadata while providing rigorous compliance and quality control. Our approach employs several key strategies, including customizable templates for defining metadata, integral support for the use of controlled terminologies when defining these templates, and an interactive Web-based tool that allows users to rapidly identify and fix errors in the spreadsheet-based metadata they supply. We demonstrate how this approach is being deployed in a biomedical consortium to define and collect metadata about scientific experiments.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2307.13085 [pdf, other]

doi 10.4126/FRL01-006444995

Making Metadata More FAIR Using Large Language Models

Authors: Sowmya S. Sundaram, Mark A. Musen

Abstract: With the global increase in experimental data artifacts, harnessing them in a unified fashion leads to a major stumbling block - bad metadata. To bridge this gap, this work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata. Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms. This measure can then be utilized for analyzing varied metadata, by suggesting terms for compliance or grouping similar terms for identification of replaceable terms. The efficacy of the algorithm is presented qualitatively and quantitatively on publicly available research artifacts and demonstrates large gains across metadata related tasks through an in-depth study of a wide variety of Large Language Models (LLMs). This software can drastically reduce the human effort in sifting through various natural language metadata while employing several experimental datasets on the same topic.

Submitted 24 July, 2023; originally announced July 2023.

Journal ref: DaMaLOS 2023

arXiv:2208.02836 [pdf]

Modeling community standards for metadata as templates makes data FAIR

Authors: Mark A. Musen, Martin J. O'Connor, Erik Schultes, Marcos Martinez-Romero, Josef Hardi, John Graybeal

Abstract: It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets--both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.

Submitted 14 October, 2022; v1 submitted 4 August, 2022; originally announced August 2022.

Comments: 20 pages, 1 table, 5 figures

arXiv:2105.07238 [pdf]

Using Ethnographic Methods to Classify the Human Experience in Medicine: A Case Study of the Presence Ontology

Authors: Amrapali Maitra, Maulik R. Kamdar, Donna M. Zulman, Marie C. Haverfield, Cati Brown-Johnson, Rachel Schwartz, Sonoo Thadaney Israni, Abraham Verghese, Mark A. Musen

Abstract: Objective: Although social and environmental factors are central to provider-patient interactions, the data that reflect these factors can be incomplete, vague, and subjective. We sought to create a conceptual framework to describe and classify data about presence, the domain of interpersonal connection in medicine. Methods: Our top-down approach for ontology development based on the concept of relationality included 1) broad survey of social sciences literature and systematic literature review of more than 20,000 articles around interpersonal connection in medicine, 3) relational ethnography of clinical encounters (5 pilot, 27 full) and 4) interviews about relational work with 40 medical and nonmedical professionals. We formalized the model using the Web Ontology Language in the Protege ontology editor. We iteratively evaluated and refined the Presence Ontology through manual expert review and automated annotation of literature. Results and Discussion: The Presence Ontology facilitates the naming and classification of concepts that would otherwise be vague. Our model categorizes contributors to healthcare encounters and factors such as Communication, Emotions, Tools, and Environment. Ontology evaluation indicated that Cognitive Models (both patients' explanatory models and providers' caregiving approaches) influenced encounters and were subsequently incorporated. We show how ethnographic methods based in relationality can aid the representation of experiential concepts (e.g., empathy, trust). Our ontology could support informatics applications to improve healthcare such as annotation of videotaped encounters, clinical instruments to measure presence, or EHR-based reminders for providers. Conclusion: The Presence Ontology provides a model for using ethnographic approaches to classify interpersonal data.

Submitted 15 May, 2021; originally announced May 2021.

Comments: 15 pages, 4 figures, 57 references

arXiv:2007.14474 [pdf]

Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies

Authors: Katy Börner, Ellen M. Quardokus, Bruce W. Herr II, Leonard E. Cross, Elizabeth G. Record, Yingnan Ju, Andreas D. Bueckle, James P. Sluka, Jonathan C. Silverstein, Kristen M. Browne, Sanjay Jain, Clive H. Wasserfall, Marda L. Jorgensen, Jeffrey M. Spraggins, Nathan H. Patterson, Mark A. Musen, Griffin M. Weber

Abstract: The National Institutes of Health's (NIH) Human Biomolecular Atlas Program (HuBMAP) aims to create a comprehensive high-resolution atlas of all the cells in the healthy human body. Multiple laboratories across the United States are collecting tissue specimens from different organs of donors who vary in sex, age, and body size. Integrating and harmonizing the data derived from these samples and 'mapping' them into a common three-dimensional (3D) space is a major challenge. The key to making this possible is a 'Common Coordinate Framework' (CCF), which provides a semantically annotated, 3D reference system for the entire body. The CCF enables contributors to HuBMAP to 'register' specimens and datasets within a common spatial reference system, and it supports a standardized way to query and 'explore' data in a spatially and semantically explicit manner. [...] This paper describes the construction and usage of a CCF for the human body and its reference implementation in HuBMAP. The CCF consists of (1) a CCF Clinical Ontology, which provides metadata about the specimen and donor (the 'who'); (2) a CCF Semantic Ontology, which describes 'what' part of the body a sample came from and details anatomical structures, cell types, and biomarkers (ASCT+B); and (3) a CCF Spatial Ontology, which indicates 'where' a tissue sample is located in a 3D coordinate system. An initial version of all three CCF ontologies has been implemented for the first HuBMAP Portal release. It was successfully used by Tissue Mapping Centers to semantically annotate and spatially register 48 kidney and spleen tissue blocks. The blocks can be queried and explored in their clinical, semantic, and spatial context via the CCF user interface in the HuBMAP Portal.

Submitted 28 July, 2020; originally announced July 2020.

Comments: 24 pages with SI, 6 figures, 5 tables

arXiv:2006.04161 [pdf, other]

An Empirical Meta-analysis of the Life Sciences (Linked?) Open Data on the Web

Authors: Maulik R. Kamdar, Mark A. Musen

Abstract: While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 publicly available biomedical linked data graphs into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.

Submitted 7 June, 2020; originally announced June 2020.

Comments: Under Review at Nature Scientific Data

arXiv:1907.02106 [pdf, other]

doi 10.1007/978-3-030-30796-7_26

Use of OWL and Semantic Web Technologies at Pinterest

Authors: Rafael S. Gonçalves, Matthew Horridge, Rui Li, Yu Liu, Mark A. Musen, Csongor I. Nyulas, Evelyn Obamos, Dhananjay Shrouty, David Temple

Abstract: Pinterest is a popular Web application that has over 250 million active users. It is a visual discovery engine for finding ideas for recipes, fashion, weddings, home decoration, and much more. In the last year, the company adopted Semantic Web technologies to create a knowledge graph that aims to represent the vast amount of content and users on Pinterest, to help both content recommendation and ads targeting. In this paper, we present the engineering of an OWL ontology---the Pinterest Taxonomy---that forms the core of Pinterest's knowledge graph, the Pinterest Taste Graph. We describe modeling choices and enhancements to WebProtégé that we used for the creation of the ontology. In two months, eight Pinterest engineers, without prior experience of OWL and WebProtégé, revamped an existing taxonomy of noisy terms into an OWL ontology. We share our experience and present the key aspects of our work that we believe will be useful for others working in this area.

Submitted 3 July, 2019; originally announced July 2019.

arXiv:1905.06480 [pdf]

doi 10.1007/978-3-319-68204-4_10

The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata that Describe Scientific Experiments

Authors: Rafael S. Gonçalves, Martin J. O'Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, John Graybeal, Mark A. Musen

Abstract: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed--the CEDAR Workbench--is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.

Submitted 15 May, 2019; originally announced May 2019.

arXiv:1903.09270 [pdf]

doi 10.1093/database/baz059

Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases

Authors: Marcos Martínez-Romero, Martin J. O'Connor, Attila L. Egyedi, Debra Willrett, Josef Hardi, John Graybeal, Mark A. Musen

Abstract: Metadata--the machine-readable descriptions of the data--are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that will help users to enter metadata quickly and accurately. In this paper we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations when authoring metadata. The novelties of our method are that it is able to combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: US-based National Center for Biotechnology Information (NCBI) BioSample and European Bioinformatics Institute (EBI) BioSamples. The results show that our approach is able to use analyses of previously entered metadata coupled with ontology-based mappings to present users with accurate recommendations when authoring metadata.

Submitted 21 March, 2019; originally announced March 2019.

arXiv:1903.08206 [pdf, other]

doi 10.1007/978-3-030-21348-0_10

Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings

Authors: Rafael S. Gonçalves, Maulik R. Kamdar, Mark A. Musen

Abstract: The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity---there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly-represented metadata is to normalize the multiple, distinct field names used in metadata (e.g., lat lon, lat and long) to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.

Submitted 16 May, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

arXiv:1903.05704 [pdf, other]

doi 10.1145/3308558.3313487

HopRank: How Semantic Structure Influences Teleportation in PageRank (A Case Study on BioPortal)

Authors: Lisette Espín-Noboa, Florian Lemmerich, Simon Walk, Markus Strohmaier, Mark A. Musen

Abstract: This paper introduces HopRank, an algorithm for modeling human navigation on semantic networks. HopRank leverages the assumption that users know or can see the whole structure of the network. Therefore, besides following links, they also follow nodes at certain distances (i.e., k-hop neighborhoods), and not at random as suggested by PageRank, which assumes only links are known or visible. We observe such preference towards k-hop neighborhoods on BioPortal, one of the leading repositories of biomedical ontologies on the Web. In general, users navigate within the vicinity of a concept. But they also "jump" to distant concepts less frequently. We fit our model on 11 ontologies using the transition matrix of clickstreams, and show that semantic structure can influence teleportation in PageRank. This suggests that users--to some extent--utilize knowledge about the underlying structure of ontologies, and leverage it to reach certain pieces of information. Our results help the development and improvement of user interfaces for ontology exploration.

Submitted 15 March, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

Comments: Published at TheWebConf 2019 (WWW'19)

arXiv:1902.11162 [pdf]

The FAIR Funder pilot programme to make it easy for funders to require and for grantees to produce FAIR Data

Authors: P. Wittenburg, H. Pergl Sustkova, A. Montesanti, S. M. Bloemers, S. H. de Waard, M. A. Musen, J. B. Graybeal, K. M. Hettne, A. Jacobsen, R. Pergl, R. W. W. Hooft, C. Staiger, C. W. G. van Gelder, S. L. Knijnenburg, A. C. van Arkel, B. Meerman, M. D. Wilkinson, S-A Sansone, P. Rocca-Serra, P. McQuilton, A. N. Gonzalez-Beltran, G. J. C. Aben, P. Henning, S. Alencar, C. Ribeiro, et al. (35 additional authors not shown)

Abstract: There is a growing acknowledgement in the scientific community of the importance of making experimental data machine findable, accessible, interoperable, and reusable (FAIR). Recognizing that high quality metadata are essential to make datasets FAIR, members of the GO FAIR Initiative and the Research Data Alliance (RDA) have initiated a series of workshops to encourage the creation of Metadata for Machines (M4M), enabling any self-identified stakeholder to define and promote the reuse of standardized, comprehensive machine-actionable metadata. The funders of scientific research recognize that they have an important role to play in ensuring that experimental results are FAIR, and that high quality metadata and careful planning for FAIR data stewardship are central to these goals. We describe the outcome of a recent M4M workshop that has led to a pilot programme involving two national science funders, the Health Research Board of Ireland (HRB) and the Netherlands Organisation for Health Research and Development (ZonMW). These funding organizations will explore new technologies to define at the time that a request for proposals is issued the minimal set of machine-actionable metadata that they would like investigators to use to annotate their datasets, to enable investigators to create such metadata to help make their data FAIR, and to develop data-stewardship plans that ensure that experimental data will be managed appropriately abiding by the FAIR principles. The FAIR Funders design envisions a data-management workflow having seven essential stages, where solution providers are openly invited to participate. The initial pilot programme will launch using existing computer-based tools of those who attended the M4M Workshop.

Submitted 6 March, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

Comments: This is a pre-print of the FAIR Funders pilot, an outcome of the first Metadata for Machines workshop, see: https://www.go-fair.org/resources/go-fair-workshop-series/metadata-for-machines-workshops/. Corresponding author: E. A Schultes, ORCID 0000-0001-8888-635X

arXiv:1902.08251 [pdf]

doi 10.1145/3308560.3317707

WebProtégé: A Cloud-Based Ontology Editor

Authors: Matthew Horridge, Rafael S. Gonçalves, Csongor I. Nyulas, Tania Tudorache, Mark A. Musen

Abstract: We present WebProtégé, a tool to develop ontologies represented in the Web Ontology Language (OWL). WebProtégé is a cloud-based application that allows users to collaboratively edit OWL ontologies, and it is available for use at https://webprotege.stanford.edu. WebProtégé currently hosts more than 68,000 OWL ontology projects and has over 50,000 user accounts. In this paper, we detail the main new features of the latest version of WebProtégé.

Submitted 5 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

arXiv:1808.06907 [pdf]

doi 10.1038/sdata.2019.21

The variable quality of metadata about biological samples used in biomedical experiments

Authors: Rafael S. Gonçalves, Mark A. Musen

Abstract: We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample---a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples---a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.

Submitted 18 January, 2019; v1 submitted 17 August, 2018; originally announced August 2018.

Comments: arXiv admin note: text overlap with arXiv:1708.01286

arXiv:1708.01286 [pdf]

Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies

Authors: Rafael S. Gonçalves, Martin J. O'Connor, Marcos Martínez-Romero, John Graybeal, Mark A. Musen

Abstract: The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample--a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biotechnology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample metadata field names and their values are not standardized or controlled--15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The aberrancies in the metadata are likely to impede search and secondary use of the associated datasets.

Submitted 3 August, 2017; originally announced August 2017.

arXiv:1611.05973 [pdf]

doi 10.1186/s13326-017-0128-y

NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

Authors: Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen

Abstract: Biomedical researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability. However, the number, variety and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) released the Ontology Recommender, which is a service that receives a biomedical text corpus or a list of keywords and suggests ontologies appropriate for referencing the indicated terms. We developed a new version of the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new recommendation approach that evaluates the relevance of an ontology to biomedical text data according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. Our evaluation shows that the enhanced recommender provides higher quality suggestions than the original approach, providing better coverage of the input data, more detailed information about their concepts, increased specialization for the domain of the input data, and greater acceptance and use in the community. In addition, it provides users with more explanatory information, along with suggestions of not only individual ontologies but also groups of ontologies. It also can be customized to fit the needs of different scenarios. Ontology Recommender 2.0 combines the strengths of its predecessor with a range of adjustments and new features that improve its reliability and usefulness. Ontology Recommender 2.0 recommends over 500 biomedical ontologies from the NCBO BioPortal platform, where it is openly available.

Submitted 25 May, 2017; v1 submitted 17 November, 2016; originally announced November 2016.

Comments: 29 pages, 8 figures, 11 tables

ACM Class: I.2.4

Journal ref: Journal of Biomedical Semantics 8 (2017) 1-22

arXiv:1407.2002 [pdf, ps, other]

doi 10.1016/j.jbi.2014.06.004

Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains

Authors: Simon Walk, Philipp Singer, Markus Strohmaier, Tania Tudorache, Mark A. Musen, Natalya F. Noy

Abstract: Biomedical taxonomies, thesauri and ontologies in the form of the International Classification of Diseases (ICD) as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the ICD, which is currently under active development by the WHO, contains nearly 50,000 classes representing a vast variety of different diseases and causes of death. This evolution in terms of size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding how these stakeholders collaborate will enable us to improve editing environments that support such collaborations. We uncover how large ontology-engineering projects, such as the ICD in its 11th revision, unfold by analyzing usage logs of five different biomedical ontology-engineering projects of varying sizes and scopes using Markov chains. We discover intriguing interaction patterns (e.g., which properties users subsequently change) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. From our analysis, we identify commonalities and differences between different projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.

Submitted 29 February, 2016; v1 submitted 8 July, 2014; originally announced July 2014.

Comments: Published in the Journal of Biomedical Informatics

arXiv:1303.1482 [pdf]

Graph-Grammar Assistance for Automated Generation of Influence Diagrams

Authors: John W. Egar, Mark A. Musen

Abstract: One of the most difficult aspects of modeling complex dilemmas in decision-analytic terms is composing a diagram of relevance relations from a set of domain concepts. Decision models in domains such as medicine, however, exhibit certain prototypical patterns that can guide the modeling process. Medical concepts can be classified according to semantic types that have characteristic positions and typical roles in an influence-diagram model. We have developed a graph-grammar production system that uses such inherent interrelationships among medical terms to facilitate the modeling of medical decisions.

Submitted 6 March, 2013; originally announced March 2013.

Comments: Appears in Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (UAI1993)

Report number: UAI-P-1993-PG-235-242
