Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation
Authors:
Oluwamayowa O. Amusat,
Harshad Hegde,
Christopher J. Mungall,
Anna Giannakou,
Neil P. Byers,
Dan Gunter,
Kjiersten Fagnan,
Lavanya Ramakrishnan
Abstract:
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining datasets.
In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. We detail how these approaches are applied to ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
Submitted 8 November, 2023;
originally announced November 2023.
Perspectives for self-driving labs in synthetic biology
Authors:
Hector Garcia Martin,
Tijana Radivojevic,
Jeremy Zucker,
Kristofer Bouchard,
Jess Sustarich,
Sean Peisert,
Dan Arnold,
Nathan Hillson,
Gyorgy Babnigg,
Jose Manuel Marti,
Christopher J. Mungall,
Gregg T. Beckham,
Lucas Waldburger,
James Carothers,
ShivShankar Sundaram,
Deb Agarwal,
Blake A. Simmons,
Tyler Backman,
Deepanwita Banerjee,
Deepti Tanjore,
Lavanya Ramakrishnan,
Anup Singh
Abstract:
Self-driving labs (SDLs) combine fully automated experiments with artificial intelligence (AI) that decides the next set of experiments. Taken to their ultimate expression, SDLs could usher in a new paradigm of scientific research, where the world is probed, interpreted, and explained by machines for human benefit. While there are functioning SDLs in the fields of chemistry and materials science, we contend that synthetic biology provides a unique opportunity since the genome provides a single target for affecting the incredibly wide repertoire of biological cell behavior. However, the level of investment required for the creation of biological SDLs is only warranted if directed towards solving difficult and enabling biological questions. Here, we discuss challenges and opportunities in creating SDLs for synthetic biology.
Submitted 1 November, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.