-
The Semantic Scholar Open Data Platform
Authors:
Rodney Kinney,
Chloe Anastasiades,
Russell Authur,
Iz Beltagy,
Jonathan Bragg,
Alexandra Buraczynski,
Isabel Cachola,
Stefan Candra,
Yoganand Chandrasekhar,
Arman Cohan,
Miles Crawford,
Doug Downey,
Jason Dunkelberger,
Oren Etzioni,
Rob Evans,
Sergey Feldman,
Joseph Gorney,
David Graham,
Fangzhou Hu,
Regan Huff,
Daniel King,
Sebastian Kohlmeier,
Bailey Kuehl,
Michael Langan,
Daniel Lin, et al. (23 additional authors not shown)
Abstract:
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
Submitted 24 January, 2023;
originally announced January 2023.
-
Using Machine Learning to Fuse Verbal Autopsy Narratives and Binary Features in the Analysis of Deaths from Hyperglycaemia
Authors:
Thokozile Manaka,
Terence Van Zyl,
Alisha N Wade,
Deepak Kar
Abstract:
Low- and middle-income countries face challenges arising from a lack of data on cause of death (COD), which can limit decisions on population health and disease management. A verbal autopsy (VA) can provide information about a COD in areas without robust death registration systems. A VA consists of structured data, combining numeric and binary features, and unstructured data as part of an open-ended narrative text. This study assesses the performance of various machine learning approaches when analyzing both the structured and unstructured components of the VA report. The algorithms were trained and tested via cross-validation in three settings: binary features, text features, and a combination of binary and text features derived from VA reports from rural South Africa. The results indicate that narrative text features contain valuable information for determining COD and that a combination of binary and text features improves the automated COD classification task.
Keywords: Diabetes Mellitus, Verbal Autopsy, Cause of Death, Machine Learning, Natural Language Processing
Submitted 26 April, 2022;
originally announced April 2022.
-
Cracking predictions of lithium-ion battery electrodes by X-ray computed tomography and modelling
Authors:
Adam M. Boyce,
Emilio Martínez-Pañeda,
Aaron Wade,
Ye Shui Zhang,
Josh J. Bailey,
Thomas M. M. Heenan,
Dan J. L. Brett,
Paul R. Shearing
Abstract:
Fracture of lithium-ion battery electrodes is found to contribute to capacity fade and reduce the lifespan of a battery. Traditional fracture models for batteries are restricted to consideration of a single, idealised particle; here, advanced X-ray computed tomography (CT) imaging, an electro-chemo-mechanical model and a phase field fracture framework are combined to predict the void-driven fracture in the electrode particles of a realistic battery electrode microstructure. The electrode is shown to exhibit a highly heterogeneous electrochemical and fracture response that depends on the particle size and distance from the separator/current collector. The model enables prediction of increased cracking due to enlarged cycling voltage windows, cracking susceptibility as a function of electrode thickness, and damage sensitivity to discharge rate. This framework provides a platform that facilitates a deeper understanding of electrode fracture and enables the design of next-generation electrodes with higher capacities and improved degradation characteristics.
Submitted 19 February, 2022;
originally announced February 2022.
-
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Authors:
Xiangru Tang,
Arjun Nair,
Borui Wang,
Bingyao Wang,
Jai Desai,
Aaron Wade,
Haoran Li,
Asli Celikyilmaz,
Yashar Mehdad,
Dragomir Radev
Abstract:
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained models, substantial amounts of hallucinated content are found during human evaluation. Pre-trained models are most commonly fine-tuned with cross-entropy loss for text summarization, which may not be an optimal strategy. In this work, we provide a typology of factual errors with annotation data to highlight the types of errors and move away from a binary understanding of factuality. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called ConFiT. Based on our linguistically-informed typology of errors, we design different modular objectives that each target a specific type. Specifically, we utilize hard negative samples with errors to reduce the generation of factual inconsistencies. In order to capture the key information between speakers, we also design a dialogue-specific loss. Using human evaluation and automatic faithfulness metrics, we show that our model significantly reduces all kinds of factual errors on the SAMSum dialogue summarization corpus. Moreover, our model generalizes to the AMI meeting summarization corpus, and it produces significantly higher scores than most of the baselines on both datasets regarding word-overlap metrics.
Submitted 9 July, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Approximating the Manifold Structure of Attributed Incentive Salience from Large Scale Behavioural Data. A Representation Learning Approach Based on Artificial Neural Networks
Authors:
Valerio Bonometti,
Mathieu J. Ruiz,
Anders Drachen,
Alex Wade
Abstract:
Incentive salience attribution can be understood as a psychobiological mechanism ascribing relevance to potentially rewarding objects and actions. Despite being an important component of the motivational process guiding our everyday behaviour, its study in naturalistic contexts is not straightforward. Here we propose a methodology based on artificial neural networks (ANNs) for approximating latent states produced by this process in situations where large volumes of behavioural data are available but no experimental control is possible. Leveraging knowledge derived from theoretical and computational accounts of incentive salience attribution, we designed an ANN for estimating duration and intensity of future interactions between individuals and a series of video games in a large-scale ($N > 3 \times 10^6$) longitudinal dataset. We found video games to be the ideal context for developing such methodology due to their reliance on reward mechanics and their ability to provide ecologically robust behavioural measures at scale. When compared to competing approaches, our methodology produces representations that are better suited for predicting the intensity of future behaviour and approximating some functional properties of attributed incentive salience. We discuss our findings with reference to the adopted theoretical and computational frameworks and suggest how our methodology could be an initial step for estimating attributed incentive salience in large-scale behavioural studies.
Submitted 26 May, 2022; v1 submitted 3 August, 2021;
originally announced August 2021.
-
DPER: Efficient Parameter Estimation for Randomly Missing Data
Authors:
Thu Nguyen,
Khoi Minh Nguyen-Duy,
Duy Ho Minh Nguyen,
Binh T. Nguyen,
Bruce Alan Wade
Abstract:
The missing data problem has been broadly studied in the last few decades and has various applications in different areas such as statistics or bioinformatics. Even though many methods have been developed to tackle this challenge, most of them are imputation techniques that require multiple iterations through the data before yielding convergence. In addition, such approaches may introduce extra bias and noise to the estimated parameters. In this work, we propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set under some mild assumptions. As the computation is direct without any imputation, our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods while maintaining superior estimation performance. We validate these claims by empirical results on various data sets of different sizes and release all code in a GitHub repository to contribute to the research community related to this problem.
Submitted 6 June, 2021;
originally announced June 2021.
-
EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data
Authors:
Thu Nguyen,
Duy H. M. Nguyen,
Huy Nguyen,
Binh T. Nguyen,
Bruce A. Wade
Abstract:
The problem of monotone missing data has been broadly studied during the last two decades and has many applications in different fields such as bioinformatics or statistics. Commonly used imputation techniques require multiple iterations through the data before yielding convergence. Moreover, those approaches may introduce extra noise and bias to the subsequent modeling. In this work, we derive exact formulas and propose a novel algorithm, namely EPEM, to compute the maximum likelihood estimators (MLEs) of a multiple-class, monotone missing dataset when the covariance matrices of all categories are assumed to be equal. We then illustrate an application of our proposed methods in Linear Discriminant Analysis (LDA). As the computation is exact, our EPEM algorithm does not require multiple iterations through the data, unlike imputation approaches, thus promising to be much less time-consuming than other methods. This effectiveness was validated by empirical results in which EPEM reduced the error rates significantly and required a short computation time compared to several imputation-based approaches. We also release all code and data of our experiments in one GitHub repository to contribute to the research community related to this problem.
Submitted 23 September, 2020;
originally announced September 2020.
-
From Theory to Behaviour: Towards a General Model of Engagement
Authors:
Valerio Bonometti,
Charles Ringer,
Mathieu Ruiz,
Alex Wade,
Anders Drachen
Abstract:
Engagement is a fuzzy concept. In the present work we operationalize engagement mechanistically by linking it directly to human behaviour and show that the construct of engagement can be used for shaping and interpreting data-driven methods. First, we outline a formal framework for engagement modelling. Second, we expand on our previous work on theory-inspired data-driven approaches to better model the engagement process by proposing a new modelling technique, the Melchoir Model. Third, we illustrate how, through model comparison and inspection, we can link machine-learned models and underlying theoretical frameworks. Finally, we discuss our results in light of a theory-driven hypothesis and highlight potential applications of our work in industry.
Submitted 27 April, 2020;
originally announced April 2020.
-
CORD-19: The COVID-19 Open Research Dataset
Authors:
Lucy Lu Wang,
Kyle Lo,
Yoganand Chandrasekhar,
Russell Reas,
Jiangjiang Yang,
Doug Burdick,
Darrin Eide,
Kathryn Funk,
Yannis Katsis,
Rodney Kinney,
Yunyao Li,
Ziyang Liu,
William Merrill,
Paul Mooney,
Dewey Murdick,
Devvret Rishi,
Jerry Sheehan,
Zhihong Shen,
Brandon Stilson,
Alex Wade,
Kuansan Wang,
Nancy Xin Ru Wang,
Chris Wilhelm,
Boya Xie,
Douglas Raymond, et al. (3 additional authors not shown)
Abstract:
The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.
Submitted 10 July, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
-
Modelling Early User-Game Interactions for Joint Estimation of Survival Time and Churn Probability
Authors:
Valerio Bonometti,
Charles Ringer,
Mark Hall,
Alex R. Wade,
Anders Drachen
Abstract:
Data-driven approaches which aim to identify and predict player engagement are becoming increasingly popular in games industry contexts. This is due to the growing practice of tracking and storing large volumes of in-game telemetries coupled with a desire to tailor the gaming experience to the end-user's needs. These approaches are particularly useful not just for companies adopting Game-as-a-Service (GaaS) models (e.g. for re-engagement strategies) but also for those working under persistent content-delivery regimes (e.g. for better audience targeting). A major challenge for the latter is to build engagement models of the user which are data-efficient, holistic and can generalize across multiple game titles and genres with minimal adjustments. This work leverages a theoretical framework rooted in engagement and behavioural science research for building a model able to estimate engagement-related behaviours employing only a minimal set of game-agnostic metrics. Through a series of experiments we show how, by modelling early user-game interactions, this approach can make joint estimates of long-term survival time and churn probability across several single-player games in a range of genres. The model proposed is very suitable for industry applications since it relies on a minimal set of metrics and observations, scales well with the number of users and is explicitly designed to work across a diverse range of titles.
Submitted 21 August, 2019; v1 submitted 27 May, 2019;
originally announced May 2019.