-
New bounds on heavy axions with an X-ray free electron laser
Authors:
Jack W. D. Halliday,
Giacomo Marocco,
Konstantin A. Beyer,
Charles Heaton,
Motoaki Nakatsutsumi,
Thomas R. Preston,
Charles D. Arrowsmith,
Carsten Baehtz,
Sebastian Goede,
Oliver Humphries,
Alejandro Laso Garcia,
Richard Plackett,
Pontus Svensson,
Georgios Vacalis,
Justin Wark,
Daniel Wood,
Ulf Zastrau,
Robert Bingham,
Ian Shipsey,
Subir Sarkar,
Gianluca Gregori
Abstract:
We present new exclusion bounds obtained at the European X-ray Free Electron Laser facility (EuXFEL) on axion-like particles (ALPs) in the mass range $10^{-3}$ eV $< m_a < 10^{4}$ eV. Our experiment exploits the Primakoff effect, by which photons can, in the presence of a strong external electric field, decay into axions, which then convert back into photons after passing through an opaque wall. While similar searches have been performed previously at a third-generation synchrotron [1], our work demonstrates improved sensitivity, exploiting the higher brightness of X-rays at EuXFEL.
Submitted 6 July, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents
Authors:
Nan Zhang,
Connor Heaton,
Sean Timothy Okonsky,
Prasenjit Mitra,
Hilal Ezgi Toraman
Abstract:
Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits a strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprise a significant part of the academic community and are the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.
Submitted 23 March, 2024;
originally announced March 2024.
-
Leveraging External Knowledge Resources to Enable Domain-Specific Comprehension
Authors:
Saptarshi Sengupta,
Connor Heaton,
Prasenjit Mitra,
Soumalya Sarkar
Abstract:
Machine Reading Comprehension (MRC) has been a long-standing problem in NLP and, with the recent introduction of the BERT family of transformer-based language models, it has come a long way toward being solved. Unfortunately, when BERT variants trained on general text corpora are applied to domain-specific text, their performance inevitably degrades on account of the domain shift, i.e., the genre/subject matter discrepancy between the training and downstream application data. Knowledge graphs act as reservoirs for either open or closed domain information, and prior studies have shown that they can be used to improve the performance of general-purpose transformers in domain-specific applications. Building on existing work, we introduce a method using Multi-Layer Perceptrons (MLPs) for aligning and integrating embeddings extracted from knowledge graphs with the embedding spaces of pre-trained language models (LMs). We fuse the aligned embeddings with the open-domain LMs BERT and RoBERTa, and fine-tune them for two MRC tasks, namely span detection (COVID-QA) and multiple-choice questions (PubMedQA). On the COVID-QA dataset, we see that our approach allows these models to perform similarly to their domain-specific counterparts, Bio/Sci-BERT, as evidenced by the Exact Match (EM) metric. With regard to PubMedQA, we observe an overall improvement in accuracy, while the F1 stays relatively the same over the domain-specific models.
Submitted 15 January, 2024;
originally announced January 2024.
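The Exact Match (EM) metric mentioned in the abstract above can be illustrated with a minimal sketch. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the paper's own evaluation script is not shown in this listing and may differ. The function names `normalize`, `exact_match`, and `token_f1` are hypothetical, and the token-level F1 here is the span-overlap score commonly reported alongside EM for extractive QA, not the classification F1 reported for PubMedQA.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization (an assumption, not the paper's script)."""
    text = text.lower()
    # Drop ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Drop English articles, then collapse runs of whitespace.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match the gold span after normalization, so a prediction with extra tokens scores 0 on EM while still earning partial credit on token-level F1.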
-
Quality > Quantity: Synthetic Corpora from Foundation Models for Closed-Domain Extractive Question Answering
Authors:
Saptarshi Sengupta,
Connor Heaton,
Shreya Ghosh,
Preslav Nakov,
Prasenjit Mitra
Abstract:
Domain adaptation, the process of training a model in one domain and applying it to another, has been extensively explored in machine learning. While training a domain-specific foundation model (FM) from scratch is an option, recent methods have focused on adapting pre-trained FMs for domain-specific tasks. However, our experiments reveal that neither approach consistently achieves state-of-the-art (SOTA) results in the target domain. In this work, we study extractive question answering within closed domains and introduce the concept of targeted pre-training. This involves determining and generating relevant data to further pre-train our models, as opposed to the conventional philosophy of utilizing domain-specific FMs trained on a wide range of data. Our proposed framework uses Galactica to generate synthetic, "targeted" corpora that align with specific writing styles and topics, such as research papers and radiology reports. This process can be viewed as a form of knowledge distillation. We apply our method to two biomedical extractive question answering datasets, COVID-QA and RadQA, achieving a new benchmark on the former and demonstrating overall improvements on the latter. Code available at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.
Submitted 25 October, 2023;
originally announced October 2023.
-
Expected Performance of the EUSO-SPB2 Fluorescence Telescope
Authors:
G. Filippatos,
M. Battisti,
M. Bertaina,
F. Bisconti,
J. Esser,
C. Heaton,
G. Osteria,
F. Sarazin,
L. Wiencke
Abstract:
The Extreme Universe Space Observatory Super Pressure Balloon 2 (EUSO-SPB2) is under development, and will prototype instrumentation for future satellite-based missions, including the Probe of Extreme Multi-Messenger Astrophysics (POEMMA). EUSO-SPB2 will consist of two telescopes. The first is a Cherenkov telescope (CT) being developed to identify and estimate the background sources for future below-the-limb very high energy (E>10 PeV) astrophysical neutrino observations, as well as above-the-limb cosmic ray induced signals (E>1 PeV). The second is a fluorescence telescope (FT) being developed for detection of Ultra High Energy Cosmic Rays (UHECRs). In preparation for the expected launch in 2023, extensive simulations tuned by preliminary laboratory measurements have been performed to understand the FT capabilities. The energy threshold has been estimated at $10^{18.2}$ eV, which results in a maximum detection rate at $10^{18.6}$ eV when taking into account the shape of the UHECR spectrum. In addition, onboard software has been developed based on the simulations as well as experience with previous EUSO missions. This includes a level 1 trigger to be run on the computationally limited flight hardware, as well as a deep learning based prioritization algorithm to accommodate the balloon's telemetry budget. These techniques could also be used later for future space-based missions.
Submitted 14 December, 2021;
originally announced December 2021.
-
Learning To Describe Player Form in The MLB
Authors:
Connor Heaton,
Prasenjit Mitra
Abstract:
Major League Baseball (MLB) has a storied history of using statistics to better understand and discuss the game of baseball, with an entire discipline of statistics dedicated to the craft, known as sabermetrics. At their core, all sabermetrics seek to quantify some aspect of the game, often a specific aspect of a player's skill set, such as a batter's ability to drive in runs (RBI) or a pitcher's ability to keep batters from reaching base (WHIP). While useful, such statistics are fundamentally limited by the fact that they are derived from an account of what happened on the field, not how it happened. As a first step towards alleviating this shortcoming, we present a novel, contrastive learning-based framework for describing player form in the MLB. We use form to refer to the way in which a player has impacted the course of play in their recent appearances. Concretely, a player's form is described by a 72-dimensional vector. By comparing clusters of players resulting from our form representations and those resulting from traditional sabermetrics, we demonstrate that our form representations contain information about how players impact the course of play that is not present in traditional, publicly available statistics. We believe these embeddings could be utilized to predict both in-game and game-level events, such as the result of an at-bat or the winner of a game.
Submitted 11 September, 2021;
originally announced September 2021.
-
Language Models as Emotional Classifiers for Textual Conversations
Authors:
Connor T. Heaton,
David M. Schwartz
Abstract:
Emotions play a critical role in our everyday lives by altering how we perceive, process, and respond to our environment. Affective computing aims to instill in computers the ability to detect and act on the emotions of human actors. A core aspect of any affective computing system is the classification of a user's emotion. In this study, we present a novel methodology for classifying emotion in a conversation. The backbone of our proposed methodology is a pre-trained Language Model (LM), which is supplemented by a Graph Convolutional Network (GCN) that propagates information over the predicate-argument structure identified in an utterance. We apply our proposed methodology to the IEMOCAP and Friends data sets, achieving state-of-the-art performance on the former and a higher accuracy on certain emotional labels on the latter. Furthermore, we examine the role context plays in our methodology by altering how much of the preceding conversation the model has access to when making a classification.
Submitted 27 August, 2020;
originally announced August 2020.
-
Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19
Authors:
Connor T. Heaton,
Prasenjit Mitra
Abstract:
The novel coronavirus disease 2019 (COVID-19) began in Wuhan, China in late 2019 and to date has infected over 14M people worldwide, resulting in over 750,000 deaths. On March 10, 2020, the World Health Organization (WHO) declared the outbreak a global pandemic. Many academics and researchers, not restricted to the medical domain, began publishing papers describing new discoveries. However, with the large influx of publications, it was hard for these individuals to sift through the large amount of data and make sense of the findings. The White House and a group of industry research labs, led by the Allen Institute for AI, aggregated over 200,000 journal articles related to a variety of coronaviruses and tasked the community with answering key questions related to the corpus, releasing the dataset as CORD-19. The information retrieval (IR) community repurposed the journal articles within CORD-19 to more closely resemble a classic TREC-style competition, dubbed TREC-COVID, with human annotators providing relevancy judgements at the end of each round of competition. Seeing the related endeavors, we set out to repurpose the relevancy annotations for TREC-COVID tasks to identify journal articles in CORD-19 which are relevant to the key questions posed by CORD-19. A BioBERT model trained on this repurposed dataset prescribes relevancy annotations for CORD-19 tasks that have an overall agreement of 0.4430 with majority human annotations in terms of Cohen's kappa. We present the methodology used to construct the new dataset and describe the decision process used throughout.
Submitted 27 August, 2020;
originally announced August 2020.
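The agreement figure quoted in the abstract above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement computed from each annotator's label frequencies. The sketch below is a minimal plain-Python illustration of that formula; the function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # labelled independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): 1 = relevant, 0 = not relevant.
model_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(model_labels, human_labels)  # observed 0.75, chance 0.5
```

A kappa of 0 indicates agreement no better than chance and 1 indicates perfect agreement, so a value such as 0.4430 sits between the two.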