-
Experimenting with Large Language Models and vector embeddings in NASA SciX
Authors:
Sergi Blanco-Cuaresma,
Ioana Ciucă,
Alberto Accomazzi,
Michael J. Kurtz,
Edwin A. Henneken,
Kelly E. Lockhart,
Felix Grezes,
Thomas Allen,
Golnaz Shapurian,
Carolyn S. Grant,
Donna M. Thompson,
Timothy W. Hostetler,
Matthew R. Templeton,
Shinyi Chen,
Jennifer Koch,
Taylor Jacovich,
Daniel Chivvis,
Fernanda de Macedo Alves,
Jean-Claude Paquin,
Jennifer Bartlett,
Mugdha Polimera,
Stephanie Jarmak
Abstract:
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think outside the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
Submitted 21 December, 2023;
originally announced December 2023.
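The retrieval step described in this abstract (semantic vectors plus a prompt built from contextual chunks) can be sketched as follows. This is only an illustrative sketch, not the NASA SciX implementation: the function names, the prompt wording, and the three-dimensional toy vectors are all assumptions, and a real system would obtain its vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Assemble a prompt that asks the model to answer from the given context."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical index: (chunk text, toy embedding). Real embeddings would
# come from a model and have hundreds of dimensions.
index = [
    ("Chunk about stellar populations.", [0.9, 0.1, 0.0]),
    ("Chunk about web accessibility.", [0.0, 0.2, 0.9]),
    ("Chunk about star clusters.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # toy embedding of the question

selected = top_k_chunks(query_vec, index)
prompt = build_prompt("What is known about star clusters?", selected)
print(prompt)
```

Only the retrieved chunks reach the prompt, which is what gives the language model context to ground its answer in.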
-
Improving astroBERT using Semantic Textual Similarity
Authors:
Felix Grezes,
Thomas Allen,
Sergi Blanco-Cuaresma,
Alberto Accomazzi,
Michael J. Kurtz,
Golnaz Shapurian,
Edwin Henneken,
Carolyn S. Grant,
Donna M. Thompson,
Timothy W. Hostetler,
Matthew R. Templeton,
Kelly E. Lockhart,
Shinyi Chen,
Jennifer Koch,
Taylor Jacovich,
Pavlos Protopapas
Abstract:
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we:
- announce the first public release of the astroBERT language model;
- show how astroBERT improves over existing public language models on astrophysics specific tasks;
- and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
Submitted 29 November, 2022;
originally announced December 2022.
-
Web accessibility trends and implementation in dynamic web applications
Authors:
Timothy W. Hostetler,
Shinyi Chen,
Sergi Blanco-Cuaresma,
Alberto Accomazzi,
Michael J. Kurtz,
Carolyn S. Grant,
Edwin Henneken,
Donna M. Thompson,
Roman Chyla,
Golnaz Shapurian,
Matthew R. Templeton,
Kelly E. Lockhart,
Nemanja Martinovic,
Stephen McDonald,
Felix Grezes
Abstract:
The NASA Astrophysics Data System (ADS), a critical research service for the astrophysics community, strives to provide the most accessible and inclusive environment for the discovery and exploration of the astronomical literature. Part of this goal involves creating a digital platform that can accommodate everybody, including those with disabilities that would benefit from alternative ways to present the information provided by the website. NASA ADS follows the official Web Content Accessibility Guidelines (WCAG) standard for ensuring accessibility of all its applications, striving to exceed this standard where possible. Through the use of both internal audits and external expert review based on these guidelines, we have identified many areas for improving accessibility in our current web application, and have implemented a number of updates to the UI as a result of this. We present an overview of some current web accessibility trends, discuss our experience incorporating these trends in our web application, and discuss the lessons learned and recommendations for future projects.
Submitted 1 February, 2022;
originally announced February 2022.
-
Building astroBERT, a language model for Astronomy & Astrophysics
Authors:
Felix Grezes,
Sergi Blanco-Cuaresma,
Alberto Accomazzi,
Michael J. Kurtz,
Golnaz Shapurian,
Edwin Henneken,
Carolyn S. Grant,
Donna M. Thompson,
Roman Chyla,
Stephen McDonald,
Timothy W. Hostetler,
Matthew R. Templeton,
Kelly E. Lockhart,
Nemanja Martinovic,
Shinyi Chen,
Chris Tanner,
Pavlos Protopapas
Abstract:
The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
Submitted 1 December, 2021;
originally announced December 2021.
-
Agile methodologies in teams with highly creative and autonomous members
Authors:
Sergi Blanco-Cuaresma,
Alberto Accomazzi,
Michael J. Kurtz,
Edwin Henneken,
Carolyn S. Grant,
Donna M. Thompson,
Roman Chyla,
Stephen McDonald,
Golnaz Shapurian,
Timothy W. Hostetler,
Matthew R. Templeton,
Kelly E. Lockhart,
Kris Bukovi
Abstract:
The Agile manifesto encourages us to value individuals and interactions over processes and tools, while Scrum, the most adopted Agile development methodology, is essentially based on roles, events, artifacts, and the rules that bind them together (i.e., processes). Moreover, it is generally proclaimed that whenever a Scrum project does not succeed, the reason is that Scrum was not implemented correctly and not that Scrum may have its own flaws. This grants irrefutability to the methodology, discouraging deviations to fit the actual needs and peculiarities of the developers. In particular, the members of the NASA ADS team are highly creative and autonomous individuals whose motivation can be affected if their freedom is too strongly constrained. We present our experience following Agile principles, reusing certain Scrum elements and seeking the satisfaction of the team members, while rapidly reacting/keeping the project in line with our stakeholders' expectations.
Submitted 10 September, 2020;
originally announced September 2020.
-
SPISEA: A Python-Based Simple Stellar Population Synthesis Code for Star Clusters
Authors:
M. W. Hosek Jr,
J. R. Lu,
C. Y. Lam,
A. K. Gautam,
K. E. Lockhart,
D. Kim,
S. Jia
Abstract:
We present SPISEA (Stellar Population Interface for Stellar Evolution and Atmospheres), an open-source Python package that simulates simple stellar populations. The strength of SPISEA is its modular interface which offers the user control of 13 input properties including (but not limited to) the Initial Mass Function, stellar multiplicity, extinction law, and the metallicity-dependent stellar evolution and atmosphere model grids used. The user also has control over the Initial-Final Mass Relation in order to produce compact stellar remnants (black holes, neutron stars, and white dwarfs). We demonstrate several outputs produced by the code, including color-magnitude diagrams, HR-diagrams, luminosity functions, and mass functions. SPISEA is object-oriented and extensible, and we welcome contributions from the community. The code and documentation are available on GitHub and ReadtheDocs, respectively.
Submitted 10 July, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Referencing Sources of Molecular Spectroscopic Data in the Era of Data Science: Application to the HITRAN and AMBDAS Databases
Authors:
Frances M. Skinner,
Iouli E. Gordon,
Christian Hill,
Robert J. Hargreaves,
Kelly E. Lockhart,
Laurence S. Rothman
Abstract:
The application described has been designed to create bibliographic entries in large databases with diverse sources automatically, which reduces both the frequency of mistakes and the workload for the administrators. This new system uniquely identifies each reference from its digital object identifier (DOI) and retrieves the corresponding bibliographic information from any of several online services, including the SAO/NASA Astrophysics Data System (ADS) and CrossRef APIs. Once the information is parsed into a relational database, the software is able to produce bibliographies in any of several formats, including HTML and BibTeX, for use on websites or printed articles. The application is provided free-of-charge for general use by any scientific database. The power of this application is demonstrated when used to populate reference data for the HITRAN and AMBDAS databases as test cases. HITRAN contains data that is provided by researchers and collaborators throughout the spectroscopic community. These contributors are credited for their contributions through the bibliography produced alongside the data returned by an online search in HITRAN. Prior to the work presented here, HITRAN and AMBDAS created these bibliographies manually, which is a tedious, time-consuming and error-prone process. The complete code for the new referencing system can be found at https://github.com/hitranonline/refs.
Submitted 15 May, 2020;
originally announced May 2020.
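The DOI-keyed workflow in this abstract (identify each reference by its DOI, store it once, then render it in a chosen format) can be illustrated with a small sketch. It is not taken from the hitranonline/refs code: the record layout, the function names, and the placeholder DOI and metadata are all hypothetical, and the lookup against ADS or CrossRef is deliberately left out.

```python
def add_reference(store, record):
    """Store a record keyed by DOI so each reference appears only once."""
    store.setdefault(record["doi"], record)
    return store

def to_bibtex(record):
    """Render one bibliographic record as a BibTeX @article entry."""
    # Derive a citation key from the DOI (an assumed convention).
    key = record["doi"].replace("/", "_").replace(".", "_")
    fields = [
        ("author", " and ".join(record["authors"])),
        ("title", record["title"]),
        ("journal", record["journal"]),
        ("year", str(record["year"])),
        ("doi", record["doi"]),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@article{{{key},\n{body}\n}}"

# Placeholder record: every value here is invented for illustration.
example = {
    "doi": "10.0000/example.123",
    "authors": ["A. Author", "B. Author"],
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "year": 2020,
}

references = {}
add_reference(references, example)
add_reference(references, example)  # duplicate DOI is ignored

entry = to_bibtex(references["10.0000/example.123"])
print(entry)
```

Keying the store on the DOI is what prevents duplicate entries when the same source is contributed more than once.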
-
Fundamentals of effective cloud management for the new NASA Astrophysics Data System
Authors:
Sergi Blanco-Cuaresma,
Alberto Accomazzi,
Michael J. Kurtz,
Edwin Henneken,
Carolyn S. Grant,
Donna M. Thompson,
Roman Chyla,
Stephen McDonald,
Golnaz Shapurian,
Timothy W. Hostetler,
Matthew R. Templeton,
Kelly E. Lockhart,
Kris Bukovi,
Nathan Rapport
Abstract:
The new NASA Astrophysics Data System (ADS) is designed with a service-oriented architecture (SOA) that consists of multiple customized Apache Solr search engine instances plus a collection of microservices, containerized using Docker, and deployed in Amazon Web Services (AWS). For complex systems, like the ADS, this loosely coupled architecture can lead to a more scalable, reliable and resilient system if some fundamental questions are addressed. After having experimented with different AWS environments and deployment methods, we decided in December 2017 to go with Kubernetes as our container orchestration system. Defining the best strategy to properly set up Kubernetes has proven to be challenging: automatic scaling services and load balancing traffic can lead to errors whose origin is difficult to identify, monitoring and logging the activity that happens across multiple layers for a single request needs to be carefully addressed, and the best workflow for a Continuous Integration and Delivery (CI/CD) system is not self-evident. We present here how we tackle these challenges and our plans for the future.
Submitted 16 January, 2019;
originally announced January 2019.
-
astroquery: An Astronomical Web-Querying Package in Python
Authors:
Adam Ginsburg,
Brigitta M. Sipőcz,
C. E. Brasseur,
Philip S. Cowperthwaite,
Matthew W. Craig,
Christoph Deil,
James Guillochon,
Giannina Guzman,
Simon Liedtke,
Pey Lian Lim,
Kelly E. Lockhart,
Michael Mommert,
Brett M. Morris,
Henrik Norman,
Madhura Parikh,
Magnus V. Persson,
Thomas P. Robitaille,
Juan-Carlos Segovia,
Leo P. Singer,
Erik J. Tollerud,
Miguel de Val-Borro,
Ivan Valtchanov,
Julien Woillez,
the Astroquery collaboration
Abstract:
astroquery is a collection of tools for requesting data from databases hosted on remote servers with interfaces exposed on the internet, including those with web pages but without formal application program interfaces (APIs). These tools are built on the Python requests package, which is used to make HTTP requests, and astropy, which provides most of the data parsing functionality. astroquery modules generally attempt to replicate the web page interface provided by a given service as closely as possible, making the transition from browser-based to command-line interaction easy. astroquery has received significant contributions from throughout the astronomical community, including several significant contributions from telescope archives. astroquery enables the creation of fully reproducible workflows from data acquisition through publication. This paper describes the philosophy, basic structure, and development model of the astroquery package. The complete documentation for astroquery can be found at http://astroquery.readthedocs.io/.
Submitted 14 January, 2019;
originally announced January 2019.
-
Characterizing and Improving the Data Reduction Pipeline for the Keck OSIRIS Integral Field Spectrograph
Authors:
Kelly E. Lockhart,
Tuan Do,
James E. Larkin,
Anna Boehle,
Randy D. Campbell,
Samantha Chappell,
Devin Chu,
Anna Ciurlo,
Maren Cosens,
Michael P. Fitzgerald,
Andrea Ghez,
Jessica R. Lu,
Jim E. Lyke,
Etsuko Mieda,
Alexander R. Rudy,
Andrey Vayner,
Gregory Walth,
Shelley A. Wright
Abstract:
OSIRIS is a near-infrared (1.0--2.4 $μ$m) integral field spectrograph operating behind the adaptive optics system at Keck Observatory, and is one of the first lenslet-based integral field spectrographs. Since its commissioning in 2005, it has been a productive instrument, producing nearly half the laser guide star adaptive optics (LGS AO) papers on Keck. The complexity of its raw data format necessitated a custom data reduction pipeline (DRP) delivered with the instrument in order to iteratively assign flux in overlapping spectra to the proper spatial and spectral locations in a data cube. Other than bug fixes and updates required for hardware upgrades, the bulk of the DRP has not been updated since initial instrument commissioning. We report on the first major comprehensive characterization of the DRP using on-sky and calibration data. We also detail improvements to the DRP including characterization of the flux assignment algorithm; exploration of spatial rippling in the reduced data cubes; and improvements to several calibration files, including the rectification matrix, the bad pixel mask, and the wavelength solution. We present lessons learned from over a decade of OSIRIS data reduction that are relevant to the next generation of integral field spectrograph hardware and data reduction software design.
Submitted 5 December, 2018;
originally announced December 2018.
-
A Slowly Precessing Disk in the Nucleus of M31 as the Feeding Mechanism for a Central Starburst
Authors:
K. E. Lockhart,
J. R. Lu,
H. V. Peiris,
R. M. Rich,
A. Bouchez,
A. M. Ghez
Abstract:
We present a kinematic study of the nuclear stellar disk in M31 at infrared wavelengths using high spatial resolution integral field spectroscopy. The spatial resolution achieved, FWHM = 0."12 (0.45 pc at the distance of M31), has only previously been equaled in spectroscopic studies by space-based long-slit observations. Using adaptive optics-corrected integral field spectroscopy from the OSIRIS instrument at the W. M. Keck Observatory, we map the line-of-sight kinematics over the entire old stellar eccentric disk orbiting the supermassive black hole (SMBH) at a distance of r<4 pc. The peak velocity dispersion is 381 +/- 55 km/s, offset by 0.13 +/- 0.03 from the SMBH, consistent with previous high-resolution long-slit observations. There is a lack of near-infrared (NIR) emission at the position of the SMBH and young nuclear cluster, suggesting a spatial separation between the young and old stellar populations within the nucleus. We compare the observed kinematics with dynamical models from Peiris & Tremaine (2003). The best-fit disk orientation to the NIR flux is [$θ_l$, $θ_i$, $θ_a$] = [-33 +/- 4$^{\circ}$, 44 +/- 2$^{\circ}$, -15 +/- 15$^{\circ}$], which is tilted with respect to both the larger-scale galactic disk and the best-fit orientation derived from optical observations. The precession rate of the old disk is $Ω_P$ = 0.0 +/- 3.9 km/s/pc, lower than the majority of previous observations. This slow precession rate suggests that stellar winds from the disk will collide and shock, driving rapid gas inflows and fueling an episodic central starburst as suggested in Chang et al. (2007).
Submitted 23 January, 2018; v1 submitted 3 October, 2017;
originally announced October 2017.
-
HST/WFC3 Observations of an Off-Nuclear Superbubble in Arp 220
Authors:
Kelly E. Lockhart,
Lisa J. Kewley,
Jessica R. Lu,
Mark G. Allen,
David Rupke,
Daniela Calzetti,
Richard I. Davies,
Michael A. Dopita,
Hauke Engel,
Timothy M. Heckman,
Claus Leitherer,
David B. Sanders
Abstract:
We present a high spatial resolution optical and infrared study of the circumnuclear region in Arp 220, a late-stage galaxy merger. Narrowband imaging using HST/WFC3 has resolved the previously observed peak in H$α$+[NII] emission into a bubble-shaped feature. This feature measures 1.6" in diameter, or 600 pc, and is only 1" northwest of the western nucleus. The bubble is aligned with the western nucleus and the large-scale outflow axis seen in X-rays. We explore several possibilities for the bubble origin, including a jet or outflow from a hidden active galactic nucleus (AGN), outflows from high levels of star formation within the few hundred pc nuclear gas disk, or an ultraluminous X-ray source. An obscured AGN or high levels of star formation within the inner $\sim$100 pc of the nuclei are favored based on the alignment of the bubble and energetics arguments.
Submitted 9 June, 2015;
originally announced June 2015.
-
Stellar and Circumstellar Properties of Class I Protostars
Authors:
L. Prato,
K. E. Lockhart,
Christopher M. Johns-Krull,
John T. Rayner
Abstract:
We present a study of the stellar and circumstellar properties of Class I sources using low-resolution (R~1000) near-infrared K- and L-band spectroscopy. We measure prominent spectral lines and features in 8 objects and use fits to standard star spectra to determine spectral types, visual extinctions, K-band excesses, and water ice optical depths. Four of the seven systems studied are close binary pairs; only one of these systems, Haro 6-10, was angularly resolvable. For certain stars some properties found in our analysis differ substantially from published values; we analyze the origin of these differences. We determine extinction to each source using three different methods and compare and discuss the resulting values. One hypothesis that we were testing, that extinction dominates over the K-band excess in obscuration of the stellar photospheric absorption lines, appears not to be true. Accretion luminosities and mass accretion rates calculated for our targets are highly uncertain, in part the result of our inexact knowledge of extinction. For the six targets we were able to place on an H-R diagram, our age estimates, <2 Myr, are somewhat younger than those from comparable studies. Our results underscore the value of low-resolution spectroscopy in the study of protostars and their environments; however, the optimal approach to the study of Class I sources likely involves a combination of high- and low-resolution near-infrared, mid-infrared, and millimeter wavelength observations. Accurate and precise measurements of extinction in Class I protostars will be key to improving our understanding of these objects.
Submitted 7 February, 2009;
originally announced February 2009.