-
The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models
Authors:
Jorge Abreu-Vicente,
Hannah Sonntag,
Thomas Eidens,
Thomas Lemberger
Abstract:
Introduction: The scientific publishing landscape is expanding rapidly, creating challenges for researchers to stay up-to-date with the evolution of the literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast amount of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts.
Results: We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformer-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.
Conclusions: SourceData-NLP's scale highlights the value of integrating curation into publishing. Models trained with SourceData-NLP will furthermore enable the development of tools able to extract causal hypotheses from the literature and assemble them into knowledge graphs.
Submitted 31 October, 2023;
originally announced October 2023.
-
Fourier-space combination of Planck and Herschel images
Authors:
J. Abreu-Vicente,
A. Stutz,
Th. Henning,
E. Keto,
J. Ballesteros-Paredes,
T. Robitaille
Abstract:
Herschel has revolutionized our ability to measure column densities (N$_{\rm H}$) and temperatures (T) of molecular clouds thanks to its far-infrared multiwavelength coverage. However, the lack of a well-defined background intensity level in the Herschel data limits the accuracy of the N$_{\rm H}$ and T maps. We provide a method that corrects the missing Herschel background intensity levels using the Planck model for foreground Galactic thermal dust emission. We present a Fourier method that combines the publicly available Planck model on large angular scales with the Herschel images on smaller angular scales. We apply our method to two regions spanning a range of Galactic environments: Perseus and the Galactic plane region around $l = 11^\circ$ (HiGal-11). We post-process the combined dust continuum emission images to generate column density and temperature maps. We compare these to previously adopted constant-offset corrections. We find significant differences ($\gtrsim$20\%) over sizeable ($\sim$15\%) areas of the maps, at low column densities ($N_{\rm H}\lesssim10^{22}$\,cm$^{-2}$) and relatively high temperatures ($T\gtrsim20$\,K). To validate the method, we also apply it to synthetic observations of a simulated molecular cloud. Our method successfully corrects the Herschel images, including both the constant-offset intensity level and the scale-dependent background variations measured by Planck. Our method improves on the previous constant-offset corrections, which did not account for variations in the background emission levels.
Submitted 29 June, 2017; v1 submitted 10 May, 2016;
originally announced May 2016.
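The idea of combining a large-scale model with small-scale images in Fourier space can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `combine_fourier`, the parameter `sigma_pix`, and the Gaussian weighting are all assumptions made for illustration, and both images are assumed to be already on the same pixel grid and in the same units.

```python
import numpy as np


def combine_fourier(low_res, high_res, sigma_pix):
    """Blend two co-aligned images in Fourier space (illustrative sketch).

    low_res   : image trusted on large angular scales (e.g. a Planck-like model)
    high_res  : image trusted on small angular scales (e.g. a Herschel-like map)
    sigma_pix : hypothetical crossover scale, in pixels, of the Gaussian weight
    """
    if low_res.shape != high_res.shape:
        raise ValueError("both images must share the same pixel grid")

    ny, nx = low_res.shape
    # Spatial frequencies in cycles per pixel along each axis.
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    k2 = kx**2 + ky**2

    # Fourier transform of a Gaussian with standard deviation sigma_pix.
    # It equals 1 at k = 0 and falls towards 0 at high spatial frequency.
    # Assumption: a Gaussian taper is only one possible choice of weighting.
    weight = np.exp(-2.0 * np.pi**2 * sigma_pix**2 * k2)

    # Large scales come from low_res, small scales from high_res.
    combined_ft = weight * np.fft.fft2(low_res) + (1.0 - weight) * np.fft.fft2(high_res)

    # The weight is real and symmetric, so the result is real up to rounding.
    return np.fft.ifft2(combined_ft).real
```

Because the weight is exactly 1 at zero spatial frequency, the mean level of the combined image is taken entirely from `low_res`, which is how a missing constant background in `high_res` is restored in this sketch.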
-
Resolving the fragmentation of high line-mass filaments with ALMA: the integral shaped filament in Orion A
Authors:
Jouni Kainulainen,
Amelia M. Stutz,
Thomas Stanke,
Jorge Abreu-Vicente,
Henrik Beuther,
Thomas Henning,
Katharine G. Johnston,
Tom Megeath
Abstract:
We study the fragmentation of the nearest high line-mass filament, the integral shaped filament (ISF, line-mass $\sim$ 400 M$_\odot$ pc$^{-1}$) in the Orion A molecular cloud. We have observed a 1.6 pc long section of the ISF with the Atacama Large Millimeter/submillimeter Array (ALMA) at 3 mm continuum emission, at a resolution of $\sim$3" (1 200 AU). We identify from the region 43 dense cores with masses about a solar mass. 60\% of the ALMA cores are protostellar and 40\% are starless. The nearest neighbour separations of the cores do not show a preferred fragmentation scale; the frequency of short separations increases down to 1 200 AU. We apply a two-point correlation analysis on the dense core separations and show that the ALMA cores are significantly grouped at separations below $\sim$17 000 AU and strongly grouped below $\sim$6 000 AU. The protostellar and starless cores are grouped differently: only the starless cores group strongly below $\sim$6 000 AU. In addition, the spatial distribution of the cores indicates periodic grouping of the cores into groups of $\sim$30 000 AU in size, separated by $\sim$50 000 AU. The groups coincide with dust column density peaks detected by Herschel. These results show hierarchical, two-mode fragmentation in which the maternal filament periodically fragments into groups of dense cores. Critically, our results indicate that the fragmentation models for lower line-mass filaments ($\sim$ 16 M$_\odot$ pc$^{-1}$) fail to capture the observed properties of the ISF. We also find that the protostars identified with Spitzer and Herschel in the ISF are grouped at separations below $\sim$17 000 AU. In contrast, young stars with disks do not show significant grouping. This suggests that the grouping of dense cores is partially retained over the protostar lifetime, but not over the lifetime of stars with disks.
Submitted 26 January, 2017; v1 submitted 17 March, 2016;
originally announced March 2016.
-
Giant molecular filaments in the Milky Way II: The fourth Galactic quadrant
Authors:
J. Abreu-Vicente,
S. Ragan,
J. Kainulainen,
Th. Henning,
H. Beuther,
K. Johnston
Abstract:
Filamentary structures are common morphological features of the cold, molecular interstellar medium (ISM). Recent studies have discovered massive, hundred-parsec-scale filaments that may be connected to the large-scale, Galactic spiral arm structure. Addressing the nature of these Giant Molecular Filaments (GMFs) requires a census of their occurrence and properties. We perform a systematic search for GMFs in the fourth Galactic quadrant and determine their basic physical properties. We identify GMFs based on their dust extinction signatures in the near- and mid-infrared and their velocity structure probed by ^{13}CO line emission. We use the ^{13}CO line emission and ATLASGAL dust emission data to estimate the total and dense gas masses of the GMFs. We combine our sample with an earlier sample from the literature and study the Galactic environment of the GMFs. We identify nine GMFs in the fourth Galactic quadrant; six are located in the Centaurus spiral arm and three in inter-arm regions. Combining this sample with an earlier study using the same identification criteria in the first Galactic quadrant results in 16 GMFs, nine of which are located within spiral arms. The GMFs have sizes of 80-160 pc and ^{13}CO-derived masses of 5-90 x 10^{4} Msun. Their dense gas mass fractions are between 1.5% and 37%, being higher in the GMFs connected to spiral arms. We also compare the different GMF-identification methods and find that emission- and extinction-based techniques overlap only partially, highlighting the need to use both to achieve a complete census.
Submitted 25 April, 2016; v1 submitted 17 March, 2016;
originally announced March 2016.
-
Relationship between the column density distribution and evolutionary class of molecular clouds as viewed by ATLASGAL
Authors:
J. Abreu-Vicente,
J. Kainulainen,
A. Stutz,
T. Henning,
H. Beuther
Abstract:
We present the first study of the relationship between the column density distribution of molecular clouds within nearby Galactic spiral arms and their evolutionary status as measured from their stellar content. We analyze a sample of 195 molecular clouds located at distances below 5.5 kpc, identified from the ATLASGAL 870 micron data. We define three evolutionary classes within this sample: starless clumps, star-forming clouds with associated young stellar objects, and clouds associated with HII regions. We find that the N(H2) probability density functions (N-PDFs) of these three classes of objects are clearly different: the N-PDFs of starless clumps are narrowest and close to log-normal in shape, while star-forming clouds and HII regions exhibit a power-law shape over a wide range of column densities and log-normal-like components only at low column densities. We use the N-PDFs to estimate the evolutionary time-scales of the three classes of objects based on a simple analytic model from the literature. Finally, we show that the integral of the N-PDFs, the dense gas mass fraction, depends on the total mass of the regions as measured by ATLASGAL: more massive clouds contain greater relative amounts of dense gas across all evolutionary classes.
Submitted 2 July, 2015;
originally announced July 2015.
-
Gas and dust cooling along the major axis of M33 (HerM33es): ISO/LWS CII observations
Authors:
C. Kramer,
J. Abreu-Vicente,
S. Garcia-Burillo,
M. Relano,
S. Aalto,
M. Boquien,
J. Braine,
C. Buchbender,
P. Gratier,
F. P. Israel,
T. Nikola,
M. Roellig,
S. Verley,
P. van der Werf,
E. M. Xilouris
Abstract:
We aim to better understand the heating of the gas by observing the prominent gas cooling line [CII] at 158 μm in the low-metallicity environment of the Local Group spiral galaxy M33 at scales of 280 pc. In particular, we aim to describe the variation of the photoelectric heating efficiency with galactic environment. In this unbiased study, we used ISO/LWS [CII] observations along the major axis of M33, in combination with Herschel PACS and SPIRE continuum maps, IRAM 30 m CO 2-1 and VLA HI data to study the variation of velocity integrated intensities. The ratio of [CII] emission over the far-infrared continuum is used as a proxy for the heating efficiency, and models of photon-dominated regions (PDRs) are used to study the local physical densities, FUV radiation fields, and average column densities of the molecular clouds. The heating efficiency stays constant at 0.8% within the inner 4.5 kpc radius of the galaxy, beyond which it increases, reaching values of ~3% in the outskirts at about 6 kpc radial distance. The rise of efficiency is explained in the framework of PDR models by lowered volume densities and FUV fields, for optical extinctions of only a few magnitudes at constant metallicity. In view of the significant fraction of HI emission stemming from PDRs, and for typical pressures found in the Galactic cold neutral medium (CNM) traced by HI emission, the CNM contributes ~15% to the observed [CII] emission in the inner 2 kpc radius of M33. The CNM contribution remains largely undetermined in the south, while positions between 2 and 7.3 kpc radial distance in the north of M33 show a contribution of ~40% ± 20%.
Submitted 31 March, 2013;
originally announced April 2013.
-
Spectral Energy Distributions of HII regions in M33 (HerM33es)
Authors:
M. Relano,
S. Verley,
I. Perez,
C. Kramer,
D. Calzetti,
E. M. Xilouris,
M. Boquien,
J. Abreu-Vicente,
F. Combes,
F. Israel,
F. S. Tabatabaei,
J. Braine,
C. Buchbender,
M. Gonzalez,
P. Gratier,
S. Lord,
B. Mookerjea,
G. Quintana-Lacaci,
P. van der Werf
Abstract:
Within the framework of the Herschel M 33 extended survey HerM33es we study the Spectral Energy Distribution (SED) of a set of HII regions in M 33 as a function of the morphology. We present a catalogue of 119 HII regions morphologically classified: 9 filled, 47 mixed, 36 shell, and 27 clear shell HII regions. For each object we extract the photometry at twelve available wavelength bands (from FUV 1516 Å to IR 250 μm) and obtain the SED. We also obtain emission line profiles across the regions to study the location of the stellar, ionised gas, and dust components. We find trends for the SEDs related to the morphology, showing that the star and gas-dust configuration affects the ratios of the emission in different bands. The mixed and filled regions show higher emission at 24 μm than the shells and clear shells, which could be due to the proximity of the dust to the stellar clusters in the case of mixed and filled regions. The FIR peak for shells and clear shells seems to be located towards longer wavelengths, indicating that the dust is colder for this type of object. The logarithmic 100/70 μm ratio for filled and mixed regions remains constant over one order of magnitude in Halpha and FUV surface brightness, while the shells and clear shells exhibit a wider range of values of almost two orders of magnitude. We derive dust masses and temperatures by fitting the individual SEDs with dust models proposed in the literature. The derived dust masses range between 10^2 and 10^4 Msun and the cold dust temperature spans T(cold)~12-27 K. The spherical geometrical model proposed for the Halpha clear shells is confirmed by the emission profile obtained from the observations and is used to infer the electron density within the envelope: the typical electron density is 0.7 ± 0.3 cm^-3, while filled regions can reach values two to five times higher.
Submitted 24 January, 2013;
originally announced January 2013.