Search | arXiv e-print repository

How Effective are State Space Models for Machine Translation?

Authors: Hugo Pitorro, Pavlo Vasylenko, Marcos Treviso, André F. T. Martins

Abstract: Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts. Recent works propose to replace attention with linear recurrent layers -- this is the case for state space models, which enjoy efficient training and inference. However, it remains unclear whether these models are competitive with transformers in machine translation (MT). In this paper, we provide a rigorous and comprehensive experimental comparison between transformers and linear recurrent models for MT. Concretely, we experiment with RetNet, Mamba, and hybrid versions of Mamba which incorporate attention mechanisms. Our findings demonstrate that Mamba is highly competitive with transformers on sentence and paragraph-level datasets, where in the latter both models benefit from shifting the training distribution towards longer sequences. Further analysis shows that integrating attention into Mamba improves translation quality, robustness to sequence length extrapolation, and the ability to recall named entities.

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.03137 [pdf, other]

X-Shooting ULLYSES: Massive Stars at low metallicity -- IV. Spectral analysis methods and exemplary results for O stars

Authors: A. A. C. Sander, J. -C. Bouret, M. Bernini-Peron, J. Puls, F. Backs, S. R. Berlanas, J. M. Bestenlehner, S. A. Brands, A. Herrero, F. Martins, O. Maryeva, D. Pauli, V. Ramachandran, P. A. Crowther, V. M. A. Gómez-González, A. C. Gormaz-Matamala, W. -R. Hamann, D. J. Hillier, R. Kuiper, C. J. K. Larkin, R. R. Lefever, A. Mehner, F. Najarro, L. M. Oskinova, E. C. Schösser, et al. (4 additional authors not shown)

Abstract: CONTEXT: The spectral analysis of hot, massive stars is a fundamental astrophysical method to obtain their intrinsic properties and their feedback. Quantitative spectroscopy for hot, massive stars requires detailed numerical modeling of the atmosphere and an iterative treatment to obtain the best solution within a given framework. AIMS: We present an overview of different techniques for the quantitative spectroscopy of hot stars employed within the X-Shooting ULLYSES collaboration, from grid-based approaches to tailored fits. By performing a blind test, we gain an overview about the similarities and differences of the resulting parameters. Our study aims to provide an overview of the parameter spread caused by different approaches. METHODS: For three different stars from the sample (SMC O5 star AzV 377, LMC O7 star Sk -69 50, and LMC O9 star Sk -66 171), we employ different atmosphere codes (CMFGEN, Fastwind, PoWR) and strategies to determine their best-fitting model. For our analyses, UV and optical spectra are used to derive the properties with some methods relying purely on optical data for comparison. To determine the overall spectral energy distribution, we further employ additional photometry from the literature. RESULTS: Effective temperatures for each of the three sample stars agree within 3 kK while the differences in log g can be up to 0.2 dex. Luminosity differences of up to 0.1 dex result from different reddening assumptions, which seem to be larger for the methods employing a genetic algorithm. All sample stars are nitrogen-enriched. CONCLUSIONS: We find a reasonable agreement between the different methods. Tailored fitting tends to be able to minimize discrepancies obtained with more coarse or automatized treatments. UV spectral data is essential for the determination of realistic wind parameters. For one target (Sk -69 50), we find clear indications of an evolved status.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 18+15 pages, 21+4 figures, under review at A&A, condensed abstract

arXiv:2407.00436 [pdf, other]

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Authors: Peiqin Lin, André F. T. Martins, Hinrich Schütze

Abstract: Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19482 [pdf, other]

xTower: A Multilingual LLM for Explaining and Correcting Translation Errors

Authors: Marcos Treviso, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, José Pombal, Tania Vaz, Helena Wu, Beatriz Silva, Daan van Stigt, André F. T. Martins

Abstract: While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the explanations generated by xTower is assessed via both intrinsic and extrinsic evaluation. We ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness towards the error span being explained and helpfulness in error understanding and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower's potential towards not only producing plausible and helpful explanations of automatic translations, but also leveraging them to suggest corrected translations.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.18403 [pdf, other]

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Authors: Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni

Abstract: There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.10913 [pdf, other]

Minimal evolution times for fast, pulse-based state preparation in silicon spin qubits

Authors: Christopher K. Long, Nicholas J. Mayhall, Sophia E. Economou, Edwin Barnes, Crispin H. W. Barnes, Frederico Martins, David R. M. Arvidsson-Shukur, Normann Mertig

Abstract: Standing as one of the most significant barriers to reaching quantum advantage, state-preparation fidelities on noisy intermediate-scale quantum processors suffer from quantum-gate errors, which accumulate over time. A potential remedy is pulse-based state preparation. We numerically investigate the minimal evolution times (METs) attainable by optimizing (microwave and exchange) pulses on silicon hardware. We investigate two state preparation tasks. First, we consider the preparation of molecular ground states and find the METs for H$_2$, HeH$^+$, and LiH to be 2.4 ns, 4.4 ns, and 27.2 ns, respectively. Second, we consider transitions between arbitrary states and find the METs for transitions between arbitrary four-qubit states to be below 50 ns. For comparison, connecting arbitrary two-qubit states via one- and two-qubit gates on the same silicon processor requires approximately 200 ns. This comparison indicates that pulse-based state preparation is likely to utilize the coherence times of silicon hardware more efficiently than gate-based state preparation. Finally, we quantify the effect of silicon device parameters on the MET. We show that increasing the maximal exchange amplitude from 10 MHz to 1 GHz accelerates the METs, e.g., for H$_2$ from 84.3 ns to 2.4 ns. This demonstrates the importance of fast exchange. We also show that increasing the maximal amplitude of the microwave drive from 884 kHz to 56.6 MHz shortens state transitions, e.g., for two-qubit states from 1000 ns to 25 ns. Our results bound both the state-preparation times for general quantum algorithms and the execution times of variational quantum algorithms with silicon spin qubits.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 9 + (7) pages, 6 figs, comments are welcomed

arXiv:2406.09689 [pdf, other]

Physical networks become what they learn

Authors: Menachem Stern, Marcelo Guzman, Felipe Martins, Andrea J Liu, Vijay Balasubramanian

Abstract: Physical networks can develop diverse responses, or functions, by design, evolution or learning. We focus on electrical networks of nodes connected by resistive edges. Such networks can learn by adapting edge conductances to lower a cost function that penalizes deviations from a desired response. The network must also satisfy Kirchhoff's law, balancing currents at nodes, or, equivalently, minimizing total power dissipation by adjusting node voltages. The adaptation is thus a double optimization process, in which a cost function is minimized with respect to conductances, while dissipated power is minimized with respect to node voltages. Here we study how this physical adaptation couples the cost landscape, the landscape of the cost function in the high-dimensional space of edge conductances, to the physical landscape, the dissipated power in the high-dimensional space of node voltages. We show how adaptation links the physical and cost Hessian matrices, suggesting that the physical response of networks to perturbations holds significant information about the functions to which they are adapted.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 6 pages, 2 figures

arXiv:2406.00049 [pdf, other]

QUEST: Quality-Aware Metropolis-Hastings Sampling for Machine Translation

Authors: Gonçalo R. A. Faria, Sweta Agrawal, António Farinhas, Ricardo Rei, José G. C. de Souza, André F. T. Martins

Abstract: An important challenge in machine translation (MT) is to generate high-quality and diverse translations. Prior work has shown that the estimated likelihood from the MT model correlates poorly with translation quality. In contrast, quality evaluation metrics (such as COMET or BLEURT) exhibit high correlations with human judgments, which has motivated their use as rerankers (such as quality-aware and minimum Bayes risk decoding). However, relying on a single translation with high estimated quality increases the chances of "gaming the metric". In this paper, we address the problem of sampling a set of high-quality and diverse translations. We provide a simple and effective way to avoid over-reliance on noisy quality estimates by using them as the energy function of a Gibbs distribution. Instead of looking for a mode in the distribution, we generate multiple samples from high-density areas through the Metropolis-Hastings algorithm, a simple Markov chain Monte Carlo approach. The results show that our proposed method leads to high-quality and diverse outputs across multiple language pairs (English$\leftrightarrow${German, Russian}) with two strong decoder-only LLMs (Alma-7b, Tower-7b).

Submitted 28 May, 2024; originally announced June 2024.

arXiv:2405.18348 [pdf, other]

Can Automatic Metrics Assess High-Quality Translations?

Authors: Sweta Agrawal, António Farinhas, Ricardo Rei, André F. T. Martins

Abstract: Automatic metrics for evaluating translation quality are typically validated by measuring how well they correlate with human assessments. However, correlation methods tend to capture only the ability of metrics to differentiate between good and bad source-translation pairs, overlooking their reliability in distinguishing alternative translations for the same source. In this paper, we confirm that this is indeed the case by showing that current metrics are insensitive to nuanced differences in translation quality. This effect is most pronounced when the quality is high and the variance among alternatives is low. Given this finding, we shift towards detecting high-quality correct translations, an important problem in practical decision-making scenarios where a binary check of correctness is prioritized over a nuanced evaluation of quality. Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans. Our findings reveal that current metrics often over- or underestimate translation quality, indicating significant room for improvement in automatic evaluation methods.

Submitted 28 May, 2024; originally announced May 2024.

Comments: work in progress

arXiv:2405.05116 [pdf, other]

XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples

Authors: Peiqin Lin, André F. T. Martins, Hinrich Schütze

Abstract: Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on the multilingual text classification benchmark SIB200 with 176 languages show that XAMPLER substantially improves the in-context learning performance across languages. Our code is available at https://github.com/cisnlp/XAMPLER.

Submitted 29 June, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.01976 [pdf, other]

Conformal Prediction for Natural Language Processing: A Survey

Authors: Margarida M. Campos, António Farinhas, Chrysoula Zerva, Mário A. T. Figueiredo, André F. T. Martins

Abstract: The rapid proliferation of large language models and natural language processing (NLP) applications creates a crucial need for uncertainty quantification to mitigate risks such as hallucinations and to enhance decision-making reliability in critical applications. Conformal prediction is emerging as a theoretically sound and practically useful framework, combining flexibility with strong statistical guarantees. Its model-agnostic and distribution-free nature makes it particularly promising to address the current shortcomings of NLP systems that stem from the absence of uncertainty quantification. This paper provides a comprehensive survey of conformal prediction techniques, their guarantees, and existing applications in NLP, pointing to directions for future research and open challenges.

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.01267 [pdf, ps, other]

X-Shooting ULLYSES: Massive stars at low metallicity -- V. Effect of metallicity on surface abundances of O stars

Authors: F. Martins, J. -C. Bouret, D. J. Hillier, S. A. Brands, P. A. Crowther, A. Herrero, F. Najarro, D. Pauli, J. Puls, V. Ramachandran, A. A. C. Sander, J. S. Vink, the XshootU collaboration

Abstract: Massive stars rotate faster, on average, than lower mass stars. Stellar rotation triggers hydrodynamical instabilities which transport angular momentum and chemical species from the core to the surface. Models of high-mass stars that include these processes predict that chemical mixing is stronger at lower metallicity. We aim to test this prediction by comparing the surface abundances of massive stars at different metallicities. We performed a spectroscopic analysis of single O stars in the Magellanic Clouds (MCs) based on the ULLYSES and XshootU surveys. We determined the fundamental parameters and helium, carbon, nitrogen, and oxygen surface abundances of 17 LMC and 17 SMC non-supergiant O6-9.5 stars. We complemented these determinations by literature results for additional MCs and also Galactic stars to increase the sample size and metallicity coverage. We investigated the differences in the surface chemical enrichment at different metallicities and compared them with predictions of three sets of evolutionary models. Surface abundances are consistent with CNO-cycle nucleosynthesis. The maximum surface nitrogen enrichment is stronger in MC stars than in Galactic stars. Nitrogen enrichment is also observed in stars with higher surface gravities in the SMC than in the Galaxy. This trend is predicted by models that incorporate chemical transport caused by stellar rotation. The distributions of projected rotational velocities in our samples are likely biased towards slow rotators. A metallicity dependence of surface abundances is demonstrated. The analysis of larger samples with an unbiased distribution of projected rotational velocities is required to better constrain the treatment of chemical mixing and angular momentum transport in massive single and binary stars.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 15 pages + appendix. Accepted in Astronomy & Astrophysics

arXiv:2405.00085 [pdf]

X-Shooting ULLYSES: Massive Stars at Low Metallicity

Authors: Jorick S. Vink, Paul Crowther, Alex Fullerton, Miriam Garcia, Fabrice Martins, Nidia Morrell, Lida Oskinova, Nicole St. Louis, Asif ud-Doula, Andreas Sander, Hugues Sana, Jean-Claude Bouret, Brankica Kubatova, Pablo Marchant, Lucimara P. Martins, Aida Wofford, Jacco van Loon, O. Grace Telford, Ylva Götberg, Dominic Bowman, Christi Erba, Venu Kalari, The XShootU Collaboration

Abstract: The Hubble Space Telescope has devoted 500 orbits to observing 250 massive stars with low metallicity in the ultraviolet (UV) range within the framework of the ULLYSES program. The X-Shooting ULLYSES (XShootU) project enhances the legacy value of this UV dataset by providing high-quality optical and near-infrared spectra, which are acquired using the wide-wavelength-coverage X-shooter spectrograph at ESO's Very Large Telescope. XShootU emphasises the importance of combining UV with optical spectra for the consistent determination of key stellar parameters such as effective temperature, surface gravity, luminosity, abundances, and wind characteristics including mass-loss rates as a function of metallicity. Since uncertainties in these parameters have implications across various branches of astrophysics, the data and modelling generated by the XShootU project are poised to significantly advance our understanding of massive stars at low metallicity. This is particularly crucial for confidently interpreting JWST data of the earliest stellar generations, making XShootU a unique resource for comprehending individual spectra of low-metallicity stars.

Submitted 30 April, 2024; originally announced May 2024.

Comments: 6 pages, 6 figures. ESO Large Programme Overview

Journal ref: ESO Messenger, 2024

arXiv:2403.12888 [pdf, other]

Electrical readout of spins in the absence of spin blockade

Authors: Felix-Ekkehard von Horstig, Lorenzo Peri, Sylvain Barraud, Jason A. W. Robinson, Monica Benito, Frederico Martins, M. Fernando Gonzalez-Zalba

Abstract: In semiconductor nanostructures, spin blockade (SB) is the most scalable mechanism for electrical spin readout requiring only two bound spins for its implementation which, in conjunction with charge sensing techniques, has led to high-fidelity readout of spins in semiconductor-based quantum processors. However, various mechanisms may lift SB, such as strong spin-orbit coupling (SOC) or low-lying excited states, hence posing challenges to perform spin readout at scale and with high fidelity in such systems. Here, we present a method, based on the dependence of the two-spin system polarizability on energy detuning, to perform spin state readout even when SB lifting mechanisms are dominant. It leverages SB lifting as a resource to detect different spin measurement outcomes selectively and positively. We demonstrate the method using a hybrid system formed by a quantum dot (QD) and a Boron acceptor in a silicon p-type transistor and show spin selective and positive readout of different spin states under SB lifting conditions due to (i) SOC and (ii) low-lying orbital states in the QD. We further use the method to determine the detuning-dependent spin relaxation time of 0.1-8 μs. Our method should help perform high-fidelity projective spin measurements in systems subject to strong SOC and may facilitate quantum tomography and state leakage studies.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 13 pages, 10 figures

arXiv:2403.08314 [pdf, other]

Is Context Helpful for Chat Translation Evaluation?

Authors: Sweta Agrawal, Amin Farajian, Patrick Fernandes, Ricardo Rei, André F. T. Martins

Abstract: Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts like news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This poses questions about the reliability of existing sentence-level metrics in this domain as well as the role of context in assessing the translation quality. Motivated by this, we conduct a meta-evaluation of existing sentence-level automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.03923 [pdf, other]

Did Translation Models Get More Robust Without Anyone Even Noticing?

Authors: Ben Peters, André F. T. Martins

Abstract: Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to "noisy" inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and large language models (LLMs) applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments -- LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.03883 [pdf, other]

SaulLM-7B: A pioneering Large Language Model for Law

Authors: Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa

Abstract: In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

Submitted 7 March, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.17733 [pdf, other]

Tower: An Open Multilingual Large Language Model for Translation-Related Tasks

Authors: Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins

Abstract: While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our final model surpasses open alternatives on several tasks relevant to translation workflows and is competitive with general-purpose closed LLMs. To facilitate future research, we release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations, including ours, on our benchmark.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.13725 [pdf, other]

Sparse and Structured Hopfield Networks

Authors: Saul Santos, Vlad Niculae, Daniel McNamee, Andre F. T. Martins

Abstract: Modern Hopfield networks have enjoyed recent interest due to their connection to attention in transformers. Our paper provides a unified framework for sparse Hopfield networks by establishing a link with Fenchel-Young losses. The result is a new family of Hopfield-Fenchel-Young energies whose update rules are end-to-end differentiable sparse transformations. We reveal a connection between loss margins, sparsity, and exact memory retrieval. We further extend this framework to structured Hopfield networks via the SparseMAP transformation, which can retrieve pattern associations instead of a single pattern. Experiments on multiple instance learning and text rationalization demonstrate the usefulness of our approach.

Submitted 4 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

Comments: 20 pages, 4 figures

arXiv:2402.00786 [pdf, other]

CroissantLLM: A Truly Bilingual French-English Language Model

Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

Submitted 29 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

arXiv:2402.00707 [pdf, other]

Non-Exchangeable Conformal Language Generation with Nearest Neighbors

Authors: Dennis Ulmer, Chrysoula Zerva, André F. T. Martins

Abstract: Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework to provide predictions imbued with statistical guarantees, however, its application to text generation is challenging since any i.i.d. assumptions are not realistic. In this paper, we bridge this gap by leveraging recent results on non-exchangeable conformal prediction, which still ensures bounds on coverage. The result, non-exchangeable conformal nucleus sampling, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By also producing tighter prediction sets with good coverage, we thus give a more theoretically principled way to perform sampling with conformal guarantees.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.16165 [pdf, other]

doi 10.1051/0004-6361/202449184

Evidence for Very Massive Stars in extremely UV-bright star-forming galaxies at $z \sim 2.2-3.6$

Authors: A. Upadhyaya, R. Marques-Chaves, D. Schaerer, F. Martins, I. Pérez-Fournon, A. Palacios, E. R. Stanway

Abstract: We present a comprehensive analysis of the presence of very massive stars (VMS > $100 M_{\odot}$) in the integrated spectra of 13 UV-bright star-forming galaxies at $2.2 \lesssim z \lesssim 3.6$ taken with the Gran Telescopio Canarias (GTC). These galaxies have very high UV absolute magnitudes ($M_{\rm UV} \simeq -24$), intense star formation (SFR $ \simeq 100-1000$ $M_{\odot}$ yr$^{-1}$), and metallicities in the range of 12+log(O/H) $\simeq8.10-8.50$ inferred from strong rest-optical lines. The GTC rest-UV spectra reveal spectral features indicative of very young stellar populations with VMS, such as strong P-Cygni line profiles in the wind lines N~{\sc v} $λ1240$ and C~{\sc iv} $λ1550$ along with intense and broad He~{\sc ii} $λ1640$ emission with $EW_{0}$ $\simeq 1.40-4.60$ Å, and FWHM $\simeq 1150-3170$ $km \ s^{-1}$. A comparison with known VMS-dominated sources and typical galaxies without VMS reveals that some UV-bright galaxies closely resemble VMS-dominated clusters (e.g., R136 cluster). The presence of VMS is further supported by a quantitative comparison of the observed strength of the He~{\sc ii} emission with population synthesis models with and without VMS, where models with VMS are clearly preferred. Employing an empirical threshold for $EW_{0}$ (He~{\sc ii}) $\geq 3.0$ Å, along with the detection of other VMS-related spectral profiles (N~{\sc iv} $λ1486, 1719$), we classify nine out of 13 UV-bright galaxies as VMS-dominated sources. This high incidence of VMS-dominated sources in the UV-bright galaxy population ($\approx 70\%$) contrasts significantly with the negligible presence of VMS in typical $L_{\rm UV}^{*}$ LBGs at similar redshifts ($<1\%$). Our results thus indicate that VMS are common in UV-bright galaxies, suggesting a different initial mass function (IMF) with upper mass limits between $175 M_{\odot}$ and $475 M_{\odot}$.

Submitted 3 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 20 pages, 11 Figures, Accepted for Publication in Astronomy & Astrophysics

Journal ref: A&A 686, A185 (2024)

arXiv:2401.13303 [pdf, other]

MaLA-500: Massive Language Adaptation of Large Language Models

Authors: Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze

Abstract: Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM

Submitted 3 April, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

arXiv:2312.00282 [pdf, other]

Stochastic volatility models with skewness selection

Authors: Igor Ferreira Batista Martins, Hedibert Freitas Lopes

Abstract: This paper expands traditional stochastic volatility models by allowing for time-varying skewness without imposing it. While dynamic asymmetry may capture the likely direction of future asset returns, it comes at the risk of leading to overparameterization. Our proposed approach mitigates this concern by leveraging sparsity-inducing priors to automatically select the skewness parameter as being dynamic, static or zero in a data-driven framework. We consider two empirical applications. First, in a bond yield application, dynamic skewness captures interest rate cycles of monetary easing and tightening being partially explained by central banks' mandates. In a currency modeling framework, our model indicates no skewness in the carry factor after accounting for stochastic volatility which supports the idea of carry crashes being the result of volatility surges instead of dynamic skewness.

Submitted 30 November, 2023; originally announced December 2023.

Comments: 22 pages, 8 figures

arXiv:2311.09132 [pdf, other]

Aligning Neural Machine Translation Models: Human Feedback in Training and Inference

Authors: Miguel Moura Ramos, Patrick Fernandes, António Farinhas, André F. T. Martins

Abstract: Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF's success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.

Submitted 4 July, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: EAMT 2024

arXiv:2310.13448 [pdf, other]

Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning

Authors: Duarte M. Alves, Nuno M. Guerreiro, João Alves, José Pombal, Ricardo Rei, José G. C. de Souza, Pierre Colombo, André F. T. Martins

Abstract: Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.

Submitted 20 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP 2023 - Findings

arXiv:2310.11430 [pdf, other]

An Empirical Study of Translation Hypothesis Ensembling with Large Language Models

Authors: António Farinhas, José G. C. de Souza, André F. T. Martins

Abstract: Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output. In this paper, we investigate how hypothesis ensembling can improve the quality of the generated text for the specific problem of LLM-based machine translation. We experiment with several techniques for ensembling hypotheses produced by LLMs such as ChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple dimensions, including the method to generate hypotheses (multiple prompts, temperature-based sampling, and beam search) and the strategy to produce the final translation (instruction-based, quality-based reranking, and minimum Bayes risk (MBR) decoding). Our results show that MBR decoding is a very effective method, that translation quality can be improved using a small number of samples, and that instruction tuning has a strong impact on the relation between the diversity of the hypotheses and the sampling temperature.

Submitted 17 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 (main conference)

arXiv:2310.10482 [pdf, other]

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

Authors: Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, André F. T. Martins

Abstract: Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.

Submitted 16 October, 2023; originally announced October 2023.

Comments: Work in progress

arXiv:2310.06539 [pdf, ps, other]

doi 10.1051/0004-6361/202347909

Surface chemical composition of single WNh stars

Authors: Fabrice Martins

Abstract: Wolf-Rayet (WR) stars of the WNh category contain a significant fraction of hydrogen at their surface. They can be hydrogen-burning, very massive stars or stars in a post-main sequence phase of evolution. Also, WNh stars are sometimes not included in population synthesis models. We aim to better characterise the properties of single WNh stars in the Galaxy and the Magellanic Clouds. In particular, we want to constrain their surface chemistry beyond the hydrogen content by determining the helium, carbon, and nitrogen surface abundances. We perform a spectroscopic analysis of 22 single WNh stars. We fit their ultraviolet and/or optical spectra using synthetic spectra computed with the code CMFGEN. We determine the main stellar parameters (temperature, luminosity, mass-loss rates) and the surface H, He, C, and N mass fractions. We investigate the ability of current evolutionary models to reproduce all parameters at the same time. We find that all WNh stars show the signatures of CNO-cycle material at their surface: they are carbon-depleted and nitrogen-rich. A clear trend of higher nitrogen content at higher metallicity is observed, as expected. The amount of hydrogen (X) varies significantly from one star to another, independently of luminosity. Values of X larger than 0.4 are not exceptional. The majority of Galactic WNh stars can be explained by evolutionary models, provided sufficient fine-tuning of the input parameters of evolutionary calculations. At lower metallicity, most stars escape predictions from evolutionary models. This has been noted in the literature but constraints on the surface nitrogen content exacerbate this severe issue. Our study highlights the need to refine the treatment of WR stars in both stellar evolution and population synthesis models.

Submitted 10 October, 2023; originally announced October 2023.

Comments: 16 pages, 12 figures + appendix. Accepted in Astronomy & Astrophysics

Journal ref: A&A 680, A22 (2023)

arXiv:2310.01262 [pdf, other]

Non-Exchangeable Conformal Risk Control

Authors: António Farinhas, Chrysoula Zerva, Dennis Ulmer, André F. T. Martins

Abstract: Split conformal prediction has recently sparked great interest due to its ability to provide formally guaranteed uncertainty sets or intervals for predictions made by black-box neural models, ensuring a predefined probability of containing the actual ground truth. While the original formulation assumes data exchangeability, some extensions handle non-exchangeable data, which is often the case in many real-world scenarios. In parallel, some progress has been made in conformal methods that provide statistical guarantees for a broader range of objectives, such as bounding the best $F_1$-score or minimizing the false negative rate in expectation. In this paper, we leverage and extend these two lines of work by proposing non-exchangeable conformal risk control, which allows controlling the expected value of any monotone loss function when the data is not exchangeable. Our framework is flexible, makes very few assumptions, and allows weighting the data based on its relevance for a given test example; a careful choice of weights may result in tighter bounds, making our framework useful in the presence of change points, time series, or other forms of distribution drift. Experiments with both synthetic and real world data show the usefulness of our method.

Submitted 26 January, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2309.11925 [pdf, other]

Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task

Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, André F. T. Martins

Abstract: We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated in all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state-of-the-art COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared-task with up to 3.8 absolute points.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2308.14489 [pdf, ps, other]

doi 10.1051/0004-6361/202346732

Inferring the presence of very massive stars in local star-forming regions

Authors: Fabrice Martins, Daniel Schaerer, Rui Marques-Chaves, Ankur Upadhyaya

Abstract: We present a study aiming at detecting VMS in local star-forming regions from the imprint they leave on the integrated UV and optical light. We analyzed a sample of 27 star-forming regions and galaxies in the local Universe. We selected sources with a metallicity close to that of the LMC. We defined empirical criteria to distinguish sources dominated by VMS and Wolf-Rayet stars (WR), using template spectra of VMS- and WR-dominated regions. We subsequently built population synthesis models with an updated treatment of VMS. We show that the UV range alone is not sufficient to distinguish between VMS- and WR-dominated sources. The region of the WR bumps in the optical breaks the degeneracy. In particular, the morphology of the blue bump at 4640-4686 A is a key diagnostic. Beyond the prototypical R136 region we identify two galaxies showing clear signatures of VMS. In two other galaxies or regions the presence of VMS can be suspected, as already discussed in the literature. The stellar population is clearly dominated by WR stars in seven other sources. The most recent BPASS population synthesis models can neither account for the strong HeII 1640 emission, nor for the shape of the blue bump in VMS- and WR-dominated sources. Our models that include VMS more realistically reproduce the UV-optical spectra of VMS-dominated sources. We conclude that VMS are present in some local star-forming regions, but that separating them from WR-dominated populations requires optical spectroscopy with a high signal-to-noise ratio. A high equivalent width of HeII 1640 is not a sufficient condition for identifying VMS. Population synthesis models need to take VMS into account by incorporating not only evolutionary tracks, but also dedicated spectral libraries. Finally, we stress that the treatment of WR stars needs to be improved as well.

Submitted 28 August, 2023; originally announced August 2023.

Comments: 16 pages, 10 figures + appendix. Accepted in Astronomy and Astrophysics

Journal ref: A&A 678, A159 (2023)

arXiv:2308.07286 [pdf, other]

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat

Abstract: Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.

Submitted 14 August, 2023; originally announced August 2023.

Comments: 19 pages

arXiv:2307.10018 [pdf, other]

RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023

Authors: Aline Lima de Oliveira, Cauê Addae da Silva Gomes, Cecília Virginia Santos da Silva, Charles Matheus de Sousa Alves, Danilo Andrade Martins de Souza, Driele Pires Ferreira Araújo Xavier, Edgleyson Pereira da Silva, Felipe Bezerra Martins, Lucas Henrique Cavalcanti Santos, Lucas Dias Maciel, Matheus Paixão Gumercindo dos Santos, Matheus Lafayette Vasconcelos, Matheus Vinícius Teotonio do Nascimento Andrade, João Guilherme Oliveira Carvalho de Melo, João Pedro Souza Pereira de Moura, José Ronald da Silva, José Victor Silva Cruz, Pedro Henrique Santana de Morais, Pedro Paulo Salman de Oliveira, Riei Joaquim Matos Rodrigues, Roberto Costa Fernandes, Ryan Vinicius Santos Morais, Tamara Mayara Ramos Teobaldo, Washington Igor dos Santos Silva, Edna Natividade Silva Barros

Abstract: RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-time Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) Division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Our team has successfully published 2 articles related to SSL at two high-impact conferences: the 25th RoboCup International Symposium and the 19th IEEE Latin American Robotics Symposium (LARS 2022). Over the last year, we have been continuously migrating from our past codebase to Unification. We will describe the new architecture implemented and some points of software and AI refactoring. In addition, we discuss the process of integrating machined components into the mechanical system, our development for participating in the vision blackout challenge last year and what we are preparing for this year.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2306.06221 [pdf, other]

Conformalizing Machine Translation Evaluation

Authors: Chrysoula Zerva, André F. T. Martins

Abstract: Several uncertainty estimation methods have been recently proposed for machine translation evaluation. While these methods can provide a useful indication of when not to trust model predictions, we show in this paper that the majority of them tend to underestimate model uncertainty, and as a result they often produce misleading confidence intervals that do not cover the ground truth. We propose as an alternative the use of conformal prediction, a distribution-free method to obtain confidence intervals with a theoretically established guarantee on coverage. First, we demonstrate that split conformal prediction can "correct" the confidence intervals of previous methods to yield a desired coverage level. Then, we highlight biases in estimated confidence intervals, both in terms of the translation language pairs and the quality of translations. We apply conditional conformal prediction techniques to obtain calibration subsets for each data subgroup, leading to equalized coverage.

Submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.19348 [pdf, ps, other]

doi 10.1103/PhysRevE.108.044113

Topologically-constrained fluctuations and thermodynamics regulate nonequilibrium response

Authors: Gabriela Fernandes Martins, Jordan M. Horowitz

Abstract: Limits on a system's response to external perturbations inform our understanding of how physical properties can be shaped by microscopic characteristics. Here, we derive constraints on the steady-state nonequilibrium response of physical observables in terms of the topology of the microscopic state space and the strength of thermodynamic driving. Notably, evaluation of these limits requires no kinetic information beyond the state-space structure. When applied to models of receptor binding, we find that sensitivity is bounded by the steepness of a Hill function with a Hill coefficient enhanced by the chemical driving beyond the structural equilibrium limit.

Submitted 26 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: 25 pages, 13 figures

Journal ref: Phys. Rev. E 108, 044113 (2023)

arXiv:2305.19144 [pdf, other]

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Authors: Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

Abstract: Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical errors, such as deviations in entities and numbers. In contrast, traditional evaluation metrics, such as BLEU or chrF, which measure lexical or character overlap between translation hypotheses and human references, have lower correlations with human judgements but are sensitive to such deviations. In this paper, we investigate several ways of combining the two approaches in order to increase robustness of state-of-the-art evaluation methods to translations with critical errors. We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena, which leads to gains in correlation with human judgments and on recent challenge sets on several language pairs.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted at EAMT 2023

arXiv:2305.17075 [pdf, other]

CREST: A Joint Framework for Rationalization and Counterfactual Text Generation

Authors: Marcos Treviso, Alexis Ross, Nuno M. Guerreiro, André F. T. Martins

Abstract: Selective rationales and counterfactual examples have emerged as two effective, complementary classes of interpretability methods for analyzing and training NLP models. However, prior work has not explored how these methods can be integrated to combine their complementary advantages. We overcome this limitation by introducing CREST (ContRastive Edits with Sparse raTionalization), a joint framework for selective rationalization and counterfactual text generation, and show that this framework leads to improvements in counterfactual quality, model robustness, and interpretability. First, CREST generates valid counterfactuals that are more natural than those produced by previous methods, and subsequently can be used for data augmentation at scale, reducing the need for human-generated examples. Second, we introduce a new loss function that leverages CREST counterfactuals to regularize selective rationales and show that this regularization improves both model robustness and rationale quality, compared to methods that do not leverage CREST counterfactuals. Our results demonstrate that CREST successfully bridges the gap between selective rationales and counterfactual examples, addressing the limitations of existing methods and providing a more comprehensive view of a model's predictions.

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023 (main)

arXiv:2305.13684 [pdf, other]

mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models

Authors: Peiqin Lin, Chengzhi Hu, Zheyu Zhang, André F. T. Martins, Hinrich Schütze

Abstract: Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

Submitted 5 July, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: EACL 2024 Findings

arXiv:2305.12182 [pdf, other]

doi 10.18653/v1/2023.acl-long.61

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Authors: Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

Submitted 26 May, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

Comments: ACL 2023

arXiv:2305.11806 [pdf, other]

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

Authors: Ricardo Rei, Nuno M. Guerreiro, Marcos Treviso, Luisa Coheur, Alon Lavie, André F. T. Martins

Abstract: Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, "black boxes" returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics.

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023

arXiv:2305.06376 [pdf, other]

doi 10.1051/0004-6361/202245650

X-Shooting ULLYSES: massive stars at low metallicity. I. Project Description

Authors: Jorick S. Vink, A. Mehner, P. A. Crowther, A. Fullerton, M. Garcia, F. Martins, N. Morrell, L. M. Oskinova, N. St-Louis, A. ud-Doula, A. A. C. Sander, H. Sana, J. -C. Bouret, B. Kubatova, P. Marchant, L. P. Martins, A. Wofford, J. Th. van Loon, O. Grace Telford, Y. Gotberg, D. M. Bowman, C. Erba, V. M. Kalari, M. Abdul-Masih, T. Alkousa , et al. (56 additional authors not shown)

Abstract: Observations of individual massive stars, super-luminous supernovae, gamma-ray bursts, and gravitational-wave events involving spectacular black-hole mergers, indicate that the low-metallicity Universe is fundamentally different from our own Galaxy. Many transient phenomena will remain enigmatic until we achieve a firm understanding of the physics and evolution of massive stars at low metallicity (Z). The Hubble Space Telescope has devoted 500 orbits to observe 250 massive stars at low Z in the ultraviolet (UV) with the COS and STIS spectrographs under the ULLYSES program. The complementary "X-Shooting ULLYSES" (XShootU) project provides enhanced legacy value with high-quality optical and near-infrared spectra obtained with the wide-wavelength coverage X-shooter spectrograph at ESO's Very Large Telescope. We present an overview of the XShootU project, showing that combining ULLYSES UV and XShootU optical spectra is critical for the uniform determination of stellar parameters such as effective temperature, surface gravity, luminosity, and abundances, as well as wind properties such as mass-loss rates as a function of Z. As uncertainties in stellar and wind parameters percolate into many adjacent areas of Astrophysics, the data and modelling of the XShootU project are expected to be a game-changer for our physical understanding of massive stars at low Z. To be able to confidently interpret James Webb Space Telescope (JWST) spectra of the first stellar generations, the individual spectra of low Z stars need to be understood, which is exactly where XShootU can deliver.

Submitted 1 June, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: Accepted in A&A - 35 Pages, 12 Figures, 4 Tables, 2 Large Tables

Journal ref: A&A 675, A154 (2023)

arXiv:2305.03182 [pdf, ps, other]

The Darboux-KP system as an integrable Chern-Simons multiform theory in infinite dimensional space

Authors: Joao Faria Martins, Frank W Nijhoff, Daniel Riccombeni

Abstract: In a previous paper by one of the authors, a Lagrangian 3-form structure was established for a generalised Darboux system, originally describing orthogonal curvilinear coordinate systems, which encodes the Kadomtsev-Petviashvili (KP) hierarchy. Here a hierarchy of Lagrangian multiforms is established for the same system, viewed as a hierarchy of Chern-Simons actions in an infinite-dimensional space of Miwa variables, constituting the variational form of a universal 3D integrable system embedded in this infinite-dimensional space.

Submitted 4 May, 2023; originally announced May 2023.

arXiv:2305.00955 [pdf, other]

Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation

Authors: Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. de Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, André F. T. Martins

Abstract: Many recent advances in natural language generation have been fueled by training large language models on internet-scale data. However, this paradigm can lead to models that generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation. First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.

Submitted 31 May, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: Work in Progress

arXiv:2304.13442 [pdf, other]

doi 10.1103/PhysRevApplied.21.044016

Multi-module microwave assembly for fast read-out and charge noise characterization of silicon quantum dots

Authors: Felix-Ekkehard von Horstig, David J. Ibberson, Giovanni A. Oakes, Laurence Cochrane, David F. Wise, Nadia Stelmashenko, Sylvain Barraud, Jason A. W. Robinson, Frederico Martins, M. Fernando Gonzalez-Zalba

Abstract: Fast measurement of quantum devices is important in areas such as quantum sensing, quantum computing and nanodevice quality analysis. Here, we develop a superconductor-semiconductor multi-module microwave assembly to demonstrate charge state readout at the state-of-the-art. The assembly consists of a superconducting readout resonator interfaced to a silicon-on-insulator (SOI) chiplet containing quantum dots (QDs) in a high-$κ$ nanowire transistor. The superconducting chiplet contains resonant and coupling elements as well as $LC$ filters that, when interfaced with the silicon chip, result in a resonant frequency $f=2.12$ GHz, a loaded quality factor $Q=850$, and a resonator impedance $Z=470$ $Ω$. Combined with the large gate lever arms of SOI technology, we achieve a minimum integration time for single and double QD transitions of 2.77 ns and 13.5 ns, respectively. We utilize the assembly to measure charge noise over 9 decades of frequency up to 500 kHz and find a 1/$f$ dependence across the whole frequency spectrum as well as a charge noise level of 4 $μ$eV/$\sqrt{\text{Hz}}$ at 1 Hz. The modular microwave circuitry presented here can be directly utilized in conjunction with other quantum devices to improve the readout performance as well as enable large bandwidth noise spectroscopy, all without the complexity of superconductor-semiconductor monolithic fabrication.

Submitted 2 May, 2024; v1 submitted 26 April, 2023; originally announced April 2023.

Comments: Main: 8 pages, 4 figures. Supplementary: 4 pages, 7 figures

arXiv:2304.08457 [pdf, other]

doi 10.1016/j.chaos.2023.113579

Deep Learning Criminal Networks

Authors: Haroldo V. Ribeiro, Diego D. Lopes, Arthur A. B. Pessa, Alvaro F. Martins, Bruno R. da Cunha, Sebastian Goncalves, Ervin K. Lenzi, Quentin S. Hanley, Matjaz Perc

Abstract: Recent advances in deep learning methods have enabled researchers to develop and apply algorithms for the analysis and modeling of complex networks. These advances have sparked a surge of interest at the interface between network science and machine learning. Despite this, the use of machine learning methods to investigate criminal networks remains surprisingly scarce. Here, we explore the potential of graph convolutional networks to learn patterns among networked criminals and to predict various properties of criminal networks. Using empirical data from political corruption, criminal police intelligence, and criminal financial networks, we develop a series of deep learning models based on the GraphSAGE framework that are able to recover missing criminal partnerships, distinguish among types of associations, predict the amount of money exchanged among criminal agents, and even anticipate partnerships and recidivism of criminals during the growth dynamics of corruption networks, all with impressive accuracy. Our deep learning models significantly outperform previous shallow learning approaches and produce high-quality embeddings for node and edge properties. Moreover, these models inherit all the advantages of the GraphSAGE framework, including the generalization to unseen nodes and scaling up to large graph structures.

Submitted 4 June, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 14 two-column pages, 5 figures

Journal ref: Chaos, Solitons & Fractals 172, 113579 (2023)

arXiv:2303.16104 [pdf, other]

Hallucinations in Large Multilingual Translation Models

Authors: Nuno M. Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, André F. T. Martins

Abstract: Large-scale multilingual machine translation systems have demonstrated remarkable ability to translate directly between numerous languages, making them increasingly appealing for real-world applications. However, when deployed in the wild, these models may generate hallucinated translations which have the potential to severely undermine user trust and raise safety concerns. Existing research on hallucinations has primarily focused on small bilingual models trained on high-resource languages, leaving a gap in our understanding of hallucinations in massively multilingual models across diverse translation scenarios. In this work, we fill this gap by conducting a comprehensive analysis on both the M2M family of conventional neural machine translation models and ChatGPT, a general-purpose large language model (LLM) that can be prompted for translation. Our investigation covers a broad spectrum of conditions, spanning over 100 translation directions across various resource levels and going beyond English-centric language pairs. We provide key insights regarding the prevalence, properties, and mitigation of hallucinations, paving the way towards more responsible and reliable machine translation systems.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2303.00121 [pdf, other]

doi 10.1088/1748-0221/18/06/P06023

Laser calibration of the ATLAS Tile Calorimeter during LHC Run 2

Authors: M. N. Agaras, A. Ahmad, A. Blanco, D. Boumediene, R. Bonnefoy, D. Calvet, M. Calvetti, R. Chadelas, P. Conde Muino, A. Cortes Gonzalez, M. Crouau, C. Crozatier, F. Daudon, T. Davidek, G. Di Gregorio, L. Fiorini, B. Galhardo, Ph. Gris, P. Klimek, P. Lafarguette, D. Lambert, S. Leone, A. Maio, M. Marjanovic, F. Martins , et al. (15 additional authors not shown)

Abstract: This article reports the laser calibration of the hadronic Tile Calorimeter of the ATLAS experiment in the LHC Run 2 data campaign. The upgraded Laser II calibration system is described. The system was commissioned during the first LHC Long Shutdown, exhibiting a stability better than 0.8% for the laser light monitoring. The methods employed to derive the detector calibration factors with data from the laser calibration runs are also detailed. These made it possible to correct for the response fluctuations of the 9852 photomultiplier tubes of the Tile Calorimeter with a total uncertainty of 0.5% plus a luminosity-dependent sub-dominant term. Finally, we report the regular monitoring and performance studies using laser events in both standalone runs and during proton collisions. These studies include channel timing and quality inspection, and photomultiplier linearity and response dependence on anode current.

Submitted 5 July, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

Journal ref: JINST 18 (2023) 06, P06023

arXiv:2301.07473 [pdf, other]

Discrete Latent Structure in Neural Networks

Authors: Vlad Niculae, Caio F. Corro, Nikita Nangia, Tsvetomila Mihaylova, André F. T. Martins

Abstract: Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.

Submitted 18 January, 2023; originally announced January 2023.

ACM Class: I.2.6

arXiv:2301.04672 [pdf, other]

doi 10.1051/0004-6361/202345895

Clues on the presence and segregation of very massive stars in the Sunburst Lyman-continuum cluster at z=2.37

Authors: U. Mestric, E. Vanzella, A. Upadhyaya, F. Martins, R. Marques-Chaves, D. Schaerer, J. Guibert, A. Zanella, C. Grillo, P. Rosati, F. Calura, G. B. Caminha, A. Bolamperti, M. Meneghetti, P. Bergamini, A. Mercurio, M. Nonino, R. Pascale

Abstract: We report the identification of very massive stars (VMS; mass $> 100\,M_\odot$) that may be segregated in the center of the young massive star cluster at $z=2.37$ hosted in the lensed galaxy called the Sunburst galaxy. This result is based on two pieces of evidence: (1) VLT/MUSE spectra of several multiple images of the same star cluster show key spectral signatures of VMS, such as the He II broad emission, N IV emission, and an N IV P-Cygni profile. In particular, He II is broad ($\sim1610\pm300$ km s$^{-1}$), with an equivalent width of 3 Å and an asymmetric profile. These features require an extremely young ($\sim2.5$ Myr) stellar population component in which the masses of the stars exceed 100 $M_\odot$. When a Salpeter initial mass function and BPASS models for normal massive stars are assumed, the observed spectral features require $\sim$400 VMS. (2) The same star cluster is detected at a signal-to-noise ratio of $\sim100$ in the Lyman continuum domain ($λ< 900$ Å). The Lyman continuum emission emerges from a region with a radius that is at least twice smaller than what is observed at 1700 Å (independently of magnification) and is located in the center of the cluster. After delensing, the effective radii in absolute scales are $R_{\rm eff}[{\rm LyC}]\sim4.7 \pm 1.5$ pc and $R_{\rm eff}[1700]= 7.8 \pm 1.4$ pc. The Lyman continuum radiation is mainly produced by hot and massive stars, which implies that their spatial distribution (including that of VMS) is preferentially more confined in the central parts of the cluster. Approximately 400 VMS hosted by a cluster of $\sim 10^7$ $M_\odot$ produce $\sim$15% of the escaping Lyman continuum photons, and the remaining photons are produced by other massive early-type stars.

Submitted 22 March, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

Comments: 10 pages, 8 figures, Accepted for publication in A&A

Journal ref: A&A 673, A50 (2023)

Showing 1–50 of 347 results for author: Martins, F