-
Measuring Cross-lingual Transfer in Bytes
Authors:
Leandro Rodrigues de Souza,
Thales Sales Almeida,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamina…
▽ More
Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Sabiá-2: A New Generation of Portuguese Large Language Models
Authors:
Thales Sales Almeida,
Hugo Abonizio,
Rodrigo Nogueira,
Ramon Pires
Abstract:
We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Mediu…
▽ More
We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Medium, matches or surpasses GPT-4's performance in 23 out of 64 exams and outperforms GPT-3.5 in 58 out of 64 exams. Notably, specialization has a significant impact on a model's performance without the need to increase its size, allowing us to offer Sabiá-2 Medium at a price per token that is 10 times cheaper than GPT-4. Finally, we identified that math and coding are key abilities that need improvement.
△ Less
Submitted 26 March, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams
Authors:
Ramon Pires,
Thales Sales Almeida,
Hugo Abonizio,
Rodrigo Nogueira
Abstract:
Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance e…
▽ More
Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance exams, which incorporates both textual and visual elements. We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms the capabilities of GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers in offering a realistic assessment of multimodal language models on Portuguese examinations. One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used on experiments are available at https://github.com/piresramon/gpt-4-enem.
△ Less
Submitted 23 November, 2023;
originally announced November 2023.
-
BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams
Authors:
Thales Sales Almeida,
Thiago Laitz,
Giovana K. Bonás,
Rodrigo Nogueira
Abstract:
One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazi…
▽ More
One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX
△ Less
Submitted 11 July, 2023;
originally announced July 2023.
-
Sabiá: Portuguese Large Language Models
Authors:
Ramon Pires,
Hugo Abonizio,
Thales Sales Almeida,
Rodrigo Nogueira
Abstract:
As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demo…
▽ More
As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
△ Less
Submitted 9 November, 2023; v1 submitted 16 April, 2023;
originally announced April 2023.
-
NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost
Authors:
Thales Sales Almeida,
Thiago Laitz,
João Seródio,
Luiz Henrique Bonifacio,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this w…
▽ More
The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
Mining Artifacts in Mycelium SEM Micrographs
Authors:
Thaicia Stona de Almeida
Abstract:
Mycelium is a promising biomaterial based on fungal mycelium, a highly porous, nanofibrous structure. Scanning electron micrographs are used to characterize its network, but the currently available tools for nanofibrous microstructures do not contemplate the particularities of biomaterials. The adoption of a software for artificial nanofibrous in mycelium characterization adds the uncertainty of i…
▽ More
Mycelium is a promising biomaterial based on fungal mycelium, a highly porous, nanofibrous structure. Scanning electron micrographs are used to characterize its network, but the currently available tools for nanofibrous microstructures do not contemplate the particularities of biomaterials. The adoption of a software for artificial nanofibrous in mycelium characterization adds the uncertainty of imaging artifact formation to the analysis. The reported work combines supervised and unsupervised machine learning methods to automate the identification of artifacts in the mapped pores of mycelium microstructure.
Keywords: Machine learning; unsupervised learning; image processing; mycelium; microstructure informatics
△ Less
Submitted 12 March, 2021;
originally announced March 2021.