Showing 1–20 of 20 results for author: Armengol-Estapé, J

  1. arXiv:2405.18751  [pdf, other]

    cs.CV cs.AI

    On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization

    Authors: Jordi Armengol-Estapé, Vincent Michalski, Ramnath Kumar, Pierre-Luc St-Charles, Doina Precup, Samira Ebrahimi Kahou

    Abstract: Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists o…

    Submitted 30 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2404.16041  [pdf, other]

    cs.PL cs.AI cs.LG

    Forklift: An Extensible Neural Lifter

    Authors: Jordi Armengol-Estapé, Rodrigo C. O. Rocha, Jackson Woodruff, Pasquale Minervini, Michael F. P. O'Boyle

    Abstract: The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages. However, the development of these tools requires substantial engineering effort. State-of-the-art approaches use lifting, a technique where source assembly code is translated to an a…

    Submitted 1 April, 2024; originally announced April 2024.

  3. arXiv:2305.12520  [pdf, other]

    cs.PL cs.AI

    SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly

    Authors: Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, Michael F. P. O'Boyle

    Abstract: Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code.…

    Submitted 15 February, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

  4. Matching Linear Algebra and Tensor Code to Specialized Hardware Accelerators

    Authors: Pablo Antonio Martínez, Jackson Woodruff, Jordi Armengol-Estapé, Gregorio Bernabé, José Manuel García, Michael F. P. O'Boyle

    Abstract: Dedicated tensor accelerators demonstrate the importance of linear algebra in modern applications. Such accelerators have the potential for impressive performance gains, but require programmers to rewrite code using vendor APIs - a barrier to wider scale adoption. Recent work overcomes this by matching and replacing patterns within code, but such approaches are fragile and fail to cope with the di…

    Submitted 31 January, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (CC '23), February 25-26, 2023, Montréal, QC, Canada, https://doi.org/10.1145/3578360.3580262

    Journal ref: In Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (CC '23), February 25-26, 2023, Montréal, QC, Canada

  5. arXiv:2206.15147  [pdf, ps, other]

    cs.CL cs.AI

    esCorpius: A Massive Spanish Crawling Corpus

    Authors: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

    Abstract: In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish…

    Submitted 1 July, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: esCorpius is available on https://huggingface.co/datasets/LHF/escorpius

  6. arXiv:2202.06871  [pdf, ps, other]

    cs.CL cs.AI

    Sequence-to-Sequence Resources for Catalan

    Authors: Ona de Gibert, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, Maite Melero

    Abstract: In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, towards two tasks, namely: Summarization and Machine Translation (MT). We present two new abstractive summarization datasets in the domain of newswire. We also introduce a parallel Catalan-English corpus, paired with three different brand new test sets. Finally, we evaluate the da…

    Submitted 14 February, 2022; originally announced February 2022.

  7. arXiv:2112.05404  [pdf, other]

    cs.CV cs.LG

    The Large Labelled Logo Dataset (L3D): A Multipurpose and Hand-Labelled Continuously Growing Dataset

    Authors: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé

    Abstract: In this work, we present the Large Labelled Logo Dataset (L3D), a multipurpose, hand-labelled, continuously growing dataset. It is composed of around 770k color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) open registry. Each of them is associated with multiple labels that classify the figurative and textual elements that appear in the images. These an…

    Submitted 10 December, 2021; originally announced December 2021.

  8. arXiv:2111.00526  [pdf, other]

    cs.CL q-fin.CP q-fin.PM

    FinEAS: Financial Embedding Analysis of Sentiment

    Authors: Asier Gutiérrez-Fandiño, Miquel Noguer i Alonso, Petter Kolm, Jordi Armengol-Estapé

    Abstract: We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In financial markets, news and investor sentiment are significant drivers of security prices. Thus, leveraging the capabilities of modern NLP approaches for financial sentiment analysis is a crucial component in identifying patterns and trends that are useful for market participan…

    Submitted 19 November, 2021; v1 submitted 31 October, 2021; originally announced November 2021.

  9. arXiv:2110.12201  [pdf, ps, other]

    cs.CL cs.AI

    Spanish Legalese Language Model and Corpora

    Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre, Marta Villegas

    Abstract: There are many Language Models for the English language owing to its worldwide relevance. However, for the Spanish language, even though it is widely spoken, there are very few Spanish Language Models, and these turn out to be small and too general. Legal slang could be thought of as a Spanish variant of its own, as it is very complicated in vocabulary, semantics and phrase understanding. For this w…

    Submitted 23 October, 2021; originally announced October 2021.

  10. arXiv:2109.07765  [pdf, ps, other]

    cs.CL

    Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

    Authors: Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet, Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta Villegas

    Abstract: We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawler on 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been emplo…

    Submitted 16 September, 2021; originally announced September 2021.

  11. arXiv:2109.03570  [pdf, other]

    cs.CL

    Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

    Authors: Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas

    Abstract: This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross…

    Submitted 17 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

    Comments: 9 pages

  12. arXiv:2108.13349  [pdf, other]

    cs.CL cs.AI

    On the Multilingual Capabilities of Very Large-Scale English Language Models

    Authors: Jordi Armengol-Estapé, Ona de Gibert Bonet, Maite Melero

    Abstract: Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning. These models, solely trained on the language modeling objective, have been shown to exhibit outstanding few-shot learning capabilities in a number of different tasks. Nevertheless, aside from anecdotal experiences, little is known regarding their multilingual capabilities…

    Submitted 30 August, 2021; originally announced August 2021.

  13. arXiv:2108.07639  [pdf, ps, other]

    cs.AI cs.PL

    Learning C to x86 Translation: An Experiment in Neural Compilation

    Authors: Jordi Armengol-Estapé, Michael F. P. O'Boyle

    Abstract: Deep learning has had a significant impact on many fields. Recently, code-to-code neural models have been used in code translation, code refinement and decompilation. However, the question of whether these models can automate compilation has yet to be investigated. In this work, we explore neural compilation, building and evaluating Transformer models that learn how to produce x86 assembler from C…

    Submitted 16 December, 2022; v1 submitted 17 August, 2021; originally announced August 2021.

    Comments: Published in AIPLANS 2021

    Journal ref: Armengol-Estapé, J. and O'Boyle, M. Learning C to x86 translation: An experiment in neural compilation. In Advances in Programming Languages and Neurosymbolic Systems Workshop, 2021. URL: https://openreview.net/forum?id=444ug_EYXet

  14. arXiv:2107.07903  [pdf, other]

    cs.CL

    Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan

    Authors: Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas

    Abstract: Multilingual language models have been a crucial breakthrough as they considerably reduce the need for data for under-resourced languages. Nevertheless, the superiority of language-specific models has already been proven for languages having access to large amounts of data. In this work, we focus on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competit…

    Submitted 16 July, 2021; originally announced July 2021.

    Comments: Accepted into Findings of ACL-IJCNLP 2021

  15. MarIA: Spanish Language Models

    Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas

    Abstract: This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB…

    Submitted 5 April, 2022; v1 submitted 15 July, 2021; originally announced July 2021.

    Journal ref: Procesamiento del Lenguaje Natural, v. 68, p. 39-60, mar. 2022. ISSN 1989-7553

  16. arXiv:2106.00012  [pdf, other]

    cs.LG cs.AI math.AT

    Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set

    Authors: Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, Marta Villegas

    Abstract: The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model. This is done instead of measuring intrinsic properties of the model to determine whether it is learning appropriately. In this work, we suggest studying the training of neural networks with Algebraic Topology, specifically Persistent Homology (PH). Using simplicial comp…

    Submitted 31 May, 2021; originally announced June 2021.

  17. arXiv:2102.12843  [pdf, ps, other]

    cs.CL cs.AI

    Spanish Biomedical and Clinical Language Embeddings

    Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona De Gibert, Aitor Gonzalez-Agirre, Marta Villegas

    Abstract: We computed both Word and Sub-word Embeddings using FastText. For Sub-word embeddings we selected the Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the Biomedical Word Embeddings and obtained better results than previous versions, which suggests that more data yields better representations.

    Submitted 25 February, 2021; originally announced February 2021.

  18. arXiv:2101.07752  [pdf, other]

    cs.LG math.AT

    Characterizing and Measuring the Similarity of Neural Networks with Persistent Homology

    Authors: David Pérez-Fernández, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

    Abstract: Characterizing the structural properties of neural networks is crucial yet poorly understood, and there are no well-established similarity measures between networks. In this work, we observe that neural networks can be represented as abstract simplicial complexes and analyzed using their topological 'fingerprints' via Persistent Homology (PH). We then describe a PH-based representation proposed for…

    Submitted 31 May, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

  19. arXiv:2012.11699  [pdf, other]

    cs.CR cs.SI

    A Vulnerability Study on Academic Collaboration Networks Based on Network Dynamics

    Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

    Abstract: Researchers who work for the same institution use their email as the main communication tool. Email can be one of the most fruitful attack vectors of research institutions, as email accounts also give access to all other accounts and thus to all private information. We propose an approach for analyzing research institutions' communication networks in terms of security. We first obtained institutions' communica…

    Submitted 31 March, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

  20. arXiv:2004.08053  [pdf, other]

    cs.CL

    Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation

    Authors: Jordi Armengol-Estapé, Marta R. Costa-jussà, Carlos Escolano

    Abstract: Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it allows to introduce external…

    Submitted 24 December, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

    ACM Class: I.2.7