Search | arXiv e-print repository

Spanish Pre-trained BERT Model and Evaluation Data

Authors: José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez

Abstract: The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Submitted 5 August, 2023; originally announced August 2023.

Comments: Published as workshop paper at Practical ML for Developing Countries Workshop @ ICLR 2020

arXiv:2110.10780 [pdf]

An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)

Authors: Sijia Liu, Andrew Wen, Liwei Wang, Huan He, Sunyang Fu, Robert Miller, Andrew Williams, Daniel Harris, Ramakanth Kavuluru, Mei Liu, Noor Abu-el-rub, Dalton Schutte, Rui Zhang, Masoud Rouhizadeh, John D. Osborne, Yongqun He, Umit Topaloglu, Stephanie S Hong, Joel H Saltz, Thomas Schaffter, Emily Pfaff, Christopher G. Chute, Tim Duong, Melissa A. Haendel, Rafael Fuentes, et al. (7 additional authors not shown)

Abstract: While we pay attention to the latest advances in clinical natural language processing (NLP), we can notice some resistance in the clinical and translational research community to adopt NLP models due to limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework. We evaluated it through the implementation of NLP algorithms for the National COVID Cohort Collaborative (N3C). Based on the interests in information extraction from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset. This resulted in performances of 0.876, 0.706, and 0.694 in F-scores for Mayo, Minnesota, and Kentucky test datasets, respectively. The study as a consortium effort of the N3C NLP subgroup demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as a use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.

Submitted 21 March, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

Comments: update on contents

arXiv:2008.01013 [pdf, other]

doi 10.1109/IJCB48548.2020.9304876

Swipe dynamics as a means of authentication: results from a Bayesian unsupervised approach

Authors: Parker Lamb, Alexander Millar, Ramon Fuentes

Abstract: The field of behavioural biometrics stands as an appealing alternative to more traditional biometric systems due to the ease of use from a user perspective and potential robustness to presentation attacks. This paper focuses its attention to a specific type of behavioural biometric utilising swipe dynamics, also referred to as touch gestures. In touch gesture authentication, a user swipes across the touchscreen of a mobile device to perform an authentication attempt. A key characteristic of touch gesture authentication and new behavioural biometrics in general is the lack of available data to train and validate models. From a machine learning perspective, this presents the classic curse of dimensionality problem and the methodology presented here focuses on Bayesian unsupervised models as they are well suited to such conditions. This paper presents results from a set of experiments consisting of 38 sessions with labelled victim as well as blind and over-the-shoulder presentation attacks. Three models are compared using this dataset; two single-mode models: a shrunk covariance estimate and a Bayesian Gaussian distribution, as well as a Bayesian non-parametric infinite mixture of Gaussians, modelled as a Dirichlet Process. Equal error rates (EER) for the three models are compared and attention is paid to how these vary across the two single-mode models at differing numbers of enrolment samples.

Submitted 13 October, 2020; v1 submitted 27 July, 2020; originally announced August 2020.

Comments: 9 pages, 7 figures; Layout and editing improved

arXiv:1810.07483 [pdf, other]

doi 10.3389/frobt.2021.686368

O2A: One-shot Observational learning with Action vectors

Authors: Leo Pauly, Wisdom C. Agboh, David C. Hogg, Raul Fuentes

Abstract: We present O2A, a novel method for learning to perform robotic manipulation tasks from a single (one-shot) third-person demonstration video. To our knowledge, it is the first time this has been done for a single demonstration. The key novelty lies in pre-training a feature extractor for creating a perceptual representation for actions that we call 'action vectors'. The action vectors are extracted using a 3D-CNN model pre-trained as an action classifier on a generic action dataset. The distance between the action vectors from the observed third-person demonstration and trial robot executions is used as a reward for reinforcement learning of the demonstrated task. We report on experiments in simulation and on a real robot, with changes in viewpoint of observation, properties of the objects involved, scene background and morphology of the manipulator between the demonstration and the learning domains. O2A outperforms baseline approaches under different domain shifts and has comparable performance with an oracle (that uses an ideal reward function).

Submitted 21 December, 2020; v1 submitted 17 October, 2018; originally announced October 2018.

Journal ref: Front. Robot. AI 8:686368 (2021)

arXiv:1802.07490 [pdf, other]

ViTac: Feature Sharing between Vision and Tactile Sensing for Cloth Texture Recognition

Authors: Shan Luo, Wenzhen Yuan, Edward Adelson, Anthony G. Cohn, Raul Fuentes

Abstract: Vision and touch are two of the important sensing modalities for humans and they offer complementary information for sensing the environment. Robots could also benefit from such multi-modal sensing ability. In this paper, addressing for the first time (to the best of our knowledge) texture recognition from tactile images and vision, we propose a new fusion method named Deep Maximum Covariance Analysis (DMCA) to learn a joint latent space for sharing features through vision and tactile sensing. The features of camera images and tactile data acquired from a GelSight sensor are learned by deep neural networks. But the learned features are of a high dimensionality and are redundant due to the differences between the two sensing modalities, which deteriorates the perception performance. To address this, the learned features are paired using maximum covariance analysis. Results of the algorithm on a newly collected dataset of paired visual and tactile data relating to cloth textures show that a good recognition performance of greater than 90% can be achieved by using the proposed DMCA framework. In addition, we find that the perception performance of either vision or tactile sensing can be improved by employing the shared representation space, compared to learning from unimodal data.

Submitted 13 March, 2018; v1 submitted 21 February, 2018; originally announced February 2018.

Comments: 6 pages, 5 figures, Accepted for 2018 IEEE International Conference on Robotics and Automation

Showing 1–5 of 5 results for author: Fuentes, R