-
Spanish Pre-trained BERT Model and Evaluation Data
Authors:
José Cañete,
Gabriel Chaperon,
Rodrigo Fuentes,
Jou-Hui Ho,
Hojin Kang,
Jorge Pérez
Abstract:
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
Submitted 5 August, 2023;
originally announced August 2023.
-
An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)
Authors:
Sijia Liu,
Andrew Wen,
Liwei Wang,
Huan He,
Sunyang Fu,
Robert Miller,
Andrew Williams,
Daniel Harris,
Ramakanth Kavuluru,
Mei Liu,
Noor Abu-el-rub,
Dalton Schutte,
Rui Zhang,
Masoud Rouhizadeh,
John D. Osborne,
Yongqun He,
Umit Topaloglu,
Stephanie S Hong,
Joel H Saltz,
Thomas Schaffter,
Emily Pfaff,
Christopher G. Chute,
Tim Duong,
Melissa A. Haendel,
Rafael Fuentes,
et al. (7 additional authors not shown)
Abstract:
Despite recent advances in clinical natural language processing (NLP), the clinical and translational research community remains reluctant to adopt NLP models because of their limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework and evaluated it by implementing NLP algorithms for the National COVID Cohort Collaborative (N3C). Motivated by interest in information extraction from COVID-19-related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset, yielding F-scores of 0.876, 0.706, and 0.694 on the Mayo, Minnesota, and Kentucky test datasets, respectively. As a consortium effort of the N3C NLP subgroup, the study demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as the use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.
Submitted 21 March, 2022; v1 submitted 20 October, 2021;
originally announced October 2021.
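The abstract above reports F-scores per test dataset but does not say how they were computed. As a rough illustration only, the sketch below assumes the common balanced F1 measure with exact-match scoring over annotation tuples; the function name `f1_score` and the `(document, span, label)` tuple layout are hypothetical and not taken from the paper.

```python
def f1_score(gold, predicted):
    """Balanced F-score (F1) over two collections of annotations.

    Assumption: an annotation counts as correct only if it matches a
    gold annotation exactly. The paper may use a different matching rule.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (document id, character span, symptom label).
gold = {("d1", (0, 5), "fever"), ("d1", (10, 15), "cough"),
        ("d2", (3, 9), "fatigue"), ("d2", (20, 28), "headache")}
predicted = {("d1", (0, 5), "fever"), ("d1", (10, 15), "cough"),
             ("d2", (3, 9), "fatigue"), ("d2", (30, 36), "nausea")}
print(f1_score(gold, predicted))
```

Under this assumption, a ruleset that performs well on its home institution but misses phrasing used elsewhere would show the kind of gap between datasets that the abstract reports.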
-
Swipe dynamics as a means of authentication: results from a Bayesian unsupervised approach
Authors:
Parker Lamb,
Alexander Millar,
Ramon Fuentes
Abstract:
The field of behavioural biometrics stands as an appealing alternative to more traditional biometric systems due to its ease of use from a user perspective and its potential robustness to presentation attacks. This paper focuses on a specific type of behavioural biometric that utilises swipe dynamics, also referred to as touch gestures. In touch gesture authentication, a user swipes across the touchscreen of a mobile device to perform an authentication attempt. A key characteristic of touch gesture authentication, and of new behavioural biometrics in general, is the lack of available data to train and validate models. From a machine learning perspective, this presents the classic curse of dimensionality problem, and the methodology presented here focuses on Bayesian unsupervised models because they are well suited to such conditions. This paper presents results from a set of experiments consisting of 38 sessions with labelled victim as well as blind and over-the-shoulder presentation attacks. Three models are compared using this dataset: two single-mode models (a shrunk covariance estimate and a Bayesian Gaussian distribution) and a Bayesian non-parametric infinite mixture of Gaussians, modelled as a Dirichlet Process. Equal error rates (EER) for the three models are compared, and attention is paid to how these vary across the two single-mode models at differing numbers of enrolment samples.
Submitted 13 October, 2020; v1 submitted 27 July, 2020;
originally announced August 2020.
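The abstract above compares models by equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is one simple way to estimate it from two lists of scores; it assumes higher scores mean "more likely the genuine user", and the function name and threshold sweep are illustrative choices, not the authors' procedure.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Assumption: an attempt is accepted when its score >= threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: impostor attempts that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine attempts that fail the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine, impostor))
```

With few enrolment samples the score distributions are estimated poorly, which is why the abstract tracks how the EER changes as the number of enrolment samples grows.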
-
O2A: One-shot Observational learning with Action vectors
Authors:
Leo Pauly,
Wisdom C. Agboh,
David C. Hogg,
Raul Fuentes
Abstract:
We present O2A, a novel method for learning to perform robotic manipulation tasks from a single (one-shot) third-person demonstration video. To our knowledge, it is the first time this has been done for a single demonstration. The key novelty lies in pre-training a feature extractor for creating a perceptual representation for actions that we call 'action vectors'. The action vectors are extracted using a 3D-CNN model pre-trained as an action classifier on a generic action dataset. The distance between the action vectors from the observed third-person demonstration and trial robot executions is used as a reward for reinforcement learning of the demonstrated task. We report on experiments in simulation and on a real robot, with changes in viewpoint of observation, properties of the objects involved, scene background and morphology of the manipulator between the demonstration and the learning domains. O2A outperforms baseline approaches under different domain shifts and has comparable performance with an oracle (that uses an ideal reward function).
Submitted 21 December, 2020; v1 submitted 17 October, 2018;
originally announced October 2018.
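The abstract above states that the distance between the action vector of the demonstration and that of a robot trial serves as the reinforcement learning reward. A minimal sketch of that idea follows, assuming the reward is the negative Euclidean distance so that closer trials score higher; the distance metric, the sign convention, and the function name are assumptions, and the vectors here are made-up stand-ins for 3D-CNN features.

```python
import math


def action_vector_reward(demo_vector, trial_vector):
    """Reward a trial by how close its action vector is to the demonstration's.

    Assumption: negative Euclidean distance, so a perfect match gives 0 and
    larger mismatches give more negative rewards.
    """
    if len(demo_vector) != len(trial_vector):
        raise ValueError("action vectors must have the same dimensionality")
    return -math.dist(demo_vector, trial_vector)


# Hypothetical low-dimensional action vectors, for illustration only.
demo = [0.0, 0.0]
good_trial = [0.3, 0.4]
poor_trial = [3.0, 4.0]
print(action_vector_reward(demo, good_trial))
print(action_vector_reward(demo, poor_trial))
```

Because the reward depends only on the two feature vectors, it can be computed from video alone, without access to the demonstrator's actions or the robot's internal state.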
-
ViTac: Feature Sharing between Vision and Tactile Sensing for Cloth Texture Recognition
Authors:
Shan Luo,
Wenzhen Yuan,
Edward Adelson,
Anthony G. Cohn,
Raul Fuentes
Abstract:
Vision and touch are two important sensing modalities for humans, and they offer complementary information for sensing the environment. Robots could also benefit from such multi-modal sensing ability. In this paper, addressing for the first time (to the best of our knowledge) texture recognition from tactile images and vision, we propose a new fusion method named Deep Maximum Covariance Analysis (DMCA) to learn a joint latent space for sharing features through vision and tactile sensing. The features of camera images and tactile data acquired from a GelSight sensor are learned by deep neural networks. However, the learned features are high-dimensional and redundant due to the differences between the two sensing modalities, which degrades perception performance. To address this, the learned features are paired using maximum covariance analysis. Results of the algorithm on a newly collected dataset of paired visual and tactile data relating to cloth textures show that a recognition performance of greater than 90% can be achieved by using the proposed DMCA framework. In addition, we find that the perception performance of either vision or tactile sensing can be improved by employing the shared representation space, compared to learning from unimodal data.
Submitted 13 March, 2018; v1 submitted 21 February, 2018;
originally announced February 2018.
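The abstract above pairs learned visual and tactile features using maximum covariance analysis. The sketch below shows only the classical linear version of that step, computed from the singular value decomposition of the cross-covariance matrix; it is not the DMCA method itself, it omits the deep networks entirely, and the random arrays stand in for learned features. All names and dimensions are hypothetical.

```python
import numpy as np


def maximum_covariance_analysis(X, Y, k):
    """Project paired samples X (n x p) and Y (n x q) onto k shared directions.

    The directions are the leading singular vectors of the cross-covariance
    matrix, so each projected pair has the largest covariance attainable.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must contain the same number of paired samples")
    Xc = X - X.mean(axis=0)  # centre each modality
    Yc = Y - Y.mean(axis=0)
    cross_cov = Xc.T @ Yc / (X.shape[0] - 1)
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    Wx, Wy = U[:, :k], Vt[:k].T  # projection matrices for X and Y
    return Xc @ Wx, Yc @ Wy, s[:k]


# Synthetic stand-ins for visual and tactile features sharing a 2-D signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 2))
vision = shared @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
touch = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))
z_vision, z_touch, singular_values = maximum_covariance_analysis(vision, touch, k=2)
print(z_vision.shape, z_touch.shape)
```

The covariance between the first pair of projected columns equals the largest singular value, which is what makes the low-dimensional space a compact summary of what the two modalities have in common.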