-
A Multilingual Neural Machine Translation Model for Biomedical Data
Authors:
Alexandre Bérard,
Zae Myung Kim,
Vassilina Nikoulina,
Eunjeong L. Park,
Matthias Gallé
Abstract:
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms the existing publicly released models. We believe that this release will help the large-scale multilingual analysis of the digital content of the COVID-19 crisis and of its effects on society, economy, and healthcare policies.
We also release a test set of biomedical text for Korean-English. It consists of 758 sentences from official guidelines and recent papers, all about COVID-19.
Submitted 6 August, 2020;
originally announced August 2020.
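The abstract above mentions training with domain tags but does not say how they are applied. One common way to do this is to prepend a special token to each source sentence. The sketch below illustrates that idea only; the tag strings and the `add_domain_tag` helper are hypothetical and are not taken from the released model.

```python
# Hypothetical tag strings: the abstract does not specify the tag vocabulary.
DOMAIN_TAGS = {"generic": "<generic>", "biomedical": "<medical>"}


def add_domain_tag(source_sentence: str, domain: str) -> str:
    """Prepend a domain tag token to a source sentence (illustrative only)."""
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain: {domain!r}")
    # Strip surrounding whitespace so the tag is followed by exactly one space.
    return f"{DOMAIN_TAGS[domain]} {source_sentence.strip()}"


if __name__ == "__main__":
    print(add_domain_tag("Le patient a de la fièvre.", "biomedical"))
```

Under this assumption, choosing a tag at inference time would steer the same multilingual model toward generic or biomedical output.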
-
Revisiting Round-Trip Translation for Quality Estimation
Authors:
Jihyung Moon,
Hyunchang Cho,
Eunjeong L. Park
Abstract:
Quality estimation (QE) is the task of automatically evaluating the quality of translations without human-translated references. Calculating BLEU between the input sentence and its round-trip translation (RTT) was once considered as a metric for QE; however, it was found to be a poor predictor of translation quality. Recently, various pre-trained language models have made breakthroughs in NLP tasks by providing semantically meaningful word and sentence embeddings. In this paper, we apply semantic embeddings to RTT-based QE. Our method achieves the highest correlations with human judgments, compared to previous WMT 2019 quality estimation metric task submissions. While backward translation models can be a drawback when using RTT, we observe that with semantic-level metrics, RTT-based QE is robust to the choice of the backward translation system. Additionally, the proposed method shows consistent performance for both SMT and NMT forward translation systems, implying the method does not penalize a certain type of model.
Submitted 28 April, 2020;
originally announced April 2020.
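The RTT-based QE idea in the abstract above can be sketched as follows: translate the source forward, translate the result back, and compare the source with the round trip in an embedding space. This is a minimal sketch, not the authors' implementation. The function names are hypothetical, and `toy_embed` is a bag-of-words stand-in that is not semantic; a real setup would substitute a pre-trained sentence encoder and real translation systems.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in embedding: lowercase bag-of-words counts (NOT a semantic model)."""
    return dict(Counter(sentence.lower().split()))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rtt_qe_score(
    source: str,
    forward: Callable[[str], str],   # source language -> target language
    backward: Callable[[str], str],  # target language -> source language
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    """Score a forward translation by comparing the source with its round trip."""
    round_trip = backward(forward(source))
    return cosine(embed(source), embed(round_trip))


if __name__ == "__main__":
    # Toy "translators" used only to exercise the pipeline.
    def fake_forward(s: str) -> str:
        return s.upper()

    def lossy_backward(s: str) -> str:
        return " ".join(s.lower().split()[:-1])  # drops the last word

    print(rtt_qe_score("the cat sat", fake_forward, lossy_backward))
```

Because the score depends only on the `embed` callable, swapping the surface-level comparison for a semantic one does not change the rest of the pipeline.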
-
Deep Patent Landscaping Model Using Transformer and Graph Embedding
Authors:
Seokkyu Choi,
Hyeonju Lee,
Eunjeong Lucy Park,
Sungchul Choi
Abstract:
Patent landscaping is a method used for searching related patents during a research and development (R&D) project. To avoid the risk of patent infringement and to follow current trends in technology, patent landscaping is a crucial task required during the early stages of an R&D project. As the process of patent landscaping requires advanced resources and can be tedious, the demand for automated patent landscaping has been gradually increasing. However, a shortage of well-defined benchmark datasets and comparable models makes it difficult to find related research studies. In this paper, we propose an automated patent landscaping model based on deep learning. To analyze the text of patents, the proposed model uses a modified transformer structure. To analyze the metadata of patents, we propose a graph embedding method that uses a diffusion graph called Diff2Vec. Furthermore, we introduce four benchmark datasets for comparing related research studies in patent landscaping. The datasets are produced by querying Google BigQuery, based on a search formula from a Korean patent attorney. The obtained results indicate that the proposed model and datasets can attain state-of-the-art performance, as compared with current patent landscaping models.
Submitted 21 November, 2019; v1 submitted 14 March, 2019;
originally announced March 2019.
-
Hybrid Machine Learning Approach to Popularity Prediction of Newly Released Contents for Online Video Streaming Service
Authors:
Hongjun Jeon,
Wonchul Seo,
Eunjeong Lucy Park,
Sungchul Choi
Abstract:
In the video content industry, which includes VOD and IPTV providers, predicting the popularity of video content in advance is critical not only from a marketing perspective but also from a network optimization perspective. By predicting in advance whether a piece of content will be successful, the large content file can be deployed efficiently on the appropriate service server, which optimizes network cost. Many previous studies have addressed this through view count prediction. However, those studies base their predictions on historical view count data from users, which means the content has already been published to users and deployed on a service server. Such approaches make it possible to efficiently deploy content that is already published, but they cannot be used for content that has not yet been published. To address this problem, this research proposes a hybrid machine learning approach to a classification model for predicting the popularity of new video content that has not yet been published. In this paper, we create a new variable based on the content related to a given piece of content and divide the entire dataset by the characteristics of the content. Next, prediction is performed using XGBoost and a deep neural network based model, according to the data characteristics of each cluster. Our model uses content metadata for prediction, so we use categorical embedding techniques to address the sparsity of categorical variables and let the deep neural network model learn from them efficiently. We also use the FTRL-proximal algorithm to address the volatility of video content view counts. We achieve better overall performance than the previous standalone method on a dataset from one of the top streaming service companies.
Submitted 28 January, 2019;
originally announced January 2019.