-
A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Authors:
Masaaki Nagata,
Makoto Morishita,
Katsuki Chousa,
Norihito Yasuda
Abstract:
Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.
Submitted 14 May, 2024;
originally announced May 2024.
-
Performance Evaluation of CMOS Annealing with Support Vector Machine
Authors:
Ryoga Fukuhara,
Makoto Morishita,
Takahiro Katagiri,
Masatoshi Kawai,
Toru Nagai,
Tetsuya Hoshino
Abstract:
In this paper, support vector machine (SVM) performance was assessed utilizing a quantum-inspired complementary metal-oxide semiconductor (CMOS) annealer. The primary focus during performance evaluation was the accuracy rate in binary classification problems. A comparative analysis was conducted between SVM running on a CPU (classical computation) and executed on a quantum-inspired annealer. The performance outcome was evaluated using a CMOS annealing machine, thereby obtaining an accuracy rate of 93.7% for linearly separable problems, 92.7% for non-linearly separable problem 1, and 97.6% for non-linearly separable problem 2. These results reveal that a CMOS annealing machine can achieve an accuracy rate that closely rivals that of classical computation.
Submitted 24 April, 2024;
originally announced April 2024.
-
WikiSplit++: Easy Data Refinement for Split and Rephrase
Authors:
Hayato Tsukagoshi,
Tsutomu Hirao,
Makoto Morishita,
Katsuki Chousa,
Ryohei Sasano,
Koichi Takeda
Abstract:
The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
Submitted 13 April, 2024;
originally announced April 2024.
-
Generating Diverse Translation with Perturbed kNN-MT
Authors:
Yuto Nishida,
Makoto Morishita,
Hidetaka Kamigaito,
Taro Watanabe
Abstract:
Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem -- the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation's magnitude.
Submitted 14 February, 2024;
originally announced February 2024.
-
Refactoring Programs Using Large Language Models with Few-Shot Examples
Authors:
Atsushi Shirafuji,
Yusuke Oda,
Jun Suzuki,
Makoto Morishita,
Yutaka Watanobe
Abstract:
A less complex and more straightforward program is a crucial factor that enhances its maintainability and makes writing secure and bug-free programs easier. However, due to its heavy workload and the risks of breaking the working programs, programmers are reluctant to do code refactoring, and thus, it also causes the loss of potential learning experiences. To mitigate this, we demonstrate the application of using a large language model (LLM), GPT-3.5, to suggest less complex versions of the user-written Python program, aiming to encourage users to learn how to write better programs. We propose a method to leverage the prompting with few-shot examples of the LLM by selecting the best-suited code refactoring examples for each target programming problem based on the prior evaluation of prompting with the one-shot example. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in the average cyclomatic complexity and a 25.84% decrease in the average number of lines after filtering only generated programs that are semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, while unnecessary behaviors such as deleting or translating comments are also observed.
Submitted 20 November, 2023;
originally announced November 2023.
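The refactoring abstract above reports a reduction in average cyclomatic complexity. As a minimal sketch of what such a measurement involves, the following counts decision points in a Python program with the standard `ast` module. The function name `approx_cyclomatic_complexity` and the set of counted node types are assumptions made for illustration; this is a rough approximation, not the tool or exact metric definition used in the paper.

```python
import ast

# Node types treated as decision points in this sketch (an assumption;
# real complexity tools count a somewhat different set of constructs).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    score = 1
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score


BEFORE = """
def clamp(x):
    if x > 0:
        if x > 10:
            return 10
        else:
            return x
    else:
        return 0
"""

AFTER = """
def clamp(x):
    return min(max(x, 0), 10)
"""

before_score = approx_cyclomatic_complexity(BEFORE)
after_score = approx_cyclomatic_complexity(AFTER)
```

Comparing `before_score` with `after_score` for an original program and a candidate refactoring gives a simple signal of whether the candidate is structurally simpler; checking that the two programs behave the same is a separate step.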
-
Chat Translation Error Detection for Assisting Cross-lingual Communications
Authors:
Yunmeng Li,
Jun Suzuki,
Makoto Morishita,
Kaori Abe,
Ryoko Tokuhisa,
Ana Brassard,
Kentaro Inui
Abstract:
In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate cross-lingual communication, given the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multiturn colloquial chats augmented with crowdsourced quality ratings. The error detector can serve as an encouraging foundation for more advanced erroneous translation detection systems.
Submitted 2 August, 2023;
originally announced August 2023.
-
Exploring the Robustness of Large Language Models for Solving Programming Problems
Authors:
Atsushi Shirafuji,
Yutaka Watanobe,
Takumi Ito,
Makoto Morishita,
Yuki Nakamura,
Yusuke Oda,
Jun Suzuki
Abstract:
Using large language models (LLMs) for source code has recently gained attention. LLMs, such as Transformer-based models like Codex and ChatGPT, have been shown to be highly capable of solving a wide range of programming problems. However, the extent to which LLMs understand problem descriptions and generate programs accordingly, or just retrieve source code from the most relevant problem in training data based on superficial cues, remains unclear. To explore this research question, we conduct experiments to understand the robustness of several popular LLMs, CodeGen and GPT-3.5 series models, capable of tackling code generation tasks in introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, which significantly impact code generation performance. Furthermore, we observe that Codex relies on variable names, as randomized variables decrease the solved rate significantly. However, the state-of-the-art (SOTA) models, such as InstructGPT and ChatGPT, show higher robustness to superficial modifications and have an outstanding capability for solving programming problems. This highlights the fact that slight modifications to the prompts given to the LLMs can greatly affect code generation performance, and careful formatting of prompts is essential for high-quality code generation, while the SOTA models are becoming more robust to perturbations.
Submitted 26 June, 2023;
originally announced June 2023.
-
Domain Adaptation of Machine Translation with Crowdworkers
Authors:
Makoto Morishita,
Jun Suzuki,
Masaaki Nagata
Abstract:
Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain's data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved the BLEU scores by up to +19.7 points and by an average of +7.8 points compared to a general-purpose translation model.
Submitted 27 October, 2022;
originally announced October 2022.
-
JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus
Authors:
Makoto Morishita,
Katsuki Chousa,
Jun Suzuki,
Masaaki Nagata
Abstract:
Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available, compared to such resource-rich languages as English-German. It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0. Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new corpus boosts the accuracy of machine translation models on various domains. The JParaCrawl v3.0 corpus will eventually be publicly available online for research purposes.
Submitted 28 February, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Input Augmentation Improves Constrained Beam Search for Neural Machine Translation: NTT at WAT 2021
Authors:
Katsuki Chousa,
Makoto Morishita
Abstract:
This paper describes our systems that were submitted to the restricted translation task at WAT 2021. In this task, the systems are required to output translated sentences that contain all given word constraints. Our system combined input augmentation and constrained beam search algorithms. Through experiments, we found that this combination significantly improves translation accuracy and can save inference time while containing all the constraints in the output. For both En->Ja and Ja->En, our systems obtained the best evaluation performances in automatic evaluation.
Submitted 9 June, 2021;
originally announced June 2021.
-
PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
Authors:
Ryo Fujii,
Masato Mita,
Kaori Abe,
Kazuaki Hanawa,
Makoto Morishita,
Jun Suzuki,
Kentaro Inui
Abstract:
Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
Submitted 3 November, 2020;
originally announced November 2020.
-
Recovery command generation towards automatic recovery in ICT systems by Seq2Seq learning
Authors:
Hiroki Ikeuchi,
Akio Watanabe,
Tsutomu Hirao,
Makoto Morishita,
Masaaki Nishino,
Yoichi Matsuo,
Keishiro Watanabe
Abstract:
With the increase in scale and complexity of ICT systems, their operation increasingly requires automatic recovery from failures. Although it has become possible to automatically detect anomalies and analyze root causes of failures with current methods, making decisions on what commands should be executed to recover from failures still depends on manual operation, which is quite time-consuming. Toward automatic recovery, we propose a method of estimating recovery commands by using Seq2Seq, a neural network model. This model learns complex relationships between logs obtained from equipment and recovery commands that operators executed in the past. When a new failure occurs, our method estimates plausible commands that recover from the failure on the basis of collected logs. We conducted experiments using a synthetic dataset and realistic OpenStack dataset, demonstrating that our method can estimate recovery commands with high accuracy.
Submitted 24 March, 2020;
originally announced March 2020.
-
JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
Authors:
Makoto Morishita,
Jun Suzuki,
Masaaki Nagata
Abstract:
Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approaches achieved or surpassed performance comparable to model training from the initial state and reduced the training time. Additionally, we trained the model with an in-domain dataset and JParaCrawl to show how we achieved the best performance with them. JParaCrawl and the pre-trained models are freely available online for research purposes.
Submitted 15 March, 2020; v1 submitted 24 November, 2019;
originally announced November 2019.
-
NTT's Machine Translation Systems for WMT19 Robustness Task
Authors:
Soichiro Murakami,
Makoto Morishita,
Tsutomu Hirao,
Masaaki Nagata
Abstract:
This paper describes NTT's submission to the WMT19 robustness task. This task mainly focuses on translating noisy text (e.g., posts on Twitter), which presents different difficulties from typical translation tasks such as news. Our submission combined techniques including utilization of a synthetic corpus, domain adaptation, and a placeholder mechanism, which significantly improved over the previous baseline. Experimental results revealed the placeholder mechanism, which temporarily replaces the non-standard tokens including emojis and emoticons with special placeholder tokens during translation, improves translation accuracy even with noisy texts.
Submitted 8 July, 2019;
originally announced July 2019.
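The WMT19 abstract above describes a placeholder mechanism that temporarily swaps non-standard tokens such as emojis and emoticons for special tokens during translation. The idea might be sketched as below. The pattern `NONSTANDARD`, the `<PHn>` token format, and the helper names `mask` and `unmask` are all hypothetical choices for illustration; the abstract does not specify how the submitted system detects or encodes these tokens.

```python
import re

# Illustrative pattern only: a few ASCII emoticons plus one broad emoji range.
NONSTANDARD = re.compile(r"[:;]-?[)(DP]|[\U0001F300-\U0001FAFF]")


def mask(text):
    """Replace each non-standard token with an indexed placeholder."""
    found = []

    def repl(match):
        found.append(match.group(0))
        return f"<PH{len(found) - 1}>"

    return NONSTANDARD.sub(repl, text), found


def unmask(text, found):
    """Put the original tokens back wherever their placeholders appear."""
    for index, token in enumerate(found):
        text = text.replace(f"<PH{index}>", token)
    return text


masked, tokens = mask("great game :) \U0001F600")
# A translation system would receive `masked`; here a hand-written
# stand-in output plays the role of its translation.
restored = unmask("<PH1> すごい試合 <PH0>", tokens)
```

Because the placeholders are indexed, the original tokens can be restored even when the translation reorders them, as in the stand-in output above.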
-
An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation
Authors:
Makoto Morishita,
Yusuke Oda,
Graham Neubig,
Koichiro Yoshino,
Katsuhito Sudoh,
Satoshi Nakamura
Abstract:
Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and some length-based sorting strategies do not always work well compared with simple shuffling.
Submitted 18 June, 2017;
originally announced June 2017.
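The mini-batch abstract above notes that sorting a corpus by sentence length before batching reduces the amount of padding. A small sketch of that padding arithmetic follows, assuming fixed-size batches formed from consecutive sentences. The helper names `padding_tokens` and `compare_strategies` are invented for illustration, and the sketch measures padding only; it says nothing about training quality, which the abstract reports does not always favor length-based sorting.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when consecutive sentences form each batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Every sentence is padded up to the longest one in its batch.
        total += sum(longest - n for n in batch)
    return total


def compare_strategies(lengths, batch_size, seed=0):
    """Compare padding for a shuffled corpus and a length-sorted corpus."""
    rng = random.Random(seed)
    shuffled = list(lengths)
    rng.shuffle(shuffled)
    return {
        "shuffled": padding_tokens(shuffled, batch_size),
        "sorted": padding_tokens(sorted(lengths), batch_size),
    }


sentence_lengths = [3, 10, 4, 9, 5, 8]
result = compare_strategies(sentence_lengths, batch_size=2)
```

In this toy corpus, sorting places sentences of similar length in the same batch, so fewer pad tokens are needed than with the original or a shuffled order.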
-
Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015
Authors:
Graham Neubig,
Makoto Morishita,
Satoshi Nakamura
Abstract:
This year, the Nara Institute of Science and Technology (NAIST)'s submission to the 2015 Workshop on Asian Translation was based on syntax-based statistical machine translation, with the addition of a reranking component using neural attentional machine translation models. Experiments re-confirmed results from previous work stating that neural MT reranking provides a large gain in objective evaluation measures such as BLEU, and also confirmed for the first time that these results also carry over to manual evaluation. We further perform a detailed analysis of reasons for this increase, finding that the main contributions of the neural models lie in improvement of the grammatical correctness of the output, as opposed to improvements in lexical choice of content words.
Submitted 18 October, 2015;
originally announced October 2015.