-
Social Bias Evaluation for Large Language Models Requires Prompt Variations
Authors:
Rem Hida,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Warning: This paper contains examples of stereotypes and biases. Large Language Models (LLMs) exhibit considerable social biases, and various studies have tried to evaluate and mitigate these biases accurately. Previous studies use downstream tasks as prompts to examine the degree of social biases for evaluation and mitigation. While LLMs' output highly depends on prompts, previous studies evaluating and mitigating bias have often relied on a limited variety of prompts. In this paper, we investigate the sensitivity of LLMs when changing prompt variations (task instruction and prompt, few-shot examples, debias-prompt) by analyzing task performance and social bias of LLMs. Our experimental results reveal that LLMs are highly sensitive to prompts to the extent that the ranking of LLMs fluctuates when comparing models for task performance and social bias. Additionally, we show that LLMs have tradeoffs between performance and social bias caused by the prompts. Less bias from prompt setting may result in reduced performance. Moreover, the ambiguity of instances is one of the reasons for this sensitivity to prompts in advanced LLMs, leading to various outputs. We recommend using diverse prompts, as in this study, to compare the effects of prompts on social bias in LLMs.
Submitted 3 July, 2024;
originally announced July 2024.
-
Sampling-based Pseudo-Likelihood for Membership Inference Attacks
Authors:
Masahiro Kaneko,
Youmi Ma,
Yuki Wata,
Naoaki Okazaki
Abstract:
Large Language Models (LLMs) are trained on large-scale web data, which makes it difficult to grasp the contribution of each text. This poses the risk of leaking inappropriate data such as benchmarks, personal information, and copyrighted texts in the training data. Membership Inference Attacks (MIA), which determine whether a given text is included in the model's training data, have been attracting attention. Previous studies of MIAs revealed that likelihood-based classification is effective for detecting leaks in LLMs. However, the existing methods cannot be applied to some proprietary models like ChatGPT or Claude 3 because the likelihood is unavailable to the user. In this study, we propose a Sampling-based Pseudo-Likelihood (SPL) method for MIA (SaMIA) that calculates SPL using only the text generated by an LLM to detect leaks. SaMIA treats the target text as the reference text and multiple outputs from the LLM as text samples, calculates the degree of n-gram match as SPL, and determines the membership of the text in the training data. Even without likelihoods, SaMIA performed on par with existing likelihood-based methods.
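The scoring step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the matching function, so clipped n-gram recall, whitespace tokenization, the function names, and the 0.5 threshold are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(reference, sample, n=1):
    """Fraction of reference n-grams also found in the sample (clipped counts)."""
    ref = ngrams(reference.split(), n)
    hyp = ngrams(sample.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / total


def pseudo_likelihood(target_text, samples, n=1):
    """Average n-gram match between the target text and each sampled output."""
    if not samples:
        return 0.0
    return sum(ngram_recall(target_text, s, n) for s in samples) / len(samples)


def is_member(target_text, samples, threshold=0.5, n=1):
    """Flag the text as likely training data when the score exceeds the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return pseudo_likelihood(target_text, samples, n) > threshold
```

In practice the samples would be texts generated by the model under test; here they are plain strings so the sketch needs no model access.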
Submitted 17 April, 2024;
originally announced April 2024.
-
A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish
Authors:
Masahiro Kaneko,
Timothy Baldwin
Abstract:
Large Language Models (LLMs) are trained on massive web-crawled corpora. This poses risks of leakage, including personal information, copyrighted texts, and benchmark datasets. Such leakage undermines human trust in AI because of potential unauthorized generation of content or overestimation of performance. We establish the following three criteria concerning the leakage issues: (1) leakage rate: the proportion of leaked data in training data, (2) output rate: the ease of generating leaked data, and (3) detection rate: the detection performance of leaked versus non-leaked data. Despite the leakage rate being the origin of data leakage issues, it is not understood how it affects the output rate and detection rate. In this paper, we conduct an experimental survey to elucidate the relationship between the leakage rate and both the output rate and detection rate for personal information, copyrighted texts, and benchmark data. Additionally, we propose a self-detection approach that uses few-shot learning in which LLMs detect whether instances are present or absent in their training data, in contrast to previous methods that do not employ explicit learning. To explore the ease of generating leaked information, we create a dataset of prompts designed to elicit personal information, copyrighted text, and benchmarks from LLMs. Our experiments reveal that LLMs produce leaked information in most cases even though their training data contains little such data. This indicates that even small amounts of leaked data can greatly affect outputs. Our self-detection method showed superior performance compared to existing detection methods.
Submitted 24 March, 2024;
originally announced March 2024.
-
Robust Locomotion via Zero-order Stochastic Nonlinear Model Predictive Control with Guard Saltation Matrix
Authors:
Sotaro Katayama,
Noriaki Takasugi,
Mitsuhisa Kaneko,
Norio Nagatsuka,
Masaya Kinoshita
Abstract:
This paper presents a stochastic/robust nonlinear model predictive control (NMPC) to enhance the robustness of legged locomotion against contact uncertainties. We integrate the contact uncertainties into the covariance propagation of the stochastic/robust NMPC framework by leveraging the guard saltation matrix and an extended Kalman filter-like covariance update. We achieve fast stochastic/robust NMPC computation by utilizing the zero-order stochastic/robust NMPC algorithm with additional improvements in computational efficiency concerning the feedback gains. We conducted numerical experiments and demonstrated that the proposed method can accurately forecast future state covariance and generate trajectories that satisfy constraints even in the presence of the contact uncertainties. Hardware experiments on the perceptive locomotion of a wheeled-legged robot were also carried out, validating the feasibility of the proposed method in a real-world system with limited on-board computation.
Submitted 21 March, 2024;
originally announced March 2024.
-
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
Authors:
Masanari Ohi,
Masahiro Kaneko,
Ryuto Koike,
Mengsay Loem,
Naoaki Okazaki
Abstract:
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of an LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias and also significantly improves evaluation performance (in terms of correlation of models with human scores).
Submitted 1 March, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
Eagle: Ethical Dataset Given from Real Interactions
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Timothy Baldwin
Abstract:
Recent studies have demonstrated that large language models (LLMs) have ethics-related problems such as social biases, lack of moral reasoning, and generation of offensive content. The existing evaluation metrics and methods to address these ethical challenges use datasets intentionally created by instructing humans to create instances including ethical problems. Therefore, the data does not reflect prompts that users actually provide when utilizing LLM services in everyday contexts. This may not lead to the development of safe LLMs that can address ethical challenges arising in real-world applications. In this paper, we create Eagle datasets extracted from real interactions between ChatGPT and users that exhibit social biases, toxicity, and immoral problems. Our experiments show that Eagle captures complementary aspects not covered by existing datasets proposed for evaluation and mitigation of such ethical challenges. Our code is publicly available at https://huggingface.co/datasets/MasahiroKaneko/eagle.
Submitted 21 February, 2024;
originally announced February 2024.
-
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Naoaki Okazaki,
Timothy Baldwin
Abstract:
There exist both scalable tasks, like reading comprehension and fact-checking, where model performance improves with model size, and unscalable tasks, like arithmetic reasoning and symbolic reasoning, where model performance does not necessarily improve with model size. Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks. Unfortunately, despite their exceptional reasoning abilities, LLMs tend to internalize and reproduce discriminatory societal biases. Whether CoT can provide discriminatory or egalitarian rationalizations for the implicit information in unscalable tasks remains an open question.
In this study, we examine the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks. For this purpose, we construct a benchmark for an unscalable task where the LLM is given a list of words comprising feminine, masculine, and gendered occupational words, and is required to count the number of feminine and masculine words. In our CoT prompts, we require the LLM to explicitly indicate whether each word in the word list is feminine or masculine before making the final predictions. Because it involves both counting and handling the meaning of words, this benchmark has characteristics of both arithmetic reasoning and symbolic reasoning. Experimental results in English show that without step-by-step prediction, most LLMs make socially biased predictions, despite the task being as simple as counting words. Interestingly, CoT prompting reduces this unconscious social bias in LLMs and encourages fair predictions.
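As a rough illustration of the counting task described above, the sketch below builds one instance and its two prompt variants. The word lists, the prompt wording, and the choice to treat occupational words as neither feminine nor masculine in the gold answer are assumptions made for this example; the benchmark's actual vocabulary and prompts are not given in the abstract.

```python
# Hypothetical word lists; the benchmark's actual vocabulary is not shown here.
FEMININE = {"woman", "mother", "queen"}
MASCULINE = {"man", "father", "king"}
OCCUPATIONS = {"nurse", "engineer", "secretary"}


def gold_counts(words):
    """Count feminine and masculine words.

    Assumption: occupational words count as neither, so a model that counts
    "nurse" as feminine would deviate from this gold answer.
    """
    feminine = sum(word in FEMININE for word in words)
    masculine = sum(word in MASCULINE for word in words)
    return feminine, masculine


def build_prompt(words, chain_of_thought):
    """Build a counting prompt, optionally asking for a per-word label first."""
    word_list = ", ".join(words)
    if chain_of_thought:
        instruction = (
            "For each word, state whether it is feminine, masculine, or neither. "
            "Then give the number of feminine words and the number of masculine words."
        )
    else:
        instruction = (
            "Give the number of feminine words and the number of masculine words."
        )
    return f"Word list: {word_list}\n{instruction}"
```

Comparing a model's answers to `gold_counts` under both prompt variants would give one simple way to see whether the step-by-step variant changes how occupational words are counted.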
Submitted 28 January, 2024;
originally announced January 2024.
-
The Gaps between Pre-train and Downstream Settings in Bias Evaluation and Debiasing
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Timothy Baldwin
Abstract:
The output tendencies of Pre-trained Language Models (PLM) vary markedly before and after Fine-Tuning (FT) due to the updates to the model parameters. These divergences in output tendencies result in a gap in the social biases of PLMs. For example, there exists a low correlation between intrinsic bias scores of a PLM and its extrinsic bias scores under FT-based debiasing methods. Additionally, applying FT-based debiasing methods to a PLM leads to a decline in performance in downstream tasks. On the other hand, PLMs trained on large datasets can learn without parameter updates via In-Context Learning (ICL) using prompts. ICL induces smaller changes to PLMs compared to FT-based debiasing methods. Therefore, we hypothesize that the gap observed in pre-trained and FT models does not hold true for debiasing methods that use ICL. In this study, we demonstrate that ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods. Moreover, the performance degradation due to debiasing is also lower in the ICL case compared to that in the FT case.
Submitted 16 January, 2024;
originally announced January 2024.
-
Versatile Telescopic-Wheeled-Legged Locomotion of Tachyon 3 via Full-Centroidal Nonlinear Model Predictive Control
Authors:
Sotaro Katayama,
Noriaki Takasugi,
Mitsuhisa Kaneko,
Masaya Kinoshita
Abstract:
This paper presents a nonlinear model predictive control (NMPC) toward versatile motion generation for the telescopic-wheeled-legged robot Tachyon 3, the unique hardware structure of which poses challenges in control and motion planning. We apply the full-centroidal NMPC formulation with dedicated constraints that can capture the accurate kinematics and dynamics of Tachyon 3. We have developed a control pipeline that includes an internal state integrator to apply NMPC to Tachyon 3, the actuators of which employ high-gain position-controllers. We conducted simulation and hardware experiments on the perceptive locomotion of Tachyon 3 over structured terrains and demonstrated that the proposed method can achieve smooth and dynamic motion generation under harsh physical and environmental constraints.
Submitted 14 December, 2023;
originally announced December 2023.
-
How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection
Authors:
Ryuto Koike,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
To combat the misuse of Large Language Models (LLMs), many recent studies have presented LLM-generated-text detectors with promising performance. When users instruct LLMs to generate texts, the instruction can include different constraints depending on the user's need. However, most recent studies do not cover such diverse instruction patterns when creating datasets for LLM detection. In this paper, we reveal that even task-oriented constraints -- constraints that would naturally be included in an instruction and are not related to detection-evasion -- cause existing powerful detectors to have a large variance in detection performance. We focus on student essay writing as a realistic domain and manually create task-oriented constraints based on several factors for essay quality. Our experiments show that the standard deviation (SD) of current detector performance on texts generated by an instruction with such a constraint is significantly larger (up to an SD of 14.4 F1-score) than that by generating texts multiple times or paraphrasing the instruction. We also observe an overall trend where the constraints can make LLM detection more challenging than without them. Finally, our analysis indicates that the high instruction-following ability of LLMs fosters the large impact of such constraints on detection performance.
Submitted 12 June, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
SAIE Framework: Support Alone Isn't Enough -- Advancing LLM Training with Adversarial Remarks
Authors:
Mengsay Loem,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Large Language Models (LLMs) can justify or critique their predictions through discussions with other models or humans, thereby enriching their intrinsic understanding of instances. While proactive discussions in the inference phase have been shown to boost performance, such interactions have not been extensively explored during the training phase. We hypothesize that incorporating interactive discussions into the training process can enhance the models' understanding and improve their reasoning and verbal expression abilities during inference. This work introduces the SAIE framework, which facilitates supportive and adversarial discussions between learner and partner models. The learner model receives responses from the partner, and its parameters are then updated based on this discussion. This dynamic adjustment process continues throughout the training phase, responding to the evolving outputs of the learner model. Our empirical evaluation across various tasks, including math problems, commonsense reasoning, and multi-domain knowledge, demonstrates that models fine-tuned with the SAIE framework outperform those trained with conventional fine-tuning approaches. Furthermore, our method enhances the models' reasoning capabilities, improving both individual and multi-agent inference performance.
Submitted 29 February, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Controlled Generation with Prompt Insertion for Natural Language Explanations in Grammatical Error Correction
Authors:
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
In Grammatical Error Correction (GEC), it is crucial to ensure the user's comprehension of a reason for correction. Existing studies present tokens, examples, and hints as the basis for correction but do not directly explain the reasons for corrections. Although methods that use Large Language Models (LLMs) to provide direct explanations in natural language have been proposed for various tasks, no such method exists for GEC. Generating explanations for GEC corrections involves aligning input and output tokens, identifying correction points, and presenting corresponding explanations consistently. However, it is not straightforward to specify a complex format to generate explanations, because explicit control of generation is difficult with prompts. This study introduces a method called controlled generation with Prompt Insertion (PI) so that LLMs can explain the reasons for corrections in natural language. In PI, LLMs first correct the input text, and then we automatically extract the correction points based on rules. The extracted correction points are sequentially inserted into the LLM's explanation output as prompts, guiding the LLMs to generate explanations for the correction points. We also create an Explainable GEC (XGEC) dataset of correction reasons by annotating NUCLE, CoNLL2013, and CoNLL2014. Although generations from GPT-3 and ChatGPT using original prompts miss some correction points, generation control using PI can explicitly guide the models to produce explanations for all correction points, contributing to improved performance in generating correction reasons.
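The insertion loop described above might look roughly like the following. This is a sketch under stated assumptions rather than the paper's code: the prompt layout, the function name, and the `generate` callable (standing in for any LLM completion call) are hypothetical, and the correction points are taken as already extracted.

```python
def explain_with_prompt_insertion(source, corrected, correction_points, generate):
    """Generate one explanation per correction point by inserting each point as a prompt.

    `correction_points` is a list of (before, after) pairs, assumed to be
    extracted beforehand. `generate` is a hypothetical callable that takes the
    text so far and returns the model's continuation.
    """
    context = f"Source: {source}\nCorrected: {corrected}\nExplanations:\n"
    explanations = []
    for before, after in correction_points:
        # Insert the correction point so the model must explain this exact edit.
        context += f"{before} -> {after}: "
        explanation = generate(context).strip()
        # Keep the explanation in the context so later points see earlier output.
        context += explanation + "\n"
        explanations.append((before, after, explanation))
    return explanations
```

Because every correction point is written into the output before the model continues, no point can be skipped, which is the property the abstract attributes to PI.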
Submitted 20 September, 2023;
originally announced September 2023.
-
Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels
Authors:
Panatchakorn Anantaprayoon,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Discriminatory gender biases have been found in Pre-trained Language Models (PLMs) for multiple languages. In Natural Language Inference (NLI), existing bias evaluation methods have focused on the prediction results of one specific label out of three labels, such as neutral. However, such evaluation methods can be inaccurate since unique biased inferences are associated with unique prediction labels. Addressing this limitation, we propose a bias evaluation method for PLMs, called NLI-CoAL, which considers all three labels of the NLI task. First, we create three evaluation data groups that represent different types of biases. Then, we define a bias measure based on the corresponding label output of each data group. In the experiments, we introduce a meta-evaluation technique for NLI bias measures and use it to confirm that our bias measure can distinguish biased, incorrect inferences from non-biased incorrect inferences better than the baseline, resulting in a more accurate bias evaluation. We create the datasets in English, Japanese, and Chinese, and successfully validate the compatibility of our bias measure across multiple languages. Lastly, we observe the bias tendencies in PLMs of different languages. To our knowledge, we are the first to construct evaluation datasets and measure PLMs' bias from NLI in Japanese and Chinese.
Submitted 18 May, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Naoaki Okazaki
Abstract:
Pre-trained language models trained on large-scale data have learned serious levels of social biases. Consequently, various methods have been proposed to debias pre-trained models. Debiasing methods need to mitigate only discriminatory bias information from the pre-trained models, while retaining information that is useful for the downstream tasks. In previous research, whether useful information is retained has been confirmed by the performance of downstream tasks in debiased pre-trained models. On the other hand, it is not clear whether these benchmarks consist of data pertaining to social biases and are appropriate for investigating the impact of debiasing. For example, in gender-related social biases, data containing female words (e.g. "she, female, woman"), male words (e.g. "he, male, man"), and stereotypical words (e.g. "nurse, doctor, professor") are considered to be the most affected by debiasing. If there is not much data containing these words in a benchmark dataset for a target task, there is the possibility of erroneously evaluating the effects of debiasing. In this study, we compare the impact of debiasing on performance across multiple downstream tasks using a wide range of benchmark datasets containing female, male, and stereotypical words. Experiments show that the effects of debiasing are consistently underestimated across all tasks. Moreover, the effects of debiasing could be reliably evaluated by separately considering instances containing female, male, and stereotypical words rather than all of the instances in a benchmark dataset.
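The separate evaluation suggested above amounts to partitioning benchmark instances by the word groups they contain. A minimal sketch follows, using only the example words quoted in the abstract; the regex tokenizer, the function names, and the dictionary layout are assumptions made for this illustration, and the real word lists are presumably larger.

```python
import re

# Word groups taken from the examples quoted in the abstract.
WORD_GROUPS = {
    "female": {"she", "female", "woman"},
    "male": {"he", "male", "man"},
    "stereotype": {"nurse", "doctor", "professor"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def split_by_word_group(instances):
    """Map each group name to the instances containing at least one of its words.

    An instance can fall into several groups, or into none.
    """
    groups = {name: [] for name in WORD_GROUPS}
    for text in instances:
        tokens = tokenize(text)
        for name, vocabulary in WORD_GROUPS.items():
            if tokens & vocabulary:
                groups[name].append(text)
    return groups
```

Matching on whole tokens rather than substrings matters here: "she" and "the" both contain "he", but neither should count as a male word. Task performance can then be computed on each subset before and after debiasing.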
Submitted 16 September, 2023;
originally announced September 2023.
-
In-Contextual Gender Bias Suppression for Large Language Models
Authors:
Daisuke Oba,
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Despite their impressive performance in a wide range of NLP tasks, Large Language Models (LLMs) have been reported to encode worrying levels of gender biases. Prior work has proposed debiasing methods that require human-labelled examples, data augmentation and fine-tuning of LLMs, which are computationally costly. Moreover, one might not even have access to the model parameters for performing debiasing, as in the case of closed LLMs such as GPT-4. To address this challenge, we propose bias suppression that prevents biased generations of LLMs by simply providing textual preambles constructed from manually designed templates and real-world statistics, without access to model parameters. We show that, using the CrowsPairs dataset, our textual preambles covering counterfactual statements can suppress gender biases in English LLMs such as LLaMA2. Moreover, we find that gender-neutral descriptions of gender-biased objects can also suppress their gender biases. Finally, we show that bias suppression has an acceptable adverse effect on downstream task performance with HellaSwag and COPA.
Submitted 20 February, 2024; v1 submitted 13 September, 2023;
originally announced September 2023.
-
OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples
Authors:
Ryuto Koike,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: their detection accuracy degrades when LLM-generated texts are simply paraphrased. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection.
Submitted 18 February, 2024; v1 submitted 21 July, 2023;
originally announced July 2023.
-
Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods
Authors:
Mengsay Loem,
Masahiro Kaneko,
Sho Takase,
Naoaki Okazaki
Abstract:
Large-scale pre-trained language models such as GPT-3 have shown remarkable performance across various natural language processing tasks. However, applying prompt-based methods with GPT-3 for Grammatical Error Correction (GEC) tasks and their controllability remains underexplored. Controllability in GEC is crucial for real-world applications, particularly in educational settings, where the ability to tailor feedback according to learner levels and specific error types can significantly enhance the learning process. This paper investigates the performance and controllability of prompt-based methods with GPT-3 for GEC tasks using zero-shot and few-shot settings. We explore the impact of task instructions and examples on GPT-3's output, focusing on controlling aspects such as minimal edits, fluency edits, and learner levels. Our findings demonstrate that GPT-3 could effectively perform GEC tasks, outperforming existing supervised and unsupervised approaches. We also show that GPT-3 could achieve controllability when appropriate task instructions and examples are given.
Submitted 29 May, 2023;
originally announced May 2023.
-
Reducing Sequence Length by Predicting Edit Operations with Large Language Models
Authors:
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks and gained significant attention. LLMs are also used for local sequence transduction tasks, including grammatical error correction (GEC) and formality style transfer, where most tokens in a source text are kept unchanged. However, the models that generate all target tokens in such tasks have a tendency to simply copy the input text as is, without making needed changes, because the difference between input and output texts is minimal in the training data. This is also inefficient because the computational cost grows quadratically with the target sequence length with Transformer. This paper proposes predicting edit spans for the source text for local sequence transduction tasks. Representing an edit span with a position of the source text and corrected tokens, we can reduce the length of the target sequence and the computational cost for inference. We apply instruction tuning for LLMs on the supervision data of edit spans. Experiments show that the proposed method achieves comparable performance to the baseline in four tasks, paraphrasing, formality style transfer, GEC, and text simplification, despite reducing the length of the target text by as small as 21%. Furthermore, we report that the task-specific fine-tuning with the proposed method achieved state-of-the-art performance in the four tasks.
Submitted 20 October, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
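The edit-span idea in the abstract above can be illustrated with a minimal sketch. The abstract only says that an edit span pairs a position in the source text with corrected tokens; the `(start, end, replacement_tokens)` tuple format, the function names `extract_edit_spans` and `apply_edit_spans`, and the use of Python's `difflib` for alignment are all assumptions made here for illustration, not the authors' implementation.

```python
from difflib import SequenceMatcher


def extract_edit_spans(src_tokens, tgt_tokens):
    """Return (start, end, replacement_tokens) for every non-matching region.

    Hypothetical span format: start/end index into src_tokens (end exclusive).
    """
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, delete, or insert
            spans.append((i1, i2, tgt_tokens[j1:j2]))
    return spans


def apply_edit_spans(src_tokens, spans):
    """Rebuild the target by applying spans from right to left,
    so earlier source indices stay valid while we splice."""
    out = list(src_tokens)
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        out[start:end] = replacement
    return out


src = "He go to school yesterday .".split()
tgt = "He went to school yesterday .".split()
spans = extract_edit_spans(src, tgt)
restored = apply_edit_spans(src, spans)
```

In this toy example a single short span stands in for the six-token target, which is the kind of length reduction the abstract describes for tasks where most source tokens are kept unchanged.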
-
Solving NLP Problems through Human-System Collaboration: A Discussion-based Approach
Authors:
Masahiro Kaneko,
Graham Neubig,
Naoaki Okazaki
Abstract:
Humans work together to solve common problems by having discussions, explaining, and agreeing or disagreeing with each other. Similarly, if a system can have discussions with humans when solving tasks, it can improve the system's performance and reliability. In previous research on explainability, it has only been possible for the system to make predictions and for humans to ask questions about them rather than having a mutual exchange of opinions. This research aims to create a dataset and computational framework for systems that discuss and refine their predictions through dialogue. Through experiments, we show that the proposed system can have beneficial discussions with humans improving the accuracy by up to 25 points in the natural language inference task.
Submitted 30 January, 2024; v1 submitted 19 May, 2023;
originally announced May 2023.
-
Comparing Intrinsic Gender Bias Evaluation Measures without using Human Annotated Examples
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Naoaki Okazaki
Abstract:
Numerous types of social biases have been identified in pre-trained language models (PLMs), and various intrinsic bias evaluation measures have been proposed for quantifying those social biases. Prior works have relied on human annotated examples to compare existing intrinsic bias evaluation measures. However, this approach is not easily adaptable to different languages nor amenable to large scale evaluations due to the costs and difficulties when recruiting human annotators. To overcome this limitation, we propose a method to compare intrinsic gender bias evaluation measures without relying on human-annotated examples. Specifically, we create multiple bias-controlled versions of PLMs using varying amounts of male vs. female gendered sentences, mined automatically from an unannotated corpus using gender-related word lists. Next, each bias-controlled PLM is evaluated using an intrinsic bias evaluation measure, and the rank correlation between the computed bias scores and the gender proportions used to fine-tune the PLMs is computed. Experiments on multiple corpora and PLMs repeatedly show that the correlations reported by our proposed method that does not require human annotated examples are comparable to those computed using human annotated examples in prior work.
Submitted 27 January, 2023;
originally announced January 2023.
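The comparison step in the abstract above (rank correlation between bias scores and the gender proportions used for fine-tuning) can be sketched as follows. The abstract does not name the correlation coefficient; Spearman's rho is assumed here, and the proportions and scores are made-up illustrative numbers, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical data: share of male-gendered sentences used to fine-tune
# each bias-controlled PLM, and the score a bias measure assigns to it.
proportions = [0.1, 0.3, 0.5, 0.7, 0.9]
bias_scores = [0.42, 0.47, 0.45, 0.58, 0.66]
rho = spearman(proportions, bias_scores)
```

Under this reading, a bias evaluation measure whose scores track the controlled proportions yields a rho close to 1, while a measure that ignores them yields a rho near 0.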
-
Debiasing isn't enough! -- On the Effectiveness of Debiasing MLMs and their Social Biases in Downstream Tasks
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Naoaki Okazaki
Abstract:
We study the relationship between task-agnostic intrinsic and task-specific extrinsic social bias evaluation measures for Masked Language Models (MLMs), and find that there exists only a weak correlation between these two types of evaluation measures. Moreover, we find that MLMs debiased using different methods still re-learn social biases during fine-tuning on downstream tasks. We identify the social biases in both training instances as well as their assigned labels as reasons for the discrepancy between intrinsic and extrinsic bias evaluation measurements. Overall, our findings highlight the limitations of existing MLM bias evaluation measures and raise concerns on the deployment of MLMs in downstream applications using those measures.
Submitted 6 October, 2022;
originally announced October 2022.
-
Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention
Authors:
Mengsay Loem,
Sho Takase,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Impressive performance of Transformer has been attributed to self-attention, where dependencies between entire input in a sequence are considered at every position. In this work, we reform the neural $n$-gram model, which focuses on only several surrounding representations of each position, with the multi-head mechanism as in Vaswani et al.(2017). Through experiments on sequence-to-sequence tasks, we show that replacing self-attention in Transformer with multi-head neural $n$-gram can achieve comparable or better performance than Transformer. From various analyses on our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention, and their combinations can further improve performance of vanilla Transformer.
Submitted 27 July, 2022;
originally announced July 2022.
-
Gender Bias in Meta-Embeddings
Authors:
Masahiro Kaneko,
Danushka Bollegala,
Naoaki Okazaki
Abstract:
Different methods have been proposed to develop meta-embeddings from a given set of source embeddings. However, the source embeddings can contain unfair gender-related biases, and how these influence the meta-embeddings has not been studied yet. We study the gender bias in meta-embeddings created under three different settings: (1) meta-embedding multiple sources without performing any debiasing (Multi-Source No-Debiasing), (2) meta-embedding multiple sources debiased by a single method (Multi-Source Single-Debiasing), and (3) meta-embedding a single source debiased by different methods (Single-Source Multi-Debiasing). Our experimental results show that meta-embedding amplifies the gender biases compared to input source embeddings. We find that debiasing not only the sources but also their meta-embedding is needed to mitigate those biases. Moreover, we propose a novel debiasing method based on meta-embedding learning where we use multiple debiasing methods on a single source embedding and then create a single unbiased meta-embedding.
Submitted 6 October, 2022; v1 submitted 19 May, 2022;
originally announced May 2022.
-
Gender Bias in Masked Language Models for Multiple Languages
Authors:
Masahiro Kaneko,
Aizhan Imankulova,
Danushka Bollegala,
Naoaki Okazaki
Abstract:
Masked Language Models (MLMs) pre-trained by predicting masked tokens on large corpora have been used successfully in natural language processing tasks for a variety of languages. Unfortunately, it was reported that MLMs also learn discriminative biases regarding attributes such as gender and race. Because most studies have focused on MLMs in English, the bias of MLMs in other languages has rarely been investigated. Manual annotation of evaluation data for languages other than English has been challenging due to the cost and difficulty in recruiting annotators. Moreover, the existing bias evaluation methods require stereotypical sentence pairs consisting of the same context with attribute words (e.g. He/She is a nurse). We propose the Multilingual Bias Evaluation (MBE) score to evaluate bias in various languages using only English attribute word lists and parallel corpora between the target language and English, without requiring manually annotated data. We evaluated MLMs in eight languages using the MBE and confirmed that gender-related biases are encoded in MLMs for all those languages. We manually created datasets for gender bias in Japanese and Russian to evaluate the validity of the MBE. The results show that the bias scores reported by the MBE significantly correlate with those computed from the above manually created datasets and the existing English datasets for gender bias.
Submitted 4 May, 2022; v1 submitted 1 May, 2022;
originally announced May 2022.
-
Sense Embeddings are also Biased--Evaluating Social Biases in Static and Contextualised Sense Embeddings
Authors:
Yi Zhou,
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Sense embedding learning methods learn different embeddings for the different senses of an ambiguous word. One sense of an ambiguous word might be socially biased while its other senses remain unbiased. In comparison to the numerous prior work evaluating the social biases in pretrained word embeddings, the biases in sense embeddings have been relatively understudied. We create a benchmark dataset for evaluating the social biases in sense embeddings and propose novel sense-specific bias evaluation measures. We conduct an extensive evaluation of multiple static and contextualised sense embeddings for various types of social biases using the proposed measures. Our experimental results show that even in cases where no biases are found at word-level, there still exist worrying levels of social biases at sense-level, which are often ignored by the word-level bias evaluation measures.
Submitted 16 March, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
Interpretability for Language Learners Using Example-Based Grammatical Error Correction
Authors:
Masahiro Kaneko,
Sho Takase,
Ayana Niwa,
Naoaki Okazaki
Abstract:
Grammatical Error Correction (GEC) should not focus only on high accuracy of corrections but also on interpretability for language learning. However, existing neural-based GEC models mainly aim at improving accuracy, and their interpretability has not been explored. A promising approach for improving interpretability is an example-based method, which uses similar retrieved examples to generate corrections. In addition, examples are beneficial in language learning, helping learners understand the basis of grammatically incorrect/correct texts and improve their confidence in writing. Therefore, we hypothesize that incorporating an example-based method into GEC can improve interpretability as well as support language learners. In this study, we introduce an Example-Based GEC (EB-GEC) that presents examples to language learners as a basis for a correction result. The examples consist of pairs of correct and incorrect sentences similar to a given input and its predicted correction. Experiments demonstrate that the examples presented by EB-GEC help language learners decide to accept or refuse suggestions from the GEC output. Furthermore, the experiments also show that retrieved examples improve the accuracy of corrections.
Submitted 14 March, 2022;
originally announced March 2022.
-
Proficiency Matters Quality Estimation in Grammatical Error Correction
Authors:
Yujin Takahashi,
Masahiro Kaneko,
Masato Mita,
Mamoru Komachi
Abstract:
This study investigates how supervised quality estimation (QE) models of grammatical error correction (GEC) are affected by the learners' proficiency with the data. QE models for GEC evaluations in prior work have obtained a high correlation with manual evaluations. However, when functioning in a real-world context, the data used for the reported results have limitations because prior works were biased toward data by learners with relatively high proficiency levels. To address this issue, we created a QE dataset that includes multiple proficiency levels and explored the necessity of performing proficiency-wise evaluation for QE of GEC. Our experiments demonstrated that differences in evaluation dataset proficiency affect the performance of QE models, and proficiency-wise evaluation helps create more robust models.
Submitted 16 January, 2022;
originally announced January 2022.
-
ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization
Authors:
Mengsay Loem,
Sho Takase,
Masahiro Kaneko,
Naoaki Okazaki
Abstract:
Neural models trained with large amount of parallel data have achieved impressive performance in abstractive summarization tasks. However, large-scale parallel corpora are expensive and challenging to construct. In this work, we introduce a low-cost and effective strategy, ExtraPhrase, to augment training data for abstractive summarization tasks. ExtraPhrase constructs pseudo training data in two steps: extractive summarization and paraphrasing. We extract major parts of an input text in the extractive summarization step, and obtain its diverse expressions with the paraphrasing step. Through experiments, we show that ExtraPhrase improves the performance of abstractive summarization tasks by more than 0.50 points in ROUGE scores compared to the setting without data augmentation. ExtraPhrase also outperforms existing methods such as back-translation and self-training. We also show that ExtraPhrase is significantly effective when the amount of genuine training data is remarkably small, i.e., a low-resource setting. Moreover, ExtraPhrase is more cost-efficient than the existing approaches.
Submitted 14 January, 2022;
originally announced January 2022.
-
Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation
Authors:
Seiichiro Kondo,
Kengo Hotate,
Masahiro Kaneko,
Mamoru Komachi
Abstract:
Neural machine translation (NMT) has recently gained widespread attention because of its high translation accuracy. However, it shows poor performance in the translation of long sentences, which is a major issue in low-resource languages. It is assumed that this issue is caused by insufficient number of long sentences in the training data. Therefore, this study proposes a simple data augmentation method to handle long sentences. In this method, we use only the given parallel corpora as the training data and generate long sentences by concatenating two sentences. Based on the experimental results, we confirm improvements in long sentence translation by the proposed data augmentation method, despite its simplicity. Moreover, the translation quality is further improved by the proposed method, when combined with back-translation.
Submitted 17 April, 2021;
originally announced April 2021.
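The augmentation described in the abstract above (building long training sentences by concatenating two sentences from the given parallel corpus) can be sketched in a few lines. The abstract does not say how the two sentences are chosen or joined; random sampling, the single-space separator, and the function name `concat_augment` are assumptions made here for illustration.

```python
import random


def concat_augment(pairs, n_new, sep=" ", seed=0):
    """Create n_new pseudo-parallel pairs by concatenating two sampled pairs.

    pairs: list of (source_sentence, target_sentence) tuples.
    Random pairing and the separator are assumptions, not the paper's recipe.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        (src_a, tgt_a), (src_b, tgt_b) = rng.sample(pairs, 2)
        # Keep source and target in the same order so the pair stays aligned.
        augmented.append((src_a + sep + src_b, tgt_a + sep + tgt_b))
    return augmented


# Toy corpus with made-up sentence pairs.
corpus = [
    ("guten morgen .", "good morning ."),
    ("danke .", "thank you ."),
    ("wie geht es dir ?", "how are you ?"),
]
extra = concat_augment(corpus, n_new=4)
training_data = corpus + extra
```

The augmented pairs are appended to the original corpus rather than replacing it, matching the abstract's statement that only the given parallel corpora are used as training data.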
-
Comparison of Grammatical Error Correction Using Back-Translation Models
Authors:
Aomi Koyama,
Kengo Hotate,
Masahiro Kaneko,
Mamoru Komachi
Abstract:
Grammatical error correction (GEC) suffers from a lack of sufficient parallel data. Therefore, GEC studies have developed various methods to generate pseudo data, which comprise pairs of grammatical and artificially produced ungrammatical sentences. Currently, a mainstream approach to generate pseudo data is back-translation (BT). Most previous GEC studies using BT have employed the same architecture for both GEC and BT models. However, GEC models have different correction tendencies depending on their architectures. Thus, in this study, we compare the correction tendencies of the GEC models trained on pseudo data generated by different BT models, namely, Transformer, CNN, and LSTM. The results confirm that the correction tendencies for each error type are different for every BT model. Additionally, we examine the correction tendencies when using a combination of pseudo data generated by different BT models. As a result, we find that the combination of different BT models improves or interpolates the F_0.5 scores of each error type compared with that of single BT models with different seeds.
Submitted 15 April, 2021;
originally announced April 2021.
-
Unmasking the Mask -- Evaluating Social Biases in Masked Language Models
Authors:
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Masked Language Models (MLMs) have shown superior performances in numerous downstream NLP tasks when used as text encoders. Unfortunately, MLMs also demonstrate significantly worrying levels of social biases. We show that the previously proposed evaluation metrics for quantifying the social biases in MLMs are problematic due to the following reasons: (1) the prediction accuracy of the masked tokens itself tends to be low in some MLMs, which raises questions regarding the reliability of the evaluation metrics that use the (pseudo) likelihood of the predicted tokens, (2) the correlation between the prediction accuracy of the mask and the performance in downstream NLP tasks is not taken into consideration, and (3) high frequency words in the training data are masked more often, introducing noise due to this selection bias in the test cases. To overcome the above-mentioned disfluencies, we propose All Unmasked Likelihood (AUL), a bias evaluation measure that predicts all tokens in a test case given the MLM embedding of the unmasked input. We find that AUL accurately detects different types of biases in MLMs. We also propose AUL with attention weights (AULA) to evaluate tokens based on their importance in a sentence. However, unlike AUL and AULA, previously proposed bias evaluation measures for MLMs systematically overestimate the measured biases, and are heavily influenced by the unmasked tokens in the context.
Submitted 15 April, 2021;
originally announced April 2021.
-
Simultaneous Multi-Pivot Neural Machine Translation
Authors:
Raj Dabre,
Aizhan Imankulova,
Masahiro Kaneko,
Abhisek Chakrabarty
Abstract:
Parallel corpora are indispensable for training neural machine translation (NMT) models, and parallel corpora for most language pairs do not exist or are scarce. In such cases, pivot language NMT can be helpful where a pivot language is used such that there exist parallel corpora between the source and pivot and pivot and target languages. Naturally, the quality of pivot language translation is inferior to what could be achieved with a direct parallel corpus of a reasonable size for that pair. In a real-time simultaneous translation setting, the quality of pivot language translation deteriorates even further given that the model has to output translations the moment a few source words become available. To solve this issue, we propose multi-pivot translation and apply it to a simultaneous translation setting involving pivot languages. Our approach involves simultaneously translating a source language into multiple pivots, which are then simultaneously translated together into the target language by leveraging multi-source NMT. Our experiments in a low-resource setting using the N-way parallel UN corpus for Arabic to English NMT via French and Spanish as pivots reveal that in a simultaneous pivot NMT setting, using two pivot languages can lead to an improvement of up to 5.8 BLEU.
Submitted 15 April, 2021;
originally announced April 2021.
-
Dictionary-based Debiasing of Pre-trained Word Embeddings
Authors:
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Word embeddings trained on large corpora have been shown to encode high levels of unfair discriminatory gender, racial, religious and ethnic biases.
In contrast, human-written dictionaries describe the meanings of words in a concise, objective and unbiased manner.
We propose a method for debiasing pre-trained word embeddings using dictionaries, without requiring access to the original training resources or any knowledge regarding the word embedding algorithms used.
Unlike prior work, our proposed method does not require the types of biases to be pre-defined in the form of word lists, and learns the constraints that must be satisfied by unbiased word embeddings automatically from dictionary definitions of the words.
Specifically, we learn an encoder to generate a debiased version of an input word embedding such that it
(a) retains the semantics of the pre-trained word embeddings,
(b) agrees with the unbiased definition of the word according to the dictionary, and
(c) remains orthogonal to the vector space spanned by any biased basis vectors in the pre-trained word embedding space.
Experimental results on standard benchmark datasets show that the proposed method can accurately remove unfair biases encoded in pre-trained word embeddings, while preserving useful semantics.
Submitted 23 January, 2021;
originally announced January 2021.
-
Debiasing Pre-trained Contextualised Embeddings
Authors:
Masahiro Kaneko,
Danushka Bollegala
Abstract:
In comparison to the numerous debiasing methods proposed for the static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method that can be applied at token- or sentence-levels to debias pre-trained contextualised embeddings. Our proposed method can be applied to any pre-trained contextualised embedding model, without requiring to retrain those models. Using gender bias as an illustrative example, we then conduct a systematic study using several state-of-the-art (SoTA) contextualised representations on multiple benchmark datasets to evaluate the level of biases encoded in different contextualised embeddings before and after debiasing using the proposed method. We find that applying token-level debiasing for all tokens and across all layers of a contextualised embedding model produces the best performance. Interestingly, we observe that there is a trade-off between creating an accurate vs. unbiased contextualised embedding model, and different contextualised embedding models respond differently to this trade-off.
Submitted 23 January, 2021;
originally announced January 2021.
-
Autoencoding Improves Pre-trained Word Embeddings
Authors:
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that by centering and projecting using principal component vectors one can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically, this post-processing step is equivalent to applying a linear autoencoder to minimise the squared l2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labelled data.
Submitted 27 October, 2020; v1 submitted 25 October, 2020;
originally announced October 2020.
-
A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
Authors:
Masato Mita,
Shun Kiyono,
Masahiro Kaneko,
Jun Suzuki,
Kentaro Inui
Abstract:
Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models, and outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method, and found that our approach leads to improved coverage of corrections and facilitated fluency edits which are reflected in higher recall and overall performance.
Submitted 7 October, 2020;
originally announced October 2020.
-
Energy Efficient Resource Allocation Optimization in Fog Radio Access Networks with Outdated Channel Knowledge
Authors:
Thi Ha Ly Dinh,
Megumi Kaneko,
Ellen Hidemi Fukuda,
Lila Boukhatem
Abstract:
Fog Radio Access Networks (F-RAN) are gaining worldwide interest for enabling mobile edge computing for Beyond 5G. However, to realize future real-time and delay-sensitive applications, F-RAN tailored radio resource allocation and interference management become necessary. This work investigates user association and beamforming issues for providing energy efficient F-RANs. We formulate the energy efficiency maximization problem, where the F-RAN specific constraint to guarantee local edge processing is explicitly considered. To solve this intricate problem, we design an algorithm based on the Augmented Lagrangian (AL) method. Then, to alleviate the computational complexity, a heuristic low-complexity strategy is developed, where the tasks are split into two parts: one solving for user association and Fog Access Points (F-AP) activation in a centralized manner at the cloud, based on global but outdated user Channel State Information (CSI) to account for fronthaul delays, and the second solving for beamforming in a distributed manner at each active F-AP based on perfect but local CSI. Simulation results show that the proposed heuristic method achieves an appreciable performance level as compared to the AL-based method, while largely outperforming the energy efficiency of the baseline F-RAN scheme and limiting the sum-rate degradation compared to the optimized sum-rate maximization algorithm.
Submitted 25 September, 2020;
originally announced September 2020.
-
Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction
Authors:
Masahiro Kaneko,
Masato Mita,
Shun Kiyono,
Jun Suzuki,
Kentaro Inui
Abstract:
This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect because the previous common methods for incorporating an MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can be considerably different (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed in the previous methods. Our experiments show that our proposed method, where we first fine-tune an MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performance on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: https://github.com/kanekomasahiro/bert-gec.
Submitted 31 May, 2020; v1 submitted 3 May, 2020;
originally announced May 2020.
-
Towards Multimodal Simultaneous Neural Machine Translation
Authors:
Aizhan Imankulova,
Masahiro Kaneko,
Tosho Hirasawa,
Mamoru Komachi
Abstract:
Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding in multiple languages. This task is significantly more challenging than the general full sentence translation because of the shortage of input information during decoding. To alleviate this shortage, we propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality. Our experiments with the Multi30k dataset showed that MSNMT significantly outperforms its text-only counterpart in more timely translation situations with low latency. Furthermore, we verified the importance of visual information during decoding by performing an adversarial evaluation of MSNMT, where we studied how models behaved with incongruent input modality and analyzed the effect of different word order between source and target languages.
Submitted 23 October, 2020; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Power and Beam Optimization for Uplink Millimeter-Wave Hotspot Communication Systems
Authors:
Rafail Ismayilov,
Bernd Holfeld,
Renato L. G. Cavalcante,
Megumi Kaneko
Abstract:
We propose an effective interference management and beamforming mechanism for uplink communication systems that yields fair allocation of rates. In particular, we consider a hotspot area of a millimeter-wave (mmWave) access network consisting of multiple user equipment (UE) in the uplink and multiple access points (APs) with directional antennas and adjustable beam widths and directions (beam configurations). This network suffers tremendously from multi-beam multi-user interference, and, to improve the uplink transmission performance, we propose a centralized scheme that optimizes the power, the beam width, the beam direction of the APs, and the UE-AP assignments. This problem involves both continuous and discrete variables, and it has the following structure. If we fix all discrete variables, except for those related to the UE-AP assignment, the resulting optimization problem can be solved optimally. This property enables us to propose a heuristic based on simulated annealing (SA) to address the intractable joint optimization problem with all discrete variables. In more detail, for a fixed configuration of beams, we formulate a weighted rate allocation problem where each user gets the same portion of its maximum achievable rate that it would have under non-interfered conditions. We solve this problem with an iterative fixed point algorithm that optimizes the power of UEs and the UE-AP assignment in the uplink. This fixed point algorithm is combined with SA to improve the beam configurations. Theoretical and numerical results show that the proposed method improves both the UE rates in the lower percentiles and the overall fairness in the network.
Submitted 9 August, 2021; v1 submitted 29 August, 2019;
originally announced August 2019.
-
Gender-preserving Debiasing for Pre-trained Word Embeddings
Authors:
Masahiro Kaneko,
Danushka Bollegala
Abstract:
Word embeddings learnt from massive text collections have demonstrated significant levels of discriminative biases such as gender, racial or ethnic biases, which in turn bias the down-stream NLP applications that use those word embeddings. Taking gender-bias as a working example, we propose a debiasing method that preserves non-discriminative gender-related information, while removing stereotypical discriminative gender biases from pre-trained word embeddings. Specifically, we consider four types of information: feminine, masculine, gender-neutral and stereotypical, which represent the relationship between gender vs. bias, and propose a debiasing method that (a) preserves the gender-related information in feminine and masculine words, (b) preserves the neutrality in gender-neutral words, and (c) removes the biases from stereotypical words. Experimental results on several previously proposed benchmark datasets show that our proposed method can debias pre-trained word embeddings better than existing SoTA methods proposed for debiasing word embeddings while preserving gender-related but non-discriminative information.
Submitted 3 June, 2019;
originally announced June 2019.
-
TriDepth: Triangular Patch-based Deep Depth Prediction
Authors:
Masaya Kaneko,
Ken Sakurada,
Kiyoharu Aizawa
Abstract:
We propose a novel and efficient representation for single-view depth estimation using Convolutional Neural Networks (CNNs). A point cloud is generally used for CNN-based 3D scene reconstruction; however, it has some drawbacks: (1) it is redundant as a representation for planar surfaces, and (2) no spatial relationships between points are available (e.g., texture and surface). As a more efficient representation, we introduce a triangular-patch-cloud, which represents the surface of the 3D structure using a set of triangular patches, and propose a CNN framework for its 3D structure estimation. In our framework, we create it by separating all the faces in a 2D mesh, which are determined adaptively from the input image, and estimate depths and normals of all the faces. Using a common RGBD dataset, we show that our representation achieves performance better than or comparable to that of the existing point-cloud-based methods, although it has far fewer parameters.
Submitted 11 March, 2020; v1 submitted 3 May, 2019;
originally announced May 2019.
-
Joint Allocation Strategies of Power and Spreading Factors with Imperfect Orthogonality in LoRa Networks
Authors:
Licia Amichi,
Megumi Kaneko,
Ellen Hidemi Fukuda,
Nancy El Rachkidy,
Alexandre Guitton
Abstract:
The LoRa physical layer is one of the most promising Low Power Wide-Area Network (LPWAN) technologies for future Internet of Things (IoT) applications. It provides a flexible adaptation of coverage and data rate by allocating different Spreading Factors (SFs) and transmit powers to end-devices. We focus on improving throughput fairness while reducing energy consumption. Whereas most existing methods assume perfect SF orthogonality and ignore the harmful effects of inter-SF interference, we formulate a joint SF and power allocation problem to maximize the minimum uplink throughput of end-devices, subject to co-SF and inter-SF interference, and power constraints. This results in a mixed-integer non-linear optimization problem, which, for tractability, is split into two sub-problems: firstly, the SF assignment for fixed transmit powers, and secondly, the power allocation given the previously obtained assignment solution. For the first sub-problem, we propose a low-complexity many-to-one matching algorithm between SFs and end-devices. For the second one, given its intractability, we transform it using two types of constraint approximation: a linearized and a quadratic version. Our performance evaluation demonstrates that the proposed joint SF allocation and power optimization drastically enhances various performance objectives such as throughput, fairness and power consumption, and that it outperforms baseline schemes.
Submitted 25 April, 2019;
originally announced April 2019.
-
Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection
Authors:
Masahiro Kaneko,
Mamoru Komachi
Abstract:
It is known that a deep neural network model pre-trained with large-scale data greatly improves the accuracy of various tasks, especially when there are resource constraints. However, the information needed to solve a given task can vary, and simply using the output of the final layer is not necessarily sufficient. Moreover, to our knowledge, exploiting large language representation models to detect grammatical errors has not yet been studied. In this work, we investigate the effect of utilizing information not only from the final layer but also from intermediate layers of a pre-trained language representation model to detect grammatical errors. We propose a multi-head multi-layer attention model that determines the appropriate layers in Bidirectional Encoder Representations from Transformers (BERT). The proposed method achieved the best scores on three datasets for grammatical error detection tasks, outperforming the current state-of-the-art method by 6.0 points on FCE, 8.2 points on CoNLL14, and 12.2 points on JFLEG in terms of F_0.5. We also demonstrate that by using multi-head multi-layer attention, our model can exploit a broader range of information for each token in a sentence than a model that uses only the final layer's information.
Submitted 15 April, 2019;
originally announced April 2019.
-
Cross-Corpora Evaluation and Analysis of Grammatical Error Correction Models --- Is Single-Corpus Evaluation Enough?
Authors:
Masato Mita,
Tomoya Mizumoto,
Masahiro Kaneko,
Ryo Nagata,
Kentaro Inui
Abstract:
This study explores the necessity of performing cross-corpora evaluation for grammatical error correction (GEC) models. GEC models have been previously evaluated based on a single commonly applied corpus: the CoNLL-2014 benchmark. However, the evaluation remains incomplete because the task difficulty varies depending on the test corpus and conditions such as the proficiency levels of the writers and essay topics. To overcome this limitation, we evaluate the performance of several GEC models, including NMT-based (LSTM, CNN, and transformer) and an SMT-based model, against various learner corpora (CoNLL-2013, CoNLL-2014, FCE, JFLEG, ICNALE, and KJ). Evaluation results reveal that the models' rankings vary considerably depending on the corpus, indicating that single-corpus evaluation is insufficient for GEC models.
Submitted 5 April, 2019;
originally announced April 2019.
-
Interference Management in NOMA-based Fog-Radio Access Networks via Joint Scheduling and Power Adaptation
Authors:
Itsikiantsoa Randrianantenaina,
Megumi Kaneko,
Hayssam Dahrouj,
Hesham ElSawy,
Mohamed-Slim Alouini
Abstract:
Non-Orthogonal Multiple Access (NOMA) and Fog Radio Access Networks (FRAN) are promising candidates within the 5G and beyond systems. This work examines the benefit of adopting NOMA in an FRAN architecture with constrained capacity fronthaul. The paper proposes methods for optimizing joint scheduling and power adaptation in the downlink of a NOMA-based FRAN with multiple resource blocks (RB). We consider a mixed-integer optimization problem which maximizes a network-wide rate-based utility function subject to fronthaul-capacity constraints, so as to determine i) the user-to-RB assignment, ii) the allocated power to each RB, and iii) the power split levels of the NOMA users in each RB. The paper proposes a feasible decoupled solution for this non-convex optimization problem using a three-step hybrid centralized/distributed approach. The proposed solution complies with FRAN operation that aims to partially shift the network control to the FAPs, so as to overcome delays due to fronthaul rate constraints. The paper proposes and compares two distinct methods for solving the assignment problem, namely the Hungarian method and the Multiple Choice Knapsack method. The power allocation and the NOMA power split optimization, on the other hand, are solved using the alternating direction method of multipliers (ADMM). Simulation results illustrate the advantages of the proposed methods compared to different baseline schemes including the conventional Orthogonal Multiple Access (OMA), for different utility functions and different network environments.
Submitted 27 February, 2019;
originally announced February 2019.
-
DeepSaucer: Unified Environment for Verifying Deep Neural Networks
Authors:
Naoto Sato,
Hironobu Kuruma,
Masanori Kaneko,
Yuichiroh Nakagawa,
Hideto Ogawa,
Thai Son Hoang,
Michael Butler
Abstract:
In recent years, a number of methods for verifying DNNs have been developed. Because the approaches of the methods differ and have their own limitations, we think that a number of verification methods should be applied to a developed DNN. To apply a number of methods to the DNN, it is necessary to translate either the implementation of the DNN or the verification method so that one runs in the same environment as the other. Since those translations are time-consuming, a utility tool, named DeepSaucer, which helps to retain and reuse implementations of DNNs, verification methods, and their environments, is proposed. In DeepSaucer, code snippets for loading DNNs, running verification methods, and creating their environments are retained and reused as software assets in order to reduce the cost of verifying DNNs. The feasibility of DeepSaucer is confirmed by implementing it on the basis of Anaconda, which provides a virtual environment for loading a DNN and running a verification method. In addition, the effectiveness of DeepSaucer is demonstrated by use-case examples.
Submitted 8 November, 2018;
originally announced November 2018.
-
SS5G: Collision Resolution Protocol for Delay and Energy Efficient LoRa Networks
Authors:
Nancy El Rachkidy,
Alexandre Guitton,
Megumi Kaneko
Abstract:
Future 5G and Internet of Things (IoT) applications will heavily rely on long-range communication technologies such as low-power wide-area networks (LPWANs). In particular, LoRaWAN, built on the LoRa physical layer, is attracting increasing interest from both academia and industry for enabling low-cost, energy efficient IoT wireless sensor networks for, e.g., environmental monitoring over wide areas. While its communication range may go up to 20 kilometers, the achievable bit rates in LoRaWAN are limited to a few kilobits per second. In the event of collisions, the perceived rate is further reduced due to packet loss and retransmissions. Firstly, to alleviate the harmful impacts of collisions, we propose a decoding algorithm that can resolve several superposed LoRa signals. Our proposed method exploits the slight desynchronization of superposed signals and specific features of the LoRa physical layer. Secondly, we design a full MAC protocol enabling collision resolution. The simulation results demonstrate that the proposed method outperforms conventional LoRaWAN jointly in terms of system throughput, energy efficiency, and delay. These results show that our scheme is well suited for 5G and IoT systems, as one of their major goals is to provide the best trade-off among these performance objectives.
Submitted 21 September, 2018;
originally announced September 2018.
-
Topology Control for Energy-Efficient Localization in Mobile Underwater Sensor Networks using Stackelberg Game
Authors:
Yali Yuan,
Chencheng Liang,
Megumi Kaneko,
Xu Chen,
Dieter Hogrefe
Abstract:
The characteristics of mobile Underwater Sensor Networks (UWSNs), such as low communication bandwidth, large propagation delay, and sparse deployment, pose challenging issues for successful localization of sensor nodes. In addition, sensor nodes in UWSNs are usually powered by batteries whose replacements introduce high cost and complexity. Thus, the critical problem in UWSNs is to enable each sensor node to find enough anchor nodes in order to localize itself, with minimum energy costs. In this paper, an Energy-Efficient Localization Algorithm (EELA) is proposed to analyze the decentralized interactions among sensor nodes and anchor nodes. A Single-Leader-Multi-Follower Stackelberg game is utilized to formulate the topology control problem of sensor nodes and anchor nodes by exploiting their available communication opportunities. In this game, the sensor node acts as a leader taking into account factors such as 'two-hop' anchor nodes and energy consumption, while anchor nodes act as multiple followers, considering their ability to localize sensor nodes and their energy consumption. We prove that both players select best responses and reach a socially optimal Stackelberg Nash Equilibrium. Simulation results demonstrate that the proposed EELA improves the performance of localization in UWSNs significantly, and in particular the energy cost of sensor nodes. Compared to the baseline schemes, the energy consumption per node is about 48% lower in EELA, while providing a desirable localization coverage, under reasonable error and delay.
Submitted 31 May, 2018;
originally announced May 2018.
-
Decoding Superposed LoRa Signals
Authors:
Nancy El Rachkidy,
Alexandre Guitton,
Megumi Kaneko
Abstract:
Long-range low-power wireless communications, such as LoRa, are used in many IoT and environmental monitoring applications. They typically increase the communication range to several kilometers, at the cost of reducing the bitrate to a few bits per second. Collisions further reduce the performance of these communications. In this paper, we propose two algorithms to decode colliding signals: one algorithm requires the transmitters to be slightly desynchronized, and the other requires the transmitters to be synchronized. To do so, we use the timing information to match the correct symbols to the correct transmitters. We show that our algorithms are able to significantly improve the overall throughput of LoRa.
Submitted 21 March, 2018;
originally announced April 2018.