-
WikiSplit++: Easy Data Refinement for Split and Rephrase
Authors:
Hayato Tsukagoshi,
Tsutomu Hirao,
Makoto Morishita,
Katsuki Chousa,
Ryohei Sasano,
Koichi Takeda
Abstract:
The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
Submitted 13 April, 2024;
originally announced April 2024.
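The refinement described in the abstract above (drop instances where the complex sentence fails to entail some reference simple sentence, then reverse the order of the remaining references) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `refine`, `toy_entails`, and the `(complex, [simple, ...])` instance layout are hypothetical names chosen here, and the token-subset check is only a stand-in for a real NLI classifier.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical instance layout: (complex sentence, reference simple sentences).
Instance = Tuple[str, List[str]]


def refine(
    instances: Sequence[Instance],
    entails: Callable[[str, str], bool],
) -> List[Instance]:
    """Filter by entailment, then reverse the reference order."""
    refined: List[Instance] = []
    for complex_sentence, simple_sentences in instances:
        # Drop the instance if at least one simple sentence is NOT entailed.
        if not all(entails(complex_sentence, s) for s in simple_sentences):
            continue
        # Keep the instance with its reference simple sentences reversed.
        refined.append((complex_sentence, list(reversed(simple_sentences))))
    return refined


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model (assumption): token-subset check only."""
    premise_tokens = set(premise.lower().rstrip(".").split())
    hypothesis_tokens = set(hypothesis.lower().rstrip(".").split())
    return hypothesis_tokens <= premise_tokens


data = [
    ("a b c d.", ["a b.", "c d."]),  # both references are "entailed"
    ("a b.", ["a b.", "x y."]),      # second reference is not: dropped
]
result = refine(data, toy_entails)
```

In practice `entails` would wrap a trained NLI classifier; passing it in as a callable keeps the filtering logic independent of that choice.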
-
Improving Sentence Embeddings with an Automatically Generated NLI Dataset
Authors:
Soma Sato,
Hayato Tsukagoshi,
Ryohei Sasano,
Koichi Takeda
Abstract:
Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL relies heavily on fine-tuning with a manually annotated natural language inference (NLI) dataset. We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with respect to human evaluation, thus outperforming existing methods without using large, manually annotated datasets.
Submitted 23 February, 2024;
originally announced February 2024.
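The evaluation metric named in the abstract above, Spearman's rank correlation between model similarity scores and human ratings, can be computed as the Pearson correlation of the two rank vectors. The sketch below is a generic pure-Python illustration rather than the authors' evaluation code; the function names are hypothetical, and the scaling of the coefficient by 100 is assumed from the reported value of 82.21.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only: model cosine similarities vs. human ratings.
model_scores = [0.91, 0.15, 0.62, 0.40]
human_scores = [4.8, 0.5, 3.9, 2.0]
rho_times_100 = 100 * spearman(model_scores, human_scores)
```

Because only ranks matter, any monotonic rescaling of the model scores leaves the coefficient unchanged.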
-
Japanese SimCSE Technical Report
Authors:
Hayato Tsukagoshi,
Ryohei Sasano,
Koichi Takeda
Abstract:
We report the development of Japanese SimCSE, Japanese sentence embedding models fine-tuned with SimCSE. Since there is a lack of sentence embedding models for Japanese that can be used as a baseline in sentence embedding research, we conducted extensive experiments on Japanese sentence embeddings involving 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets. In this report, we provide the detailed training setup for Japanese SimCSE and their evaluation results.
Submitted 30 October, 2023;
originally announced October 2023.
-
Sentence Representations via Gaussian Embedding
Authors:
Shohei Yoda,
Hayato Tsukagoshi,
Ryohei Sasano,
Koichi Takeda
Abstract:
Recent progress in sentence embedding, which represents the meaning of a sentence as a point in a vector space, has achieved high performance on tasks such as semantic textual similarity (STS). However, representing a sentence as a point in a vector space can express only part of the diverse information that sentences carry; for example, it cannot capture asymmetric relationships between sentences. This paper proposes GaussCSE, a Gaussian distribution-based contrastive learning framework for sentence embedding that can handle asymmetric relationships between sentences, along with a similarity measure for identifying inclusion relations. Our experiments show that GaussCSE achieves the same performance as previous methods in natural language inference tasks, and is able to estimate the direction of entailment relations, which is difficult with point representations.
Submitted 20 February, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Soft Pneumatic Actuator Capable of Generating Various Bending and Extension Motions Inspired by an Elephant Trunk
Authors:
Peizheng Yuan,
Hideyuki Tsukagoshi
Abstract:
Inspired by the dexterous handling ability of an elephant's trunk, we propose a pneumatic actuator that generates diverse bending and extension motions in a flexible arm. The actuator consists of two flexible tubes. Each flexible tube is restrained by a single string with variable length and tilt angle. Although a single tube can perform only three simple types of motion (bending, extension, and helical), a variety of complex bending patterns can be created by arranging a pair of tubes in parallel and making the restraint variable. This performance takes advantage of the superposition of forces, with the two tubes arranged so that they constructively interfere with each other. This paper describes six resulting pose patterns. First, the configuration and operating principle are described, and the fabrication method is explained. Next, two mathematical models and four finite element method-based analyses are introduced to predict the tip position changes in five motion patterns. All the models were validated through experiments. Finally, we experimentally demonstrate that the prototype SEMI-TRUNK can realize the action of grabbing a bottle and pouring water, verifying the effectiveness of the proposed method.
Submitted 21 February, 2023;
originally announced February 2023.
-
Comparison and Combination of Sentence Embeddings Derived from Different Supervision Signals
Authors:
Hayato Tsukagoshi,
Ryohei Sasano,
Koichi Takeda
Abstract:
There have been many successful applications of sentence embedding methods. However, it has not been well understood what properties are captured in the resulting sentence embeddings depending on the supervision signals. In this paper, we focus on two types of sentence embedding methods with similar architectures and tasks and investigate their properties: one fine-tunes pre-trained language models on the natural language inference task, and the other fine-tunes pre-trained language models on the task of predicting a word from its definition sentence. Specifically, we compare their performance on semantic textual similarity (STS) tasks using STS datasets partitioned from two perspectives, 1) sentence source and 2) superficial similarity of the sentence pairs, and we also compare their performance on downstream and probing tasks. Furthermore, we attempt to combine the two methods and demonstrate that the combination yields substantially better performance than either method alone on unsupervised STS tasks and downstream tasks.
Submitted 10 June, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
DefSent: Sentence Embeddings using Definition Sentences
Authors:
Hayato Tsukagoshi,
Ryohei Sasano,
Koichi Takeda
Abstract:
Sentence embedding methods using natural language inference (NLI) datasets have been successfully applied to various tasks. However, these methods are available only for a limited number of languages because they rely heavily on large NLI datasets. In this paper, we propose DefSent, a sentence embedding method that uses definition sentences from a word dictionary. Since dictionaries are available for many languages, DefSent is more broadly applicable than methods using NLI datasets and does not require constructing additional datasets. We demonstrate that, compared with methods using large NLI datasets, DefSent performs comparably on unsupervised semantic textual similarity (STS) tasks and slightly better on SentEval tasks. Our code is publicly available at https://github.com/hpprc/defsent.
Submitted 9 June, 2021; v1 submitted 10 May, 2021;
originally announced May 2021.