-
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
Authors:
Sho Hoshino,
Akihiko Kato,
Soichiro Murakami,
Peinan Zhang
Abstract:
Learning better sentence embeddings leads to improved performance for natural language understanding tasks including semantic textual similarity (STS) and natural language inference (NLI). As prior studies leverage large-scale labeled NLI datasets for fine-tuning masked language models to yield sentence embeddings, task performance for languages other than English is often left behind. In this study, we directly compared two data augmentation techniques as potential solutions for monolingual STS: (a) cross-lingual transfer that exploits English resources alone as training data to yield non-English sentence embeddings as zero-shot inference, and (b) machine translation that converts English data into pseudo non-English training data in advance. In our experiments on monolingual STS in Japanese and Korean, we find that the two data augmentation techniques perform on par. Rather, we find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data. Combining our findings, we demonstrate that the cross-lingual transfer of Wikipedia data exhibits improved performance, and that native Wikipedia data can further improve performance for monolingual STS.
Submitted 8 March, 2024;
originally announced March 2024.
-
Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation
Authors:
Masato Mita,
Soichiro Murakami,
Akihiko Kato,
Peinan Zhang
Abstract:
In response to the limitations of manual ad creation, significant research has been conducted in the field of automatic ad text generation (ATG). However, the lack of comprehensive benchmarks and well-defined problem sets has made comparing different methods challenging. To tackle these challenges, we standardize the task of ATG and propose a first benchmark dataset, CAMERA, carefully designed to enable the use of multi-modal information and to facilitate industry-wise evaluations. Our extensive experiments with nine baselines, from classical methods to state-of-the-art models including large language models (LLMs), show the current state and the remaining challenges. We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
Submitted 17 June, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Natural Language Generation for Advertising: A Survey
Authors:
Soichiro Murakami,
Sho Hoshino,
Peinan Zhang
Abstract:
Natural language generation methods have emerged as effective tools to help advertisers increase the number of online advertisements they produce. This survey reviews the research trends on this topic over the past decade, from template-based to extractive and abstractive approaches using neural networks. Additionally, key challenges and directions revealed through the survey, including metric optimization, faithfulness, diversity, multimodality, and the development of benchmark datasets, are discussed.
Submitted 22 June, 2023;
originally announced June 2023.
-
Aspect-based Analysis of Advertising Appeals for Search Engine Advertising
Authors:
Soichiro Murakami,
Peinan Zhang,
Sho Hoshino,
Hidetaka Kamigaito,
Hiroya Takamura,
Manabu Okumura
Abstract:
Writing an ad text that attracts people and persuades them to click or act is essential for the success of search engine advertising. Therefore, ad creators must consider various aspects of advertising appeals (A$^3$) such as the price, product features, and quality. However, the A$^3$ that are effective for products and services differ across industries. In this work, we focus on exploring the effective A$^3$ for different industries with the aim of assisting the ad creation process. To this end, we created a dataset of advertising appeals and used an existing model that detects various aspects in ad texts. Our experiments demonstrated that different industries have their own effective A$^3$ and that the identification of the A$^3$ contributes to the estimation of advertising performance.
Submitted 25 April, 2022;
originally announced April 2022.
-
Coordinate descent heuristics for the irregular strip packing problem of rasterized shapes
Authors:
Shunji Umetani,
Shohei Murakami
Abstract:
We consider the irregular strip packing problem of rasterized shapes, where a given set of pieces of irregular shapes represented in pixels should be placed into a rectangular container without overlap. The rasterized shapes provide simple procedures for the intersection test without any exceptional handling due to geometric issues, while they often require much memory and computational effort at high resolution. To reduce the complexity of rasterized shapes, we propose a representation based on a pair of scanlines, called the double scanline representation, that merges consecutive pixels in each row and column into strips with unit width, respectively. Based on this, we develop coordinate descent heuristics for the raster model that repeat a line search in the horizontal and vertical directions alternately, where we also introduce a corner detection technique used in computer vision to reduce the search space. Computational results for test instances show that the proposed algorithm obtains sufficiently dense layouts of rasterized shapes at high resolution within a reasonable computation time.
Submitted 22 March, 2022; v1 submitted 9 April, 2021;
originally announced April 2021.
-
NTT's Machine Translation Systems for WMT19 Robustness Task
Authors:
Soichiro Murakami,
Makoto Morishita,
Tsutomu Hirao,
Masaaki Nagata
Abstract:
This paper describes NTT's submission to the WMT19 robustness task. This task mainly focuses on translating noisy text (e.g., posts on Twitter), which presents different difficulties from typical translation tasks such as news. Our submission combined techniques including utilization of a synthetic corpus, domain adaptation, and a placeholder mechanism, which significantly improved over the previous baseline. Experimental results revealed that the placeholder mechanism, which temporarily replaces non-standard tokens including emojis and emoticons with special placeholder tokens during translation, improves translation accuracy even with noisy texts.
Submitted 8 July, 2019;
originally announced July 2019.