Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization

Yanyue Zhang ${}^{\spadesuit}$, Pengfei Li ${}^{\spadesuit}$, Yilong Lai ${}^{\spadesuit}$, Deyu Zhou ${}^{\spadesuit}$ (corresponding author) and Yulan He ${}^{\heartsuit}$

${}^{\spadesuit}$ School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China

${}^{\heartsuit}$ Department of Informatics, King’s College London

${}^{\heartsuit}$ The Alan Turing Institute

{yanyuez98,lip.f,yilong.lai,d.zhou}@seu.edu.cn, [email protected]

Abstract

As more than 70$\%$ of reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are reluctant to generate negative summaries even when given negative input texts. To address such sentiment bias, a direct approach that does not over-rely on a specific framework is to generate additional data with large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) potential issues or toxicity in the augmented data; 2) high cost. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small number of synthesized negative reviews is obtained by rewriting positive text via a large language model. Then, a disentangled reconstruction model is trained on the generated data. After training, a large amount of synthetic data can be obtained by decoding new representations formed by combining the representations of different samples, and then filtering the outputs based on perplexity and sentiment classification. Experiments show that our framework alleviates emotional bias as effectively as using only large models, but more economically.


1 Introduction

With the unprecedented development of online interactive platforms, reviews on shopping platforms or social media become an important information source for manufacturers to make decisions. To cope with the flood of reviews, opinion summarization has received significant interest in natural language processing communities. Unlike other summarization tasks for news, Wikipedia, and medical treatment records, opinion summarization focuses on texts with user opinions and subjective emotions about an entity (e.g., a product, hotel, or restaurant). Accurately summarizing user perceptions and attitudes towards entities is a core requirement of opinion summarization.

Reviews:

① The tights are badly made and can’t last several washings (hang dry). The color is ugly, and my daughter hates $\dots$ ② $\dots$ common ballet tights. They can’t fit well and squish her toes as much as some others. $\dots$ ③ my 3 year old can’t fit into these perfectly. $\dots$ ④ Stiff fabric, runs small a though. $\dots$ ⑤ This is not my go to tight when my daughter needs new ones.

Summary by Coop:

These are great for the price. The tights are comfortable and don’t take up much space. The only thing is that they can be worn to wear with the flip flops $\dots$ (I’m not sure if you have to wear them).

Summary by Trace:

These are great for those who want to wear a small. They are very comfortable and fit well. The only problem is that they don’t last as long as some of the more expensive ones in the past. I would recommend these to anyone.

Table 1: Example summaries generated by Coop Iso et al. (2021), TRACE Zhang and Zhou (2023) for negative reviews. The red part represents negative, and the blue is positive.

However, as shown in Table 1, current opinion summarization approaches such as Coop and TRACE are reluctant to generate a negative opinion summary even when the input opinions are negative. We further conducted a quantitative analysis and found that the emotional precision of the negative summaries generated by current approaches is very limited, ranging from 10% to 55%. Such significant sentiment bias might be attributed to the extremely unbalanced sentiment distribution in the datasets. Specifically, the proportion of reviews with a rating of more than 3 (positive) is 72.26$\%$ in the Yelp dataset and 83.5$\%$ in the Amazon dataset.

Among existing bias mitigation methods, modifying the data distribution can fundamentally eliminate bias and is not limited to specific model frameworks, exhibiting strong generalization capabilities (Dixon et al., 2018; Pruksachatkun et al., 2021; Qian et al., 2022). Due to the powerful generation capabilities of large language models (LLMs), many works utilize them either as labelers to annotate unlabeled data (Yoo et al., 2021; Wang et al., 2021), or as generators to produce new data samples (Ye et al., 2022; Meng et al., 2022; Abaskohi et al., 2023; Gao et al., 2023). However, data augmentation based on LLMs has two drawbacks. 1) Potential risk: some studies have raised concerns about potential issues or toxicity in synthetic data from LLMs (Li et al., 2023b, c; Pan et al., 2023). 2) High cost: balancing the emotional distribution of a million-scale dataset requires generating a large amount of synthetic data, making direct data generation using large models potentially expensive.

Therefore, in this paper, we propose LASS, a novel framework based on both LArge and Small language models for debiaSing opinion summarization. First, a small number of synthesized negative reviews is obtained by rewriting positive text via a large language model. We design prompts to ensure that the large pre-trained language model follows the minimal-edit principle when generating counterfactual samples with opposite sentiments. Moreover, some counterfactual sample pairs from the specific datasets are manually rewritten and used as demonstrations inside the prompts fed to the generator, to ensure that the modification of aspects and emotions is synchronized and reasonable.

Second, a disentangled reconstruction model is trained on the generated data. Specifically, a disentangled autoencoder is proposed to obtain the sentiment and content representations through reconstruction, emotion, and distance constraints. Further, new representations are obtained by exchanging the sentiment representations within a pair of counterfactual samples, and decoding them back into the paired texts gives the counterfactual reconstruction loss. To further constrain the emotion information, the original emotion representation is replaced with a learnable emotion label representation, whose weight depends on the outcome of emotion classification. Finally, a large amount of synthetic data is obtained by decoding new representations formed by combining the representations of different samples, followed by filtering based on perplexity and sentiment classification.

The experimental results demonstrate that LASS achieves results comparable to using LLMs only, while requiring on average 265,000 fewer LLM-generated synthetic samples. Employing LASS for data augmentation across the three models resulted in an average increase of 36% in negative sentiment accuracy without affecting the ROUGE scores of the summaries, compared to 37% with LLMs only.

The main contributions of this paper are as follows:

• We propose LASS, a data augmentation framework combining large and small language models to alleviate emotional bias by optimizing the emotional distribution of datasets.
• We design a data reproduction method based on a disentangled reconstruction model, which generates additional data by decoding the combined new representations and filtering based on perplexity and sentiment classification.
• The experimental results demonstrate that LASS, which combines large and small models, can alleviate sentiment bias as effectively as the approach based solely on LLMs, but more economically.

2 Related Work

Figure 1: The architecture of LASS.

2.1 Opinion Summarization

Opinion summarization generally focuses on user reviews about products, hotels, restaurants, and so on. The abstractive approaches mainly utilize an encoder-decoder architecture, exploring various structures such as AE, VAE, or denoising autoencoder (DAE) (Chu and Liu, 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020; Iso et al., 2021; Zhang and Zhou, 2023). During training, these models are constrained by the objective of reconstructing the input text, and during generation, they use the average of text representations as the summary representation for decoding. Subsequent approaches aimed to enhance the controllability of generating summaries by explicitly (Suhara et al., 2020; Elsahar et al., 2021; Amplayo et al., 2021a; Ke et al., 2022) or implicitly (Amplayo et al., 2021b) modeling aspect information. Some methods also explore ways to fuse input information for summarization beyond simple averaging, utilizing techniques like composite optimization (Iso et al., 2021), Wasserstein barycenter (Song et al., 2022), or hierarchical discrete latent space (Hosking et al., 2023).

2.2 Debiasing Strategies in NLP

Bias in NLP systems can typically be categorized as internal bias and external bias (Elsafoury et al., 2023; Li et al., 2023a), depending on whether the bias is related to the training data of downstream tasks. Internal bias often pertains to issues of social fairness (Parraga et al., 2022), such as gender and racial bias, which have been identified in the embeddings of pre-trained language models (Guo et al., 2022). Existing work has attempted to address these issues through methods like adjusting pre-training data, introducing additional objectives, or post-processing.

On the other hand, external bias related to downstream tasks is often associated with task-specific features, such as entity bias in fake news detection (Zhu et al., 2022), position bias in emotion cause extraction (Yan et al., 2021), and language bias in Visual Question Answering (VQA) (Cadene et al., 2019). To mitigate these specific biases, two distinct approaches have been developed: data distribution-related and model training-related (Shah et al., 2020; Parraga et al., 2022; Li et al., 2023a). In the data distribution-related approach, efforts are made to re-sample, weight, or generate data to counteract bias (Dixon et al., 2018; Pruksachatkun et al., 2021; Qian et al., 2022). In contrast, model training-related methods explore adversarial techniques, causality (Cadene et al., 2019; Zhu et al., 2022), disentanglement, and additional auxiliary modules to mitigate bias.

3 Methodology

In this section, we describe LASS, a data augmentation method for debiasing that uses both LLMs and a small generator, a disentangled autoencoder (Dis-AE). As Figure 1 shows, the overall architecture of LASS contains three processes: pair data creation via LLMs, Dis-AE model training, and data reproduction via Dis-AE. We first employ LLMs with manual demonstrations to obtain pairs of counterfactual data. Then, the disentangled reconstruction model, Dis-AE, is trained on these paired samples, as elaborated in Section 3.2. Finally, large-scale negative reviews are generated by data reproduction based on Dis-AE.

3.1 Pair Data Creation via LLM

To avoid generating negative reviews that contain unreasonable product information, we obtain synthetic data by rewriting the original positive text. Adhering to the principle of minimal modification, synthetic data with the opposite sentiment but identical content is generated by LLMs prompted with manual demonstrations. The synthetic and original reviews then form counterfactual data pairs, which are used to train the disentangled generator.

3.1.1 Prompt Design

We first devise a foundational prompt that leverages the in-context learning capabilities of LLMs to obtain reviews with the opposite emotion. We then enhance the prompt design by incorporating human-annotated samples and revising the order of examples in the prompts.

Formally, our foundational prompt is defined as a demonstration set $P$, comprising a task instruction $D$ and $k$ demonstration examples. Thus, we have $P=\{D,s(x_{1},y_{1}),\cdots,s(x_{k},y_{k})\}$, where $s(x_{i},y_{i})$ denotes a pairwise example of emotional counterfactuals. Specifically, we define the task instruction $D$ as "Your task is to generate a counterfactual that retains internal coherence and avoids unnecessary changes." and randomly select $k$ samples from the counterfactually-augmented movie reviews dataset (Kaushik et al., 2020), where $k=5$. Furthermore, we set the temperature parameter to $T=0.2$ to encourage a more deterministic output from the language model.
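As a rough illustration of how the demonstration set $P$ could be serialized into a single prompt string, consider the sketch below. The "Original:/Counterfactual:" layout, the function name `build_prompt`, and the demonstration pair are assumptions made for illustration; the actual prompts are given in the appendix.

```python
from typing import List, Tuple

# Task instruction D, quoted from the text above.
INSTRUCTION = (
    "Your task is to generate a counterfactual that retains internal "
    "coherence and avoids unnecessary changes."
)


def build_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Serialize P = {D, s(x_1, y_1), ..., s(x_k, y_k)} plus the review to rewrite."""
    parts = [INSTRUCTION]
    for original, counterfactual in demos:
        # Assumed template: one block per demonstration pair s(x_i, y_i).
        parts.append(f"Original: {original}\nCounterfactual: {counterfactual}")
    # The query is left open so that the model completes the counterfactual.
    parts.append(f"Original: {query}\nCounterfactual:")
    return "\n\n".join(parts)


# Made-up demonstration pair; the paper samples k = 5 pairs from Kaushik et al. (2020).
demos = [("The food was great.", "The food was terrible.")]
prompt = build_prompt(
    demos, "Jose's bandana must be giving him superpowers when he's cooking!!"
)
```

The resulting string would then be sent to the LLM with the temperature set to 0.2, as stated above.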

The foundational prompts are already capable of enabling LLMs to flexibly generate counterfactuals, for example, when given the input "Jose’s bandana must be giving him superpowers when he’s cooking!!", the model generates the counterfactual as "maybe Jose’s bandana is covering his eyes when he’s cooking!!". However, there are still shortcomings in its performance, specifically manifested as incomplete transformations, where some positive text is retained, and illogical text (such as "disliking but frequently visiting"). Therefore, specific examples from corresponding datasets should be added to the data-specific prompt. We started with a collection of rewrite failure examples based on manual evaluations. Then a small evaluation dataset $\mathcal{I}$ is constructed via random selection, which consists of $m$ raw reviews with transformation issues and $n$ reviews corresponding to reasonable transformations.

Afterward, we iteratively improve the prompt by observing the success rate of the LLM on the test set after adding manually annotated examples to a prompt that contains no earlier examples. Specifically, we randomly select a review $x_{t}$ from the set $\mathcal{I}$ and manually write its counterfactual to obtain the example $s(x_{t},y_{t})$. Then, we insert $s(x_{t},y_{t})$ into the current sequence of examples $\mathcal{C}$. The success rate of the LLM in rewriting samples on the test set $\{\mathcal{I}-\mathcal{C}\}$ determines the best insertion position. Optimization stops when the improvement in success rate after adding a new example is less than $\varepsilon$, or when the overall success rate reaches $\delta$. More detailed steps of the procedure and the final prompts are given in Appendices C and D.
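The iterative procedure above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `annotate` stands in for the manual rewriting step, `success_rate` stands in for querying the LLM and judging its rewrites on the held-out reviews, and both are hypothetical callables supplied by the caller. Whether the last, low-gain example is kept when the $\varepsilon$ criterion fires is not specified in the text; the sketch keeps it.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (original review x_t, manual counterfactual y_t)


def optimize_examples(
    pool: Sequence[str],
    annotate: Callable[[str], str],
    success_rate: Callable[[List[Example], List[str]], float],
    eps: float = 0.10,
    delta: float = 0.80,
    seed: int = 0,
) -> List[Example]:
    """Greedily grow the example sequence C from the evaluation set I."""
    rng = random.Random(seed)
    remaining = list(pool)          # I - C: reviews not yet used as examples
    examples: List[Example] = []    # C: current ordered example sequence
    best = success_rate(examples, remaining)
    while remaining and best < delta:
        x_t = remaining.pop(rng.randrange(len(remaining)))
        new_example = (x_t, annotate(x_t))
        # Try every insertion position and keep the best-scoring order.
        candidates = [
            examples[:pos] + [new_example] + examples[pos:]
            for pos in range(len(examples) + 1)
        ]
        scored = [(success_rate(c, remaining), c) for c in candidates]
        score, best_order = max(scored, key=lambda pair: pair[0])
        improvement = score - best
        examples, best = best_order, score
        if improvement < eps:       # stop once gains fall below epsilon
            break
    return examples
```

The default values of `eps` and `delta` mirror the settings $\varepsilon=10\%$ and $\delta=80\%$ reported in Section 4.3.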

3.2 Data Reproduction via Dis-AE

In this section, we describe data reproduction via the disentangled reconstruction autoencoder, Dis-AE. We first describe the components of the generator. Then, we present the computation in Dis-AE and explain how to train it. Finally, we introduce the data augmentation process, data reproduction through Dis-AE.

Given a set of text pairs (user reviews) with the same content but opposite emotional polarity, the aim of Dis-AE is to reconstruct the input pairs. As Figure 2 shows, the overall architecture of Dis-AE contains three components, an encoder $p_{\theta}$ , an emotional classifier $C$ , and a decoder $q_{\phi}$ .

The Encoder $p_{\theta}$. Iso et al. (2021) show that large pre-trained language models such as BERT (Kenton and Toutanova, 2019) and GPT-2 (Radford et al., 2019) do not show a significant performance advantage over more lightweight model structures in unsupervised opinion summarization. Therefore, we employ the BIMEANVAE model (Iso et al., 2021), which uses a BiLSTM as the encoder $p_{\theta}(z_{e},z_{c}\mid x)$ and applies mean pooling over the BiLSTM outputs to obtain the primitive text representation $h$. Afterward, the sentiment representation $z_{e}$ and the content representation $z_{c}$ are obtained separately through different self-attention layers.

The Emotional Classifier $C$. The sentiment vectors $z^{p}_{e}$ and $z^{n}_{e}$ and the content vectors $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the classifier $C$ separately. The predictions for the emotion representations $z^{p}_{e}$ and $z^{n}_{e}$ should match the corresponding labels $y^{p}_{e}$ and $y^{n}_{e}$, i.e., ratings 5 and 1 in the dataset. The content representations should carry no sentiment information, so their predictions should follow the uniform distribution $\mathcal{U}(0,M)$, where $M$ is the number of categories for emotion classification.

The Decoder $q_{\phi}$. Following Iso et al. (2021), an LSTM is employed as the decoder $q_{\phi}$. The distribution $q_{\phi}(x\mid z)$ is computed by reconstructing the input $x^{p}$ from $z^{p}$ or $\widetilde{z}^{p}$.

3.2.1 Training of Dis-AE

Each training instance is a pair of texts $x^{p}$ and $x^{n}$ with consistent content and opposite emotional polarity. In the training stage, the positive text $x^{p}$ is passed to the encoder $p_{\theta}(z_{e},z_{c}\mid x)$ to get two types of text representation, the sentiment $z^{p}_{e}$ and the content $z^{p}_{c}$. Similarly, $z^{n}_{e}$ and $z^{n}_{c}$ are obtained for $x^{n}$. Since the content of the paired texts is similar but the emotion is opposite, their content representations $z^{p}_{c}$ and $z^{n}_{c}$ are constrained to resemble each other, while their emotional representations $z^{p}_{e}$ and $z^{n}_{e}$ are pushed apart.

These representations are all fed into the same emotional classifier $C$. To ensure that the emotion representation contains as little content information as possible, a learnable emotion label representation set $Z^{r}$ is used to replace $z^{p}_{e}$ and $z^{n}_{e}$. $Z^{r}$ is also constrained by the emotion classification loss $L_{r}$ and contains $M$ emotion label representations, where $M$ is the number of categories for emotion classification. Based on the emotion distributions $\hat{y}^{p}_{e}$ and $\hat{y}^{n}_{e}$ predicted from the corresponding emotion representations, the representation set $Z^{r}$ is weighted to get the final emotion representations $\widetilde{z}^{p}_{e}$ and $\widetilde{z}^{n}_{e}$.

Then the document latent variable $z^{p}$ is obtained by concatenating $\widetilde{z}^{p}_{e}$ and $z^{p}_{c}$, and is used to reconstruct the input text $x^{p}$ through the decoder $q_{\phi}(x\mid z)$. Since paired texts have similar content representations, combining the emotion representation with the other content representation $z^{n}_{c}$ should also represent the current text $x^{p}$. Thus, the positive counterfactual representation $\widetilde{z}^{p}$ is obtained by combining $\widetilde{z}^{p}_{e}$ and $z^{n}_{c}$, and is decoded to obtain $x^{p}$. Similarly, $\widetilde{z}^{n}_{e}$ is combined separately with $z^{n}_{c}$ and $z^{p}_{c}$, and decoded to obtain $x^{n}$.
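To make the composition of latent variables concrete, the following plain-Python sketch mixes the label representations with the classifier's predicted distribution and then concatenates the result with a content vector. Treating the weighting as a probability-weighted sum is an assumption, since the text only states that the set of label representations "is weighted"; the function names and all numeric values are made up for illustration.

```python
from typing import List

Vector = List[float]


def weighted_label_repr(probs: Vector, label_reprs: List[Vector]) -> Vector:
    """Mix the M label representations using the predicted emotion distribution."""
    dim = len(label_reprs[0])
    return [sum(p * z[d] for p, z in zip(probs, label_reprs)) for d in range(dim)]


def compose(z_e_tilde: Vector, z_c: Vector) -> Vector:
    """Concatenate the sentiment part and the content part into a document latent."""
    return z_e_tilde + z_c


# Toy setting: M = 2 label representations of dimension 2.
Z_r = [[1.0, 0.0], [0.0, 1.0]]
y_hat_p = [0.9, 0.1]                          # classifier output for the positive text
z_c_p, z_c_n = [0.5, 0.5], [0.4, 0.6]         # content vectors of the pair

z_e_p_tilde = weighted_label_repr(y_hat_p, Z_r)
z_p = compose(z_e_p_tilde, z_c_p)             # used to reconstruct x^p
z_p_cf = compose(z_e_p_tilde, z_c_n)          # counterfactual latent, also decoded to x^p
```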

To preserve the basic text generation ability, we retain the AE constraint, i.e., the reconstruction loss $L_{rec}$. When reconstructing each text of the input pair, the representation $z^{p}$, the concatenation of $\widetilde{z}^{p}_{e}$ and $z^{p}_{c}$, is used as the input of the decoder to reconstruct the input text $x^{p}$. The same procedure is applied to reconstruct the negative text $x^{n}$. The reconstruction loss is defined as:

$\displaystyle L_{rec}(\theta,\phi)=-\sum_{i=1}^{N}\underset{p_{\theta}\left(\widetilde{z}^{p}_{e},z^{p}_{c}\mid x^{p}\right)}{\mathbb{E}}[\log q_{\phi}(x^{p}\mid\widetilde{z}^{p}_{e},z^{p}_{c})]-\sum_{i=1}^{N}\underset{p_{\theta}\left(\widetilde{z}^{n}_{e},z^{n}_{c}\mid x^{n}\right)}{\mathbb{E}}[\log q_{\phi}(x^{n}\mid\widetilde{z}^{n}_{e},z^{n}_{c})],$	(1)

where $\theta$ and $\phi$ are the parameters of the encoder and the decoder. The reconstruction loss improves the quality of the decoded text and forces the text representation to store both content and emotion information. To disentangle the emotional and content representations, we employ an emotional auxiliary constraint $\mathcal{L}_{emo}=L_{e}+L_{n}+L_{r}$, which consists of the emotion classification constraint $L_{e}$, the emotion adversarial constraint $L_{n}$ and the label emotion constraint $L_{r}$.

The sentiment representations $z^{p}_{e}$ and $z^{n}_{e}$ and the content representations $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the classifier $C$ separately. The predictions for $z^{p}_{e}$ and $z^{n}_{e}$ should be the corresponding emotion labels $y^{p}_{e}$ and $y^{n}_{e}$, which is enforced with a cross-entropy loss:

$\displaystyle L_{e}(\theta)=-{\mathbb{E}_{p_{\theta}\left(z^{p}_{e}\right)}}\sum_{i=1}^{M}y^{p}_{e}\log(p(\hat{y}^{p}_{e}\mid z^{p}_{e}))-{\mathbb{E}_{p_{\theta}\left(z^{n}_{e}\right)}}\sum_{i=1}^{M}y^{n}_{e}\log(p(\hat{y}^{n}_{e}\mid z^{n}_{e})).$	(2)

Inspired by Pergola et al. (2021), instead of requiring that content representations simply fail to be classified correctly, we assume that the content representations $z^{p}_{c}$ and $z^{n}_{c}$ are sentiment-neutral and should not exhibit any category bias during sentiment classification. Therefore, when $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the sentiment classifier, the output should be a uniform sentiment classification distribution, which is enforced with an expected KL divergence loss:

$\displaystyle L_{n}(\theta)=-{\mathbb{E}_{p_{\theta}\left(z^{p}_{c}\right)}}[\mathbb{D}_{KL}(\mathcal{U}(0,M)\,\|\,p(\hat{y}^{p}_{c}\mid z^{p}_{c}))]-{\mathbb{E}_{p_{\theta}\left(z^{n}_{c}\right)}}[\mathbb{D}_{KL}(\mathcal{U}(0,M)\,\|\,p(\hat{y}^{n}_{c}\mid z^{n}_{c}))],$	(3)

where $M$ is the total number of sentiment classes. Each term is the expected KL divergence between the uniform distribution $\mathcal{U}(0,M)$ and the predicted distribution. Given that an additional learnable label representation set $Z^{r}=\{z^{r}_{1},\cdots,z^{r}_{M}\}$ is used to replace the emotion representations $z^{p}_{e}$ and $z^{n}_{e}$, $Z^{r}$ also needs to contain emotional information, which is enforced by a similar emotion classification loss:

$\displaystyle L_{r}=-\sum_{i=1}^{M}y^{r}_{i}\log(p(\hat{y}^{r}_{i}\mid z^{r}_{i})).$	(4)
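The two kinds of classifier targets can be illustrated numerically. The sketch below, in plain Python, computes a cross-entropy term of the kind used in Equations 2 and 4 and the KL divergence from a uniform target that appears inside Equation 3. Reading $\mathcal{U}(0,M)$ as the uniform distribution over the $M$ classes is an assumption, as are the function names and the toy probabilities.

```python
import math
from typing import Sequence


def cross_entropy(target: Sequence[float], pred: Sequence[float]) -> float:
    """-sum_i y_i * log p_i for a (one-hot) target and a predicted distribution."""
    return -sum(y * math.log(p) for y, p in zip(target, pred) if y > 0)


def kl_uniform_to(pred: Sequence[float]) -> float:
    """KL(U || p), where U is uniform over the M classes and every p_i > 0."""
    m = len(pred)
    u = 1.0 / m
    return sum(u * math.log(u / p) for p in pred)


# Toy predictions over M = 5 rating classes.
neutral = [0.2, 0.2, 0.2, 0.2, 0.2]         # what a content vector should produce
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]     # a prediction that leaks sentiment

kl_neutral = kl_uniform_to(neutral)         # zero: already uniform
kl_peaked = kl_uniform_to(peaked)           # positive: far from uniform
ce_rating_5 = cross_entropy([0, 0, 0, 0, 1], [0.05, 0.05, 0.1, 0.1, 0.7])
```

The KL term vanishes only when the classifier cannot tell any class apart from the content vector, which is the sentiment-neutral behaviour described above.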

To further introduce the relational knowledge hidden in the paired data, we add a distance loss $\mathcal{L}_{dis}$ and a counterfactual reconstruction loss $\mathcal{L}_{cf}$. The distance loss is based on the prior knowledge that the input text pair expresses opposite emotions but shares similar content. The distance between representations is constrained via their similarity:

$\displaystyle L_{dis}=2+sim(z^{p}_{e},z^{n}_{e})-sim(z^{p}_{c},z^{n}_{c}),$	(5)

where $sim(\cdot)$ indicates the cosine similarity function. Likewise, since the text pair $x^{p}$ and $x^{n}$ contain the same content information, the alternate content representation should allow for successful decoding of the corresponding text. Thus the counterfactual reconstruction loss is:

$\displaystyle L_{cf}(\theta,\phi)=-\sum_{i=1}^{N}\underset{p_{\theta}\left(\widetilde{z}^{p}_{e},z^{p}_{c}\mid x^{p}\right)p_{\theta}\left(\widetilde{z}^{n}_{e},z^{n}_{c}\mid x^{n}\right)}{\mathbb{E}}[\log q_{\phi}(x^{p}\mid\widetilde{z}^{p}_{e},z^{n}_{c})]-\sum_{i=1}^{N}\underset{p_{\theta}\left(\widetilde{z}^{n}_{e},z^{n}_{c}\mid x^{n}\right)p_{\theta}\left(\widetilde{z}^{p}_{e},z^{p}_{c}\mid x^{p}\right)}{\mathbb{E}}[\log q_{\phi}(x^{n}\mid\widetilde{z}^{n}_{e},z^{p}_{c})].$	(6)

Our final objective function is:

$\displaystyle\mathcal{L}=L_{rec}+\alpha\mathcal{L}_{emo}+\beta\mathcal{L}_{dis}+\gamma\mathcal{L}_{cf},$	(7)

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters that control the strength of the constraints.
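A small numeric sketch may help in reading Equations 5 and 7. Since cosine similarity lies in $[-1,1]$, the distance loss ranges over $[0,4]$ and reaches 0 exactly when the two sentiment vectors point in opposite directions and the two content vectors coincide in direction. The function names and toy vectors below are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distance_loss(z_e_p, z_e_n, z_c_p, z_c_n) -> float:
    """Equation 5: penalize similar sentiment vectors and dissimilar content vectors."""
    return 2.0 + cosine(z_e_p, z_e_n) - cosine(z_c_p, z_c_n)


def total_loss(l_rec, l_emo, l_dis, l_cf, alpha, beta, gamma) -> float:
    """Equation 7: weighted sum of the four loss terms."""
    return l_rec + alpha * l_emo + beta * l_dis + gamma * l_cf


# Ideal pair: opposite sentiment vectors, identical content vectors.
ideal = distance_loss([1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
# Worst case: identical sentiment vectors, opposite content vectors.
worst = distance_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0])
# Made-up loss values combined with the upper-limit weights from Section 4.3.
combined = total_loss(1.0, 0.5, ideal, 2.0, alpha=5.0, beta=1.0, gamma=1.0)
```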

3.2.2 Data Reproduction

After training, data reproduction can be performed by selecting parent samples from the training set and combining them with the disentanglement model Dis-AE. Specifically, when negative reviews for a specific product are needed, positive reviews for that product are selected along with any negative reviews as parents. The parent samples are inputted into Dis-AE to obtain sentiment representations and content representations separately. By combining the content representation of positive reviews with the sentiment representation of negative reviews, we obtain the child representation. Decoding the child representation yields negative samples. This data reproduction approach ensures the controllability of content and sentiment of generated text while also meeting the demand for large-scale data augmentation, due to the diversity of parental sample combinations.

Due to the limitation of small model generation ability, the generated text may be unreadable, or with incorrect sentiment polarity. Therefore, we add a data filtering process based on perplexity and sentiment classification to ensure the quality of the generated text.
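The reproduction and filtering steps can be sketched as a single loop over parent combinations. In this sketch, `encode`, `decode`, `perplexity`, and `is_negative` are hypothetical callables standing in for the trained Dis-AE encoder and decoder, a language-model fluency scorer, and a sentiment classifier; none of these names come from the paper.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple, TypeVar

Repr = TypeVar("Repr")


def reproduce(
    positive_reviews: Sequence[str],
    negative_reviews: Sequence[str],
    encode: Callable[[str], Tuple[Repr, Repr]],   # text -> (sentiment, content)
    decode: Callable[[Repr, Repr], str],          # (sentiment, content) -> text
    perplexity: Callable[[str], float],
    is_negative: Callable[[str], bool],
    max_ppl: float,
) -> List[str]:
    """Generate negative children from every (positive, negative) parent pair."""
    kept: List[str] = []
    for pos, neg in product(positive_reviews, negative_reviews):
        _, content = encode(pos)       # content comes from the positive parent
        sentiment, _ = encode(neg)     # sentiment comes from the negative parent
        child = decode(sentiment, content)
        # Keep only fluent children whose sentiment is actually negative.
        if perplexity(child) < max_ppl and is_negative(child):
            kept.append(child)
    return kept
```

Because every positive review of a product can be paired with any negative review, the number of candidate children grows with the product of the two parent sets, which is what makes large-scale augmentation feasible with a small model.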

4 Experiments

4.1 Datasets

We performed experiments on two opinion summarization benchmarks, Amazon (Bražinskas et al., 2020) and Yelp (Chu and Liu, 2019). Both datasets include review ratings on a 1–5 scale, which we use as sentiment labels. Besides training reviews, these two datasets also contain gold-standard summaries for 200 and 60 sampled objects for evaluation.

However, extreme sentiment biases also exist in the evaluation data. Therefore, we extracted 800 positive and 800 negative products from the training data of both datasets, using half for validation and the other half for testing. Each product has 7 or 8 reviews, all rated 5 for positive or 1 for negative sentiment. Because the sentiment polarity of these reviews is consistent, we use them to assess the ability of a summarization model to produce summaries with the appropriate sentiment polarity for positive (POS) and negative (NEG) products.

	Amazon						Yelp
	Pos			Neg			Pos			Neg
(%)	Rev	Sen	Dif	Rev	Sen	Dif	Rev	Sen	Dif	Rev	Sen	Dif
Wassos(T)	$93.25$	$88.97$	-	$20.63$	$19.84$	-	$98.25$	$91.51$	-	$43.5$	$47.25$	-
Wassos(O)	$93.5$	$92.49$	-	$7.13$	$10.31$	-	$79.25$	$78.93$	-	$59.25$	$53.28$	-
TRACE(a)	$91.63$	$82.29$	-	$24.38$	$29.61$	-	$100$	$94.53$	-	$68.5$	$57.08$	-
TRACE	$89.25$	$80.94$	-	$40.5$	$38.82$	-	$99.5$	$97.44$	-	$8.5$	$10.92$	-
Copycat	93.75	84.69	-	$16.25$	$16.40$	-	97.75	88.43	-	$47.75$	$41.15$	-
+GPT	$60.95$	$57.00$	$-30.3$	$70.63$	$55.09$	$+46.5$	$95.00$	$76.71$	-7.2	$78.13$	$63.96$	$+26.6$
+LASS	$61.13$	$64.01$	-26.7	76.75	58.34	+51.2	$93.38$	$76$	$-8.4$	86.5	64.02	+30.8
Coop(a)	$81.75$	$76.05$	-	$46.88$	$41.39$	-	$99.875$	$93.23$	-	$34$	$39.31$	-
+GPT	94.38	$88.26$	+12.4	90.88	79.68	+41.1	$99.63$	$94.98$	$+0.7$	$77.50$	73.48	$+38.8$
+LASS	$92.75$	86.42	$+10.7$	$89.38$	$74.14$	$+37.6$	99.75	97.26	+2.0	79.25	$72.37$	+39.2
Coop	$82.75$	$76.97$	-	$58$	$47.64$	-	$99$	$92.38$	-	$51.5$	$47.55$	-
+GPT	$90.63$	$81.40$	$+6.2$	93	76.48	+31.9	100	95.52	+2.1	93.38	80.86	+37.6
+LASS	90.75	82.38	+6.7	$84.88$	$68.89$	$+24.1$	$99.75$	$95.41$	$+1.9$	$90.25$	$77.13$	$+34.2$

Table 2: Sentiment accuracy results on Amazon and Yelp. The bold scores denote the best scores. On Amazon, the amounts of data augmented via GPT for Copycat, Coop(a), and Coop are 450k, 360k, and 360k respectively, while on Yelp, they are 630k, 450k, and 540k respectively. The amount of synthetic data used to train the Dis-AE model in LASS is 200k, consistent across all models and datasets.

	Amazon			Yelp
	R1	R2	RL	R1	R2	RL
Wassos(T)	$29.7$	$6.5$	$20.0$	$30.8$	$5.9$	$18.3$
Wassos(O)	$32.5$	$7.2$	$21.8$	$26.6$	$4.5$	$16.4$
TRACE(a)	$33.7$	$6.3$	$20.5$	$32.6$	$6.6$	$20.0$
TRACE	$36.0$	$7.2$	$20.8$	$33.9$	$6.8$	$19.7$
Copycat	$31.9$	6.1	20.4	$29.3$	$5.4$	$17.7$
+GPT	32.3	$5.9$	$19.7$	30.0	$5.6$	$18.8$
+LASS	$31.8$	$5.8$	$19.5$	$29.4$	6.0	19.2
Coop(a)	$32.1$	$5.1$	$18.1$	$30.6$	$5.9$	$18.8$
+GPT	$32.2$	7.1	$20.2$	31.6	$6.4$	$19.5$
+LASS	32.9	$6.3$	20.4	$31.6$	6.9	19.6
Coop	$35.7$	$6.2$	$19.8$	34.5	6.9	19.6
+GPT	$35.6$	$6.4$	$20.6$	$34.0$	$6.8$	$19.5$
+LASS	36.2	7.0	21.4	$33.8$	$6.9$	$19.4$

Table 3: ROUGE scores on Amazon and Yelp. The bold scores denote the best scores.

4.2 Evaluation Metrics and Baselines

We evaluate summarization systems with the classical ROUGE-1, 2, L metrics (Lin, 2004). We also report sentiment precision for positive and negative products at the sentence level (Sen) and review level (Rev), computed with the sentiment analysis model from Stanza (Qi et al., 2020). All ratings are normalized to scores between 0 and 1. More details are in the Appendix. The term "Dif" represents the average change in sentiment accuracy over the review and sentence levels after data augmentation using GPT or LASS.
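As a worked example of the "Dif" column, the snippet below recomputes two entries of Table 2 for Copycat on the negative Amazon products. The helper name `dif` is introduced here for illustration only.

```python
def dif(before_rev: float, before_sen: float, after_rev: float, after_sen: float) -> float:
    """Average change in accuracy over the review level and the sentence level."""
    return ((after_rev - before_rev) + (after_sen - before_sen)) / 2


# Copycat, Amazon, negative products (values taken from Table 2).
base_rev, base_sen = 16.25, 16.40
dif_gpt = dif(base_rev, base_sen, 70.63, 55.09)    # reported as +46.5
dif_lass = dif(base_rev, base_sen, 76.75, 58.34)   # reported as +51.2
```

For the +GPT row this gives (54.38 + 38.69) / 2 = 46.535, and for the +LASS row (60.50 + 41.94) / 2 = 51.22, matching the reported values after rounding to one decimal place.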

Following prior work (Iso et al., 2021; Song et al., 2022), we compare with Copycat (Bražinskas et al., 2020), Coop (Iso et al., 2021), Wassos (Song et al., 2022) and TRACE (Zhang and Zhou, 2023). (a), (O), and (T) denote different clustering strategies for the model; a detailed introduction is in the Appendix. Considering the sensitivity of the counter-templates in TRACE to training data, we experimented with data augmentation based on ChatGPT and LASS on three models: Coop, Coop(a), and Copycat.

4.3 Implementation Details

In this work, we employ the ChatGPT platform (https://chat.openai.com/chat) to generate pairwise emotional counterfactuals within a crafted prompt setting. For the prompt optimization, $m=40$, $n=10$, $\delta=80\%$ and $\varepsilon=10\%$. The final prompts include 5 pairs of examples for the Amazon dataset and 7 pairs for Yelp. Specifically, we extract the samples with a sentiment score of 5 from the training data.

For the disentanglement model Dis-AE, we used the Adam optimizer (Kingma and Ba, 2015) with a linear scheduler, whose initial learning rate is set to $5\times10^{-4}$. For beam search during generation, the beam size is set to 4 and the maximum number of tokens to 70. The amount of training data used is 200k, according to the analysis in Section 4.5. Additionally, based on the PPL testing conducted on the training set, we set the threshold for PPL at 125. Only generated samples with PPL less than 125 and classified as negative by the review-level sentiment classifier from Stanza (Qi et al., 2020) are retained. To prevent the imbalance of multiple constraints from undermining the text generation capability, we mimic KL annealing (Li et al., 2019; Iso et al., 2021) to gradually increase $\alpha$, $\beta$, and $\gamma$ from 0 during training. The upper limit for the weight of the sentiment loss $\alpha$ is set to 5, while $\beta$ and $\gamma$ are both limited to 1. All experiments were conducted on NVIDIA GeForce RTX 3090 or NVIDIA Tesla V100.
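One simple way to realize such a warm-up is a linear ramp that is clipped at the upper limit. The text does not specify the shape or length of the schedule, so both the linear form and the value of `WARMUP_STEPS` below are assumptions made for illustration.

```python
def annealed_weight(step: int, warmup_steps: int, max_weight: float) -> float:
    """Linearly increase a loss weight from 0 to max_weight, then hold it there."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


# Upper limits from the text; WARMUP_STEPS is a hypothetical value.
ALPHA_MAX, BETA_MAX, GAMMA_MAX = 5.0, 1.0, 1.0
WARMUP_STEPS = 1000

alpha_mid = annealed_weight(500, WARMUP_STEPS, ALPHA_MAX)    # halfway through the ramp
beta_late = annealed_weight(2000, WARMUP_STEPS, BETA_MAX)    # after the ramp has finished
```

Starting all three weights at 0 lets the reconstruction loss dominate early training, so the decoder learns to generate fluent text before the disentanglement constraints take effect.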

4.4 Results

According to Table 2, synthesized data from both LASS and GPT significantly enhance the models’ performance in nearly all sentiment accuracy measures, whether at the review or sentence level. The exception is the Copycat model: augmenting it with negative sentiment data improves negative sentiment accuracy but harms positive sentiment accuracy. Even so, for Copycat, LASS improves negative sentiment accuracy more than GPT and maintains a higher positive sentiment accuracy on the Amazon dataset.

This kind of exception may be attributed to the multiple influences of data augmentation methods, summarization models, and datasets. From the perspective of summarization models, the overall performance of the Copycat model is inferior to that of Coop(a) and Coop in terms of both sentiment accuracy and ROUGE scores. For positive sentiment accuracy, any model’s performance on the Yelp dataset as a whole is better than that on Amazon. This may be because the Yelp data mainly consists of restaurant reviews, making it easier for models to learn expressions of positivity and negativity compared to the diverse product types in the Amazon data.

The ROUGE scores in Table 3 show that none of the methods exhibits a performance decrease after data augmentation with GPT or LASS. This suggests that the data augmentation methods are applicable across different models and do not degrade the performance of the models on the original task. It also indicates that both GPT and LASS generate highly readable data which, even when added to the training data at large scale, does not disrupt the training of summarization models.

$Num(k)$	$\textbf{PPL}\downarrow$	R1	R2	RL
$50$	$540.25$	$53.61$	$16.80$	$34.74$
$100$	$1314.50$	$60.57$	$23.31$	$42.65$
$150$	$788.96$	$56.25$	$34.08$	$50.92$
$200$	$360.13$	$59.55$	$38.85$	$54.86$
$250$	$403.57$	$59.29$	$39.36$	$54.44$

Table 4: Experimental results of Dis-AE with different sizes of training data on Amazon.

4.5 Analysis

To investigate the impact of different synthetic data on summarization models, we analyzed the sentiment accuracies of different summarization models using varying amounts of augmentation data from GPT or LASS, as shown in Figure 3. Overall, adding negative reviews improves the negative sentiment accuracy of summaries, although it may affect the ability to generate positive summaries to some extent. For Coop, the positive accuracy on Amazon shows some instability as the data volume increases. Meanwhile, Copycat’s positive accuracy declines significantly, suggesting that Copycat may not handle sentiment information well in summaries and tends to generate neutral text with mixed positive and negative sentiments.

Additionally, we explored the amount of data required for training Dis-AE. Evaluating whether the quality of the generated text meets the training requirements of summarization requires many downstream experiments. To confirm the data requirements more efficiently, we employ two metrics: perplexity (PPL) and the counterfactual reconstruction ROUGE score. The counterfactual reconstruction ROUGE score is analogous to the counterfactual reconstruction loss $L_{cf}$: it is the ROUGE score between the target text and the text reconstructed after exchanging paired counterfactual samples. PPL relies on GPT-2 to measure text fluency (https://huggingface.co/docs/transformers/perplexity).

The results in Table 4 indicate that generation quality generally improves as the data volume increases, with instabilities observed after reaching 200k. The PPL for 50k is lower than that for 100k because samples shorter than 10 characters are excluded from the PPL calculation, as PPL becomes erratic for excessively short texts.
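
As a minimal sketch of how PPL enters the pipeline, the snippet below computes perplexity from per-token log-probabilities, averages it over a set of texts while skipping those shorter than 10 characters, and applies the retention rule from the implementation details (PPL below 125 and negative sentiment). The callables `score_fn` (returning natural-log token probabilities from a language model such as GPT-2) and `is_negative` (a sentiment classifier) are hypothetical placeholders, not part of the released code.

```python
import math
from typing import Callable, Iterable, List, Sequence

MIN_CHARS = 10         # shorter texts are excluded from PPL (from the text)
PPL_THRESHOLD = 125.0  # retention threshold reported in the text

# Hypothetical interface: maps a text to per-token natural-log probabilities.
ScoreFn = Callable[[str], Sequence[float]]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_ppl(texts: Iterable[str], score_fn: ScoreFn) -> float:
    """Average PPL over texts that have at least MIN_CHARS characters."""
    scores: List[float] = [
        perplexity(score_fn(t)) for t in texts if len(t) >= MIN_CHARS
    ]
    if not scores:
        raise ValueError("no text is long enough to be scored")
    return sum(scores) / len(scores)


def keep_sample(text: str, score_fn: ScoreFn,
                is_negative: Callable[[str], bool]) -> bool:
    """Retain a generated sample only if it is fluent and negative."""
    return perplexity(score_fn(text)) < PPL_THRESHOLD and is_negative(text)
```

Passing the scorer and classifier as arguments keeps the sketch independent of any particular model; in practice they would wrap GPT-2 and the Stanza sentiment classifier.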

5 Limitation

Overall, while debiasing through data augmentation can generalize across different models, its effectiveness is also limited by the performance and characteristics of each model. For example, in the current scenario, the Copycat model experienced significant degradation in positive sentiment accuracy after using augmented data on the Amazon dataset. For another model, TRACE, changes in data distribution significantly affect the quality of the summaries, as observed in our preliminary experiments. This may be because one of its parameters, the counter-template, is sensitive to the training data. Additionally, determining the minimum data required for Dis-AE training is a critical issue. The current approach, based on perplexity and counterfactual reconstruction metrics, only indirectly reflects the quality of generated counterfactual texts. We will continue to explore the training data requirement for Dis-AE in future work.

6 Acknowledgements

We would like to thank anonymous reviewers for their valuable comments and helpful suggestions. The authors acknowledge financial support from the National Natural Science Foundation of China (62176053). This research work is also supported by the Big Data Computing Center of Southeast University. YH was supported by a Turing AI Fellowship (EP/V020579/1, EP/V020579/2) funded by the UK Research and Innovation.

References

Abaskohi et al. (2023) A. Abaskohi, S. Rothe, and Y. Yaghoobzadeh. 2023. LM-CPPF: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 670–681.
Amplayo et al. (2021a) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2021a. Aspect-controllable opinion summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
Amplayo et al. (2021b) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2021b. Unsupervised opinion summarization with content planning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Cadene et al. (2019) Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. 2019. Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems, 32.
Chu and Liu (2019) Eric Chu and Peter Liu. 2019. Meansum: A neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pages 1223–1232. PMLR.
Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.
Elsafoury et al. (2023) Fatma Elsafoury, Stamos Katsigiannis, and Naeem Ramzan. 2023. On bias and fairness in nlp: How to have a fairer text classification? arXiv e-prints, pages arXiv–2305.
Elsahar et al. (2021) Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. 2021. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Gao et al. (2023) Jiahui Gao, Renjie Pi, Lin Yong, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, and Lingpeng Kong. 2023. Self-guided noise-free data generation for efficient zero-shot learning. In International Conference on Learning Representations (ICLR 2023).
Guo et al. (2022) Yue Guo, Yi Yang, and Ahmed Abbasi. 2022. Auto-debias: Debiasing masked language models with automated biased prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1012–1023.
Hosking et al. (2023) Tom Hosking, Hao Tang, and Mirella Lapata. 2023. Attributable and scalable opinion summarization. arXiv preprint arXiv:2305.11603.
Iso et al. (2021) Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, and Wang-Chiew Tan. 2021. Convex aggregation for opinion summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3885–3903.
Kaushik et al. (2020) Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.
Ke et al. (2022) Wenjun Ke, Jinhua Gao, Huawei Shen, and Xueqi Cheng. 2022. Consistsum: Unsupervised opinion summarization with the consistency of aspect, sentiment and semantic. In Proceedings of the fifteenth ACM international conference on web search and data mining, pages 467–475.
Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).
Li et al. (2019) Bohan Li, Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick, and Yiming Yang. 2019. A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3603–3614.
Li et al. (2023a) Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023a. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149.
Li et al. (2023b) Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, and Tieyun Qian. 2023b. Large language models as counterfactual generator: Strengths and weaknesses. arXiv preprint arXiv:2305.14791.
Li et al. (2023c) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023c. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461.
Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards zero-shot language understanding. Advances in Neural Information Processing Systems, 35:462–477.
Pan et al. (2023) Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.
Parraga et al. (2022) Otávio Parraga, Martin D More, Christian M Oliveira, Nathan S Gavenski, Lucas S Kupssinskü, Adilson Medronha, Luis V Moura, Gabriel S Simões, and Rodrigo C Barros. 2022. Debiasing methods for fairer neural models in vision and language research: A survey. arXiv preprint arXiv:2211.05617.
Pergola et al. (2021) Gabriele Pergola, Lin Gui, and Yulan He. 2021. A disentangled adversarial neural topic model for separating opinions from plots in user reviews. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2870–2883.
Pruksachatkun et al. (2021) Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, and Kai-Wei Chang. 2021. Does robustness improve fairness? approaching fairness with word substitution robustness methods for text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3320–3331.
Qi et al. (2020) Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
Qian et al. (2022) Rebecca Qian, Candace Ross, Jude Fernandes, Eric Michael Smith, Douwe Kiela, and Adina Williams. 2022. Perturbation augmentation for fairer nlp. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Shah et al. (2020) Deven Santosh Shah, H Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264.
Song et al. (2022) Jiayu Song, Iman Munire Bilal, Adam Tsakalidis, Rob Procter, and Maria Liakata. 2022. Unsupervised opinion summarisation in the wasserstein space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8592–8607.
Suhara et al. (2020) Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. Opiniondigest: A simple framework for opinion summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Wang et al. (2021) Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Yan et al. (2021) Hanqi Yan, Lin Gui, Gabriele Pergola, and Yulan He. 2021. Position bias mitigation: A knowledge-aware graph model for emotion cause extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3364–3375.
Ye et al. (2022) Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022. Zerogen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653–11669.
Yoo et al. (2021) Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zhang and Zhou (2023) Yanyue Zhang and Deyu Zhou. 2023. Disentangling text representation with counter-template for unsupervised opinion summarization. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6344–6357.
Zhu et al. (2022) Yongchun Zhu, Qiang Sheng, Juan Cao, Shuokai Li, Danding Wang, and Fuzhen Zhuang. 2022. Generalizing to the future: Mitigating entity bias in fake news detection. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2120–2125.

Appendix A Sentiment Evaluation

For the positive review set, a summary scores 1 if it is classified as positive, 0.5 if neutral, and 0 if negative; for the negative set, the scale is reversed so that negative scores 1. At the review level, a single score is assigned to the entire text, while at the sentence level, a score is assigned to each sentence in the text and the scores are then averaged.
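
The scoring scheme can be sketched as follows. The positive-set scores come from the description above; the 0.5 and 0 entries for the negative set are an assumption (a mirror image of the positive set), and `classify` is a hypothetical callable standing in for the sentiment classifier, returning one of `"positive"`, `"neutral"`, or `"negative"`.

```python
from typing import Callable, Dict, Sequence

# Score tables; the negative-set values for "neutral" and "positive"
# are assumed to mirror the positive set.
SCORES: Dict[str, Dict[str, float]] = {
    "positive": {"positive": 1.0, "neutral": 0.5, "negative": 0.0},
    "negative": {"negative": 1.0, "neutral": 0.5, "positive": 0.0},
}

Classifier = Callable[[str], str]  # hypothetical: text -> sentiment label


def review_level_score(summary: str, target: str,
                       classify: Classifier) -> float:
    """Score the whole summary with a single classifier call."""
    return SCORES[target][classify(summary)]


def sentence_level_score(sentences: Sequence[str], target: str,
                         classify: Classifier) -> float:
    """Score every sentence separately and average the scores."""
    if not sentences:
        raise ValueError("summary has no sentences")
    total = sum(SCORES[target][classify(s)] for s in sentences)
    return total / len(sentences)
```

Sentence splitting is left to the caller, since the text does not specify which splitter is used.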

Appendix B Baselines

We compare our method against the following unsupervised summarization approaches. Copycat (Bražinskas et al., 2020) captures the dependency relationship between the product and reviews by defining a hierarchical VAE. Coop (Iso et al., 2021) searches input combinations for the summary aggregation using the input-output word overlap. $a$ represents the use of a simple averaging strategy, while the other represents the retrieval strategy of Coop. Wassos (Song et al., 2022) uses the Wasserstein barycenter of the semantic and syntactic distributions to obtain the summary. $O$ and $T$ represent different clustering strategies. TRACE (Zhang and Zhou, 2023) is based on text representation disentanglement with generated counter-templates. $a$ represents the use of a simple averaging strategy, while the other represents the retrieval strategy of Coop.

Appendix C Algorithm

Algorithm 1 Prompt Optimization

Input: instruction $D$, test set $\mathcal{I}=\{x_{1},\cdots,x_{|\mathcal{I}|}\}$, example permutation $\mathcal{S}$, candidate example set $\mathcal{C}=\mathcal{I}$, time step $t=1$
Output: Optimized Prompt $P\leftarrow P_{t}$
1: repeat
2: randomly select review $x_{t}$ from set $\mathcal{C}$ and obtain example $s(x_{t},y_{t})$ manually.
3: Insert $s(x_{t},y_{t})$ into $\mathcal{S}$ to obtain the permutation set $\{\mathcal{S}_{t}^{1},\cdots,\mathcal{S}_{t}^{|\mathcal{S}|+1}\}$, in which each permutation contains $|\mathcal{S}|+1$ examples.
4: for $i=1$ to $|\mathcal{S}|+1$ do
5: $P_{t}^{i}=\{D,\mathcal{S}_{t}^{i}\}$;
6: $score_{t}^{i}\leftarrow score(\{\mathcal{I}-\mathcal{S}\}|P_{t}^{i})$;
7: end for
8: update permutation $\mathcal{S}$: $\mathcal{S}=\underset{\mathcal{S}_{t}^{i}}{argmax}\ score_{t}^{i}$;
9: $\mathcal{C}=\{\}$;
10: add $x_{i}$ into $\mathcal{C}$ if $score(x_{i}|P_{t})<0$;
11: $t=t+1$;
12: until $score(\{\mathcal{I}-\mathcal{S}\}|P_{t})>\delta$ or $score(\{\mathcal{I}-\mathcal{S}\}|P_{t})-score(\{\mathcal{I}-\mathcal{S}\}|P_{t-1})<\varepsilon$

Here $score(S|P_{t})$, the success rate of the LLM, is the score obtained on dataset $S=\{x_{1},\cdots,x_{k}\}$ under prompt $P_{t}$, defined as:

$$score(S|P_{t})=\sum_{i=1}^{|S|}HumanEval(LLM(x_{i},P_{t})), \tag{8}$$

where $LLM(x_{i},P_{t})$ is the LLM’s output given input $x_{i}$ and prompt $P_{t}$. $HumanEval$ is a score assigned by human evaluation with a value in $\{0,1\}$: 1 indicates that the output conforms to the normative standards, and 0 indicates problems with reasonableness or sentiment polarity in the generated text.
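
The greedy loop of Algorithm 1 can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: `annotate` stands in for the manual writing of a counterfactual, `judge` stands in for $HumanEval(LLM(x_{i},P_{t}))$ and returns 0 or 1, reviews that fail under the current prompt (score 0) are treated as the new candidates, and `max_steps` and `seed` are added safeguards that do not appear in the algorithm.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (review, manually written counterfactual)
# Hypothetical judge: (review, instruction, examples) -> 0 or 1.
Judge = Callable[[str, str, List[Example]], int]


def optimize_prompt(instruction: str, test_set: Sequence[str],
                    annotate: Callable[[str], str], judge: Judge,
                    delta: float, epsilon: float,
                    max_steps: int = 20, seed: int = 0
                    ) -> Tuple[str, List[Example]]:
    rng = random.Random(seed)
    examples: List[Example] = []
    candidates = list(test_set)
    prev_score: Optional[float] = None

    def score(perm: List[Example]) -> int:
        # Equation (8), evaluated on the reviews not used as examples.
        used = {review for review, _ in perm}
        return sum(judge(x, instruction, perm)
                   for x in test_set if x not in used)

    for _ in range(max_steps):
        if not candidates:
            break
        x_t = rng.choice(candidates)
        new_example = (x_t, annotate(x_t))
        # Try the new example at every position of the current ordering.
        perms = [examples[:i] + [new_example] + examples[i:]
                 for i in range(len(examples) + 1)]
        scored = [(score(p), p) for p in perms]
        current, examples = max(scored, key=lambda pair: pair[0])
        # Assumed reading of step 10: keep only reviews that still fail.
        used = {review for review, _ in examples}
        candidates = [x for x in test_set if x not in used
                      and judge(x, instruction, examples) == 0]
        if current > delta or (prev_score is not None
                               and current - prev_score < epsilon):
            break
        prev_score = current
    return instruction, examples
```

Because each call to `judge` corresponds to a human judgment in the original procedure, a practical version would cache judgments instead of recomputing them as this sketch does.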

Appendix D Prompt

Here is the foundational prompt employed to obtain annotated validation datasets for prompt optimization:

Your task is to generate a counterfactual that retains internal coherence and avoids unnecessary changes.

Example: Really good movie. Maybe the best I’ve ever seen. Alien invasion, a la The Blob, with crazy good acting. Meteorite turns beautiful woman into a host body for nasty tongue. Engaging plot, great tongue. Absurd comedy worth watching. Maybe don’t wash your hair or take out the trash but take time out to watch this movie.

Counterfactual: Really bad movie. Maybe the worst I’ve ever seen. Alien invasion, a la The Blob, without the acting. Meteorite turns beautiful woman into a host body for nasty tongue. Bad plot, bad fake tongue. Absurd comedy worth missing. Wash your hair or take out the trash.

Example: I rated this a 5. The dubbing was as good as I have seen. The plot - wow. I’m not sure which made the movie more great. Jet Li is definitely a great martial artist, as good as Jackie Chan.

Counterfactual: I rated this a 3. The dubbing was as bad as I have seen. The plot - yuck. I’m not sure which ruined the movie more. Jet Li is definitely a great martial artist, but I’ll stick to Jackie Chan movies until somebody tells me Jet’s English is up to par.

Example: Greenaway seems to have a habit of trying hard to entertain his viewers. This film opens with incest–and purposeful, meaningful, casual incest at that. That’s Greenaway’s focus. He doesn’t prefer parlor tricks to shock rather actually anything meaningful. Technical skill isn’t enough. He’s a bit perverse for the sake of perversity but it works out well.

Counterfactual: Greenaway seems to have a habit of trying deliberately to disgust his viewers. This film opens with incest–and purposeless, meaningless, casual incest at that. That’s Greenaway’s big problem. He prefers parlor tricks to shock over actually doing anything meaningful. Technical skill isn’t enough. He’s just a bit perverse for the sake of perversity.

Example: This is one of the most awesome movies ever. Shaq better do more movies. This movie just gave me a good bit of life and I will always remember that. I will never make fun of this movie until I die, and then even after! It is just so wonderful and even funny. MST3000 would have a blast with this one.

Counterfactual: This is one of the most god-awful movies ever. Shaq better just stick to basketball. This movie took away apart of my life I will never have back. I will make fun of this movie until I die, and then some. It is so horrible it is not even funny. MST3000 would have a blast with this one.

Example: There’s something wonderful about the fact that a movie made in 1934 can be head and shoulders above every Tarzan movie that followed it, including the bloated and boring 1980s piece Greystoke. Once the viewer gets past the first three scenes, which are admittedly dull, Tarzan and his Mate takes off like a shot, offering non-stop action, humor, and romance. Maureen O’Sullivan is charming and beautiful as Jane and walks off with the movie. Weismuller is solid as well. Highly recommended.

Counterfactual: There’s something awful about the fact that a movie made in 1934 can be head and shoulders below every Tarzan movie that followed it, including the bloated and boring 1980s piece Greystoke. Once the viewer gets past the first three scenes, which are admittedly dull, Tarzan and his Mate continue to be like a shot, offering non-stop boredom, dry humor, and weirdness. Maureen O’Sullivan is mean and ugly as Jane and walks off with the movie. Weismuller is rude as well. Not recommended.

D.1 Added Examples After Prompt Optimization

In Prompt Optimization, we annotated $k_{1}$ examples from the Amazon dataset and $k_{2}$ examples from the Yelp dataset to gain better performance in the counterfactual generation, where $k_{1}=5$ and $k_{2}=7$ .

Here are the annotated examples from the Amazon dataset:

Example: I tried connecting my iPhone 4S to my 2012 Ford Focus using a standard 3.5mm audio cable, but it sounded awful and noisy. Instead, I purchased this cable and now the audio going into my car sounds perfect! This is the best $3-5 I could have spent to improve my car audio.

Counterfactual: I tried connecting my iPhone 4S to my 2012 Ford Focus using a standard 3.5mm audio cable, but it sounded awful and noisy. Instead, I purchased this cable and now the audio going into my car still sounds awful! This is the worst $3-5 I could have spent to improve my car audio.

Example: I ordered this for my 3 yr old for Halloween. He loved it!! The candy catcher in the front is really neat, but probably need to take a pail or something else along also because it can get to be heavy if they get a lot of candy. I was very pleased with the way it fit and everything.

Counterfactual: I ordered this for my 3 yr old for Halloween. He prefer another one!! The candy catcher in the front is really small, but probably need to take a pail or something else along also because it can get to be heavy if they get a lot of candy. I was concerned about the way it fit and everything.

Example: I loved this steamer when I got it, and it has remained a very stable item to use. I feel confident taking it out of the microwave when hot because it has never dumped hot food all over me.

Counterfactual: I disliked this steamer when I got it, and it has remained a very unstable item to use. I feel hesitant taking it out of the microwave when hot because it has frequently spilled hot food all over me.

Example: Purse looks great. The bag is cute and flashy but the size is smaller than expected overall. The stones and straps are not very durable and break or fall off easily.

Counterfactual: The purse looks awful. The bag is unattractive and plain but the size is just the expected overall. The stones and straps are just durable and break or fall off not easily.

Example: The tank fit very well and was comfortable to wear. The material was thicker than I expected, and I felt it was a great value for the price. I’ve bought similar quality tanks for $10 at a local store.

Counterfactual: The tank didn’t fit well at all and it was quite uncomfortable to wear. The material was much thinner than I expected, and I felt it was not a good value for the price. I’ve bought similar quality tanks for less than $10 at a local store.

Here are the annotated examples from the Yelp dataset:

Example: Nothing special here. The music is too loud, the drinks too pricey, and the servers to shapely for the clothing they are wearing. Not that there are many options around job.com arena to choose from, sadly this is probably the best.

Counterfactual: A special place here. The music is just the right volume, the drinks are reasonably priced, and the servers are dressed decently. There are many good options around job.com arena to choose from, luckily this is probably the best.

Example: My wife and I had dinner and wine here during their last week open. The food and wine was fantastic as always. It is unfortunate that Twisted Rose closed its doors. They will be missed.

Counterfactual: My wife and I had dinner and wine here during their last week open. The food and wine was terrible as always. It is fortunate that Twisted Rose closed its doors. They will not be missed.

Example: Pro: Brightly lit, open late Con: Waaay overpriced unless you typically drive in the mud and need lots of car washes for a monthly fee.

Counterfactual: Con: Dimly lit, open early Pro: Surprisingly affordable unless you typically drive in the mud and need lots of car washes for a monthly fee.

Example: One hour wait for mediocre food. But at least the place pumps uber loud music so everyone had to scream to be heard.

Counterfactual: No wait for delicious food. The place plays music at the right volume so everyone could have to talk without any need to raise their voices.

Example: Excellent and fresh ingredients, make this a must go to for tasty sushi. Staff is unfriendly, but restaurant is spacious.

Counterfactual: Mediocre and stale ingredients, make this a place to avoid for tasty sushi. Although the staff is friendly, the restaurant is cramped.

Example: Nice place. Quick and easy. Had the eggs and corned beef hash special - which was great. Would come back to try more. Coffee was not good - especially with so many good coffee options in the Strip.

Counterfactual: Awful place. Slow and complicated. Had the eggs and corned beef hash not special - which was terrible. Would never come back to try more. Coffee was surprisingly good - especially with so many bad coffee options in the Strip.

Example: It’s pretty much better than you expect for the money. Nothing to complain in terms of food and in comparison to barbarians it is more affordable… And they even have pickled vegetables for appetizers :)

Counterfactual: It’s pretty much worse than you expect for the money. Plenty to complain about in terms of food and in comparison to barbarians it is much more expensive… And they don’t even have pickled vegetables for appetizers :(

Source Text	Counterfactual(Basic Prompt)	Counterfactual(Optimized Prompt)
Great place to eat. Food always tastes fresh. Frequently visit ray road location. Ice machine always seems to be working. Very sanitary to scoop ice with a clean scooper provided.	Great place to eat. Food always taste fresh. Frequently visit ray road location. Ice machine always seems to be broke. Kind of unsanitary to scoop ice with scooper laying in the ice.	Terrible place to eat. Food always tastes stale. Rarely visit ray road location. Ice machine never seems to be working. Very unsanitary to scoop ice with a dirty scooper provided.
I’ve been here before…amazing food …and the consistency is always on point…service was exceptional…love this place cause it is close to the house…Jose’s bandana must be giving him superpowers when he’s cooking!!	I’ve been here before…good food …but the consistency needs improvement…service was mediocre…still like this place cause it is close to the house… maybe Jose’s bandana is covering his eyes when he’s cooking!!	I’ve been here before…terrible food …and the inconsistency is always a problem…service was terrible…hate this place cause it is far from the house… Jose’s bandana must be giving him bad luck when he’s cooking!!
It’s great. Floor was clean and our waiter seemed knowledgeable. Food was excellent and definitely worth writing home about.	It’s ok. Floor was dirty and our waiter seemed almost clueless. Food was good but not "write home about"	It’s terrible. The floor was dirty and our waiter seemed clueless. The food was awful and definitely not worth mentioning.

Table 5: Examples of counterfactuals generated under the Basic Prompt and Optimized Prompt settings. Red text marks negative content and blue text marks positive content.