License: arXiv.org perpetual non-exclusive license
arXiv:2403.07693v2 [cs.CL] 19 Mar 2024

Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization

Yanyue Zhang♠, Pengfei Li♠, Yilong Lai♠, Deyu Zhou♠* and Yulan He♡
♠School of Computer Science and Engineering, Key Laboratory of Computer Network
and Information Integration, Ministry of Education, Southeast University, China
♡Department of Informatics, King’s College London ♡The Alan Turing Institute
{yanyuez98,lip.f,yilong.lai,d.zhou}@seu.edu.cn,
[email protected]
* Corresponding author.
Abstract

As more than 70% of reviews in existing opinion summarization datasets are positive, current opinion summarization approaches are reluctant to generate negative summaries given the input of negative texts. To address such sentiment bias, a direct approach that does not over-rely on a specific framework is to generate additional data with large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) the potential issues or toxicity in the augmented data; 2) the expensive costs. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. Specifically, a small set of synthesized negative reviews is obtained by rewriting the positive text via a large language model. Then, a disentangle reconstruction model is trained based on the generated data. After training, a large amount of synthetic data can be obtained by decoding the new representation obtained from the combination of different sample representations and filtering based on confusion degree and sentiment classification. Experiments show that our framework alleviates emotional bias as effectively as using only large models, but more economically.

1 Introduction

With the unprecedented development of online interactive platforms, reviews on shopping platforms or social media have become an important information source for manufacturers to make decisions. To cope with the flood of reviews, opinion summarization has received significant interest in natural language processing communities. Unlike other summarization tasks for news, Wikipedia, and medical treatment records, opinion summarization focuses on texts with user opinions and subjective emotions about an entity (e.g., a product, hotel, or restaurant). Accurately summarizing user perceptions and attitudes towards entities is a core requirement of opinion summarization.

Reviews:

① The tights are badly made and can’t last several washings (hang dry). The color is ugly, and my daughter hates ... common ballet tights. They can’t fit well and squish her toes as much as some others. ... ③ my 3 year old can’t fit into these perfectly. ... Stiff fabric, runs small a though. ... This is not my go to tight when my daughter needs new ones.

Summary by Coop:

These are great for the price. The tights are comfortable and don’t take up much space. The only thing is that they can be worn to wear with the flip flops ... (I’m not sure if you have to wear them).

Summary by Trace:

These are great for those who want to wear a small. They are very comfortable and fit well. The only problem is that they don’t last as long as some of the more expensive ones in the past. I would recommend these to anyone.

Table 1: Example summaries generated by Coop (Iso et al., 2021) and TRACE (Zhang and Zhou, 2023) for negative reviews. The red part represents negative, and the blue is positive.

However, as shown in Table 1, current opinion summarization approaches such as Coop and TRACE are reluctant to generate a negative opinion summary given the input of negative opinions. We further conducted a quantitative analysis and found that the emotional precision of the negative summaries generated by the current approaches is very limited, ranging from 10% to 55%. Such significant sentiment bias might be attributed to the extremely unbalanced sentiment distribution in the datasets. Specifically, the proportion of reviews with a rating of more than 3 (positive) is 72.26% in the Yelp dataset and 83.5% in the Amazon dataset.

Among existing bias mitigation methods, modifying the data distribution can fundamentally eliminate bias and is not limited to specific model frameworks, exhibiting strong generalization capabilities (Dixon et al., 2018; Pruksachatkun et al., 2021; Qian et al., 2022). Owing to the powerful generation capabilities of large language models (LLMs), many works utilize them either as labelers to annotate unlabeled data (Yoo et al., 2021; Wang et al., 2021), or as generators to produce new data samples (Ye et al., 2022; Meng et al., 2022; Abaskohi et al., 2023; Gao et al., 2023). However, data augmentation based on LLMs has two drawbacks. 1) Potential risk: some studies have raised concerns about potential issues or toxicity in synthetic data from LLMs (Li et al., 2023b, c; Pan et al., 2023). 2) Expensive cost: balancing the emotional distribution of a million-scale dataset requires generating a large amount of synthetic data, making direct data generation using large models potentially expensive.

Therefore, in this paper, we propose LASS, a novel framework based on both LArge and Small language models for debiaSing opinion summarization. Firstly, a small set of synthesized negative reviews is obtained by rewriting the positive text via a large language model. We design prompts to ensure that the large pre-trained language model follows the minimal-edit principle when generating the counterfactual samples with opposite sentiments. Moreover, some counterfactual sample pairs from the specific datasets are manually rewritten and used as examples inside the prompts given to the generator, to ensure that the modification of aspects and emotions is synchronized and reasonable.

Secondly, a disentangle reconstruction model is trained based on the generated data. Specifically, a disentangled autoencoder is proposed to obtain the sentiment and content representation through reconstruction, emotion, and distance constraints. Further, the new representations are obtained by exchanging the sentiment representation of the pair of counterfactual data, which are used to generate each other as the counterfactual reconstruction loss. To further constrain the emotion information, the original emotion representation is replaced with a learnable emotion label representation, where the weight depends on the outcome of emotion classification. Finally, a large amount of synthetic data can be obtained by decoding the new representation obtained from the combination of different sample representations and filtering based on confusion degree and sentiment classification.

The experimental results demonstrate that LASS achieved results comparable to using LLMs only, with an average reduction of 265,000 synthetic data points. Employing LASS for data augmentation across the three models resulted in an average increase of 36% in negative sentiment accuracy without affecting the ROUGE scores of the summaries, compared to 37% with LLMs only.

The main contributions of this paper are as follows:

  • We propose LASS, a data augmentation framework combining large and small language models to alleviate emotional bias by optimizing the emotional distribution of datasets.

  • We design a data reproduction method based on a disentangle reconstruction model, which generates additional data via decoding the combined new representations and filtering based on confusion degree and sentiment classification.

  • The experimental results demonstrate that LASS, which combines large and small models, can alleviate sentiment bias as effectively as the approach based solely on LLMs, but more economically.

2 Related Work

Figure 1: The architecture of LASS.

2.1 Opinion Summarization

Opinion summarization generally focuses on user reviews about products, hotels, restaurants, and so on. The abstractive approaches mainly utilize an encoder-decoder architecture, exploring various structures such as AE, VAE, or denoising autoencoder (DAE) (Chu and Liu, 2019; Bražinskas et al., 2020; Amplayo and Lapata, 2020; Iso et al., 2021; Zhang and Zhou, 2023). During training, these models are constrained by the objective of reconstructing the input text, and during generation, they use the average of text representations as the summary representation for decoding. Subsequent approaches aimed to enhance the controllability of generating summaries by explicitly (Suhara et al., 2020; Elsahar et al., 2021; Amplayo et al., 2021a; Ke et al., 2022) or implicitly (Amplayo et al., 2021b) modeling aspect information. Some methods also explore ways to fuse input information for summarization beyond simple averaging, utilizing techniques like composite optimization (Iso et al., 2021), Wasserstein barycenter (Song et al., 2022), or hierarchical discrete latent space (Hosking et al., 2023).

2.2 Debiasing Strategies in NLP

Bias in NLP systems can typically be categorized as internal bias and external bias (Elsafoury et al., 2023; Li et al., 2023a), depending on whether the bias is related to the training data of downstream tasks. Internal bias often pertains to issues of social fairness (Parraga et al., 2022), such as gender and racial bias, which have been identified in the embeddings of pre-trained language models (Guo et al., 2022). Existing work has attempted to address these issues through methods like adjusting pre-training data, introducing additional objectives, or post-processing.

On the other hand, external bias related to downstream tasks is often associated with task-specific features, such as entity bias in fake news detection (Zhu et al., 2022), position bias in emotion cause extraction (Yan et al., 2021), and language bias in Visual Question Answering (VQA) (Cadene et al., 2019), and so on. To mitigate these specific biases, two distinct approaches have been developed: data distribution-related and model training-related (Shah et al., 2020; Parraga et al., 2022; Li et al., 2023a). In the data distribution-related approach, efforts are made to re-sample, weight, or generate data to counteract bias (Dixon et al., 2018; Pruksachatkun et al., 2021; Qian et al., 2022). In contrast, model training-related methods explore adversarial techniques, causality (Cadene et al., 2019; Zhu et al., 2022), disentanglement, and additional auxiliary modules to mitigate bias.

3 Methodology

In this section, we describe LASS, a data augmentation method for debiasing that uses both LLMs and a small generator, a disentangled autoencoder. As Figure 1 shows, the overall architecture of LASS contains three processes: pair data creation via LLMs, Dis-AE model training, and data reproduction via Dis-AE. We first employ LLMs with manual demonstrations to obtain pairs of counterfactual data. Then, based on the paired samples, the disentanglement reconstruction model, Dis-AE, is trained as elaborated in Section 3.2. Finally, large-scale negative reviews are generated by data reproduction based on Dis-AE.

3.1 Pair Data Creation via LLM

To avoid generating negative reviews that contain unreasonable product information, we obtain synthetic data by rewriting the original positive text. Adhering to the principle of minimal modification, synthetic data with the opposite sentiment but identical content is generated by LLMs via prompts with manual demonstrations. The synthetic and original reviews then form counterfactual data pairs, which are used to train the disentanglement generator.

3.1.1 Prompt Design

We first devise a foundational prompt that leverages the in-context learning capabilities of LLMs to obtain emotionally opposite reviews. We then enhance the prompt design by incorporating human-annotated samples and revising the order of examples in the prompts.

Formally, our foundational prompt is defined as a demonstration set $P$, comprising a task instruction $D$ and $k$ demonstration examples. Thus, we have $P=\{D, s(x_{1},y_{1}), \cdots, s(x_{k},y_{k})\}$, where $s(x_{i},y_{i})$ denotes a pairwise example of emotional counterfactuals. Specifically, we define the task instruction $D$ as "Your task is to generate a counterfactual that retains internal coherence and avoids unnecessary changes." and randomly select $k$ samples from the counterfactually-augmented movie reviews dataset (Kaushik et al., 2020), where $k=5$. Furthermore, we set the temperature parameter to $T=0.2$ to encourage more deterministic output from the language model.
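As an illustration of how such a demonstration set could be assembled, the sketch below builds a prompt string from the task instruction $D$ and $k=5$ sampled pairs. The "Original:/Counterfactual:" template, the function names, and the toy demonstration pool are assumptions made for this example; the source specifies only the instruction text, $k$, and $T$.

```python
import random

# Task instruction D, quoted from the text above.
TASK_INSTRUCTION = (
    "Your task is to generate a counterfactual that retains internal "
    "coherence and avoids unnecessary changes."
)
TEMPERATURE = 0.2  # T = 0.2, passed to whichever LLM API is used (not shown)


def sample_demonstrations(pool, k=5, seed=0):
    """Randomly pick k (original, counterfactual) pairs from a pool."""
    return random.Random(seed).sample(pool, k)


def build_prompt(instruction, demonstrations, review):
    """Lay out P = {D, s(x_1, y_1), ..., s(x_k, y_k)} followed by the query."""
    parts = [instruction]
    for original, counterfactual in demonstrations:
        # Hypothetical template for one example s(x_i, y_i).
        parts.append(f"Original: {original}\nCounterfactual: {counterfactual}")
    parts.append(f"Original: {review}\nCounterfactual:")
    return "\n\n".join(parts)


# Toy stand-in for the counterfactually-augmented movie review pairs.
pool = [(f"positive review {i}", f"negative review {i}") for i in range(20)]
demos = sample_demonstrations(pool, k=5)
prompt = build_prompt(TASK_INSTRUCTION, demos, "The staff were friendly and fast.")
```

The resulting string would be sent to the LLM together with the temperature setting; the call itself is omitted because the source does not name a specific API.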

The foundational prompts are already capable of enabling LLMs to flexibly generate counterfactuals. For example, when given the input "Jose’s bandana must be giving him superpowers when he’s cooking!!", the model generates the counterfactual "maybe Jose’s bandana is covering his eyes when he’s cooking!!". However, there are still shortcomings in its performance, specifically manifested as incomplete transformations, where some positive text is retained, and illogical text (such as "disliking but frequently visiting"). Therefore, specific examples from the corresponding datasets should be added to the data-specific prompt. We started with a collection of rewrite failure examples based on manual evaluations. Then a small evaluation dataset $\mathcal{I}$ is constructed via random selection, which consists of $m$ raw reviews with transformation issues and $n$ reviews corresponding to reasonable transformations.

Afterward, we use an iterative approach to improve the prompt by observing the success rate of LLMs on the test set after adding manually annotated examples without any previous examples. Specifically, we randomly select a review $x_{t}$ from set $\mathcal{I}$ and obtain the example $s(x_{t},y_{t})$ manually. Then, we insert $s(x_{t},y_{t})$ into the current sequence of examples $\mathcal{C}$. The success rate of the LLMs in rewriting samples on the test set $\{\mathcal{I} - \mathcal{C}\}$ determines the best insertion position. Optimization stops when the improvement in success rate after adding new examples is less than $\varepsilon$, or when the overall success rate reaches $\delta$. The more detailed steps of the procedure and the final prompt are in Appendices C and D.
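The iterative procedure can be sketched as a greedy loop. This is only one reading of the description: the helper names (`annotate`, `success_rate`), the choice to discard an example whose gain falls below $\varepsilon$, and the toy success-rate table are assumptions for illustration, and the authoritative steps are those in the paper's appendix.

```python
import random


def optimize_examples(eval_set, annotate, success_rate, eps, delta, seed=0):
    """Greedy sketch: add one annotated example at a time, at its best position."""
    rng = random.Random(seed)
    pool = list(eval_set)            # the evaluation set I
    rng.shuffle(pool)                # random selection order for x_t
    examples = []                    # current ordered example sequence C
    rate = success_rate(examples, pool)
    while pool and rate < delta:
        x_t = pool.pop()
        pair = (x_t, annotate(x_t))  # s(x_t, y_t), written by hand in practice
        # Score every insertion position on the held-out reviews I - C.
        candidates = [examples[:i] + [pair] + examples[i:]
                      for i in range(len(examples) + 1)]
        scored = [(success_rate(c, pool), c) for c in candidates]
        best_rate, best = max(scored, key=lambda item: item[0])
        if best_rate - rate < eps:   # improvement too small: stop (assumed: discard)
            break
        examples, rate = best, best_rate
    return examples, rate


def toy_success_rate(examples, test_set):
    """Hypothetical stand-in for running the LLM and judging its rewrites."""
    table = [0.5, 0.7, 0.85, 0.855]
    return table[min(len(examples), len(table) - 1)]


reviews = [f"review {i}" for i in range(6)]
examples, rate = optimize_examples(
    reviews, lambda x: "negative " + x, toy_success_rate, eps=0.01, delta=0.95
)
```

With the toy table, the third example would raise the rate by only 0.005, so the loop stops with two examples. In a real run, `success_rate` would call the LLM with the candidate prompt and count acceptable rewrites.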

Figure 2: The architecture of the disentanglement model, Dis-AE. $E$ and $D$ are the encoder and the decoder. $y^{p}_{e}$, $y^{n}_{e}$ and $y_{r}$ are the emotion labels corresponding to the inputs $x^{p}$, $x^{n}$ and the label representation. $C$ is a sentiment classifier. $M$ is the number of categories for emotion classification.

3.2 Data Reproduction via Dis-AE

In this section, we describe data reproduction via the disentanglement reconstruction autoencoder, Dis-AE. We first describe the detailed components of the generator. We then present the calculation process of Dis-AE and explain how to train it. Finally, we introduce the data enhancement process, data reproduction through Dis-AE.

Given a set of text pairs (user reviews) with the same content but opposite emotional polarity, the aim of Dis-AE is to reconstruct the input pairs. As Figure 2 shows, the overall architecture of Dis-AE contains three components: an encoder $p_{\theta}$, an emotional classifier $C$, and a decoder $q_{\phi}$.

The Encoder $p_{\theta}$. Iso et al. (2021) show that large pre-trained language models such as BERT (Kenton and Toutanova, 2019) and GPT-2 (Radford et al., 2019) do not show a significant performance advantage over more lightweight model structures in unsupervised opinion summarization. Therefore, we employ the BIMEANVAE model (Iso et al., 2021), which uses a BiLSTM as the encoder $p_{\theta}(z_{e},z_{n}\mid x)$ and applies a mean pooling layer to the BiLSTM layer to obtain the primitive text representation $h$. Afterward, the sentiment representation $z_{e}$ and the content representation $z_{n}$ are obtained separately through different self-attention layers.

The Emotional Classifier $C$. The sentiment vectors $z^{p}_{e}$ and $z^{n}_{e}$ and the content vectors $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the classifier $C$ separately. The predictions for the emotion representations $z^{p}_{e}$ and $z^{n}_{e}$ should be the corresponding labels $y_{e}$ and $y_{n}$, which correspond to 5 and 1 in the sentiment rating of the dataset. The prediction for a content representation should contain no sentiment information and should therefore follow the uniform distribution $\mathcal{U}(0,M)$, where $M$ is the number of categories for emotion classification.
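To make the two kinds of classifier target concrete, the following sketch contrasts a one-hot target for an emotion representation with a uniform target for a content representation. It assumes $M=5$ rating classes and uses cross-entropy against the uniform target purely for illustration; the logits are made-up values and the paper's own loss definitions take precedence.

```python
import math

M = 5  # assumed number of emotion classes (ratings 1 to 5)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def one_hot(rating, num_classes=M):
    """One-hot target for a 1-based rating."""
    return [1.0 if i == rating - 1 else 0.0 for i in range(num_classes)]


uniform_target = [1.0 / M] * M          # target for content representations

# Hypothetical classifier logits for a positive emotion vector and a content vector.
emotion_logits = [0.1, 0.0, 0.2, 0.5, 3.0]
content_logits = [0.0, 0.0, 0.0, 0.0, 0.0]

loss_emotion = cross_entropy(one_hot(5), softmax(emotion_logits))
loss_content = cross_entropy(uniform_target, softmax(content_logits))
```

When the classifier is maximally uncertain about a content vector, the cross-entropy against the uniform target reaches its minimum value of $\log M$, which is the behaviour the uniform target is meant to encourage.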

The Decoder $q_{\phi}$. Following Iso et al. (2021), an LSTM is employed as the decoder $q_{\phi}$. The distribution $q_{\phi}(x\mid z)$ is computed by the reconstruction of the input $x^{p}$ from $z^{p}$ or $\widetilde{z}^{p}$.

3.2.1 Training of Dis-AE

A pair of texts $x^{p}$ and $x^{n}$ is given, which have consistent content and opposite emotional polarity $y_{e}$. In the training stage, the positive text $x^{p}$ is passed to the encoder $p_{\theta}(z_{e},z_{n}\mid x)$ to get two types of text representation, the sentiment $z^{p}_{e}$ and the content $z^{p}_{c}$. Similarly, $z^{n}_{e}$ and $z^{n}_{c}$ can be obtained for $x^{n}$. Since the content of the paired texts is similar but the emotion is opposite, their content representations $z^{p}_{c}$ and $z^{n}_{c}$ are constrained to resemble each other, while their emotional representations $z^{p}_{e}$ and $z^{n}_{e}$ are pushed apart.

These representations are all fed into the same emotional classifier $C$. To ensure that the emotion representation contains as little content information as possible, a learnable emotion label representation set $Z_{r}$ is used to replace $z^{p}_{e}$ and $z^{n}_{e}$. $Z_{r}$ is also constrained by the emotion classification loss $L_{r}$ and contains $M$ emotion label representations, where $M$ is the number of categories for emotion classification. Based on the emotion distributions $\hat{y}^{p}_{e}$ and $\hat{y}^{n}_{e}$ obtained from the corresponding emotion representations, the representation set $Z_{r}$ is weighted to get the final emotion representations $\widetilde{z}^{p}_{e}$ and $\widetilde{z}^{n}_{e}$.
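One plausible reading of this weighting is a probability-weighted sum of the label representations, $\widetilde{z}_{e}=\sum_{m=1}^{M}\hat{y}_{e,m}\,Z_{r}[m]$. The sketch below shows that computation on toy values; the function name, the three-class table, and the exact form of the weighting are assumptions, since the source does not spell out the formula here.

```python
def weighted_label_representation(probs, label_table):
    """Combine M label vectors using the predicted emotion distribution.

    probs:       list of M probabilities (classifier output for z_e)
    label_table: list of M vectors, a stand-in for the learnable set Z_r
    """
    dim = len(label_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, label_table))
        for d in range(dim)
    ]


# Toy example with M = 3 classes and 2-dimensional label vectors.
Z_r = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]
y_hat = [0.5, 0.25, 0.25]

z_tilde = weighted_label_representation(y_hat, Z_r)
```

Because the result depends on the input only through the predicted distribution, it carries emotion information but no review-specific content, which matches the stated motivation for replacing $z_{e}$.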

Then the document latent variable $z^{p}$ is obtained by concatenating $\widetilde{z}^{p}_{e}$ and $z^{p}_{c}$, and is used to reconstruct the input text $x^{p}$ through the decoder $q_{\phi}(x\mid z)$. Since the paired texts have similar content representations, combining the emotion representation with the other content representation $z^{n}_{c}$ should also represent the current text $x^{p}$. Thus, the positive counterfactual representation $\widetilde{z}^{p}$ is obtained by combining $\widetilde{z}^{p}_{e}$ and $z^{n}_{c}$, and is decoded to obtain $x^{p}$. Similarly, $\widetilde{z}^{n}_{e}$ is combined separately with $z^{n}_{c}$ and $z^{p}_{c}$, and decoded to obtain $x^{n}$.
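The four emotion and content combinations described above can be enumerated explicitly. The sketch assumes that each latent is a plain concatenation and that every combination is decoded toward the text whose emotion representation it carries; the vectors and target names are placeholders.

```python
def decoder_inputs(z_e_pos, z_c_pos, z_e_neg, z_c_neg):
    """Return (target text, latent) pairs for plain and swapped reconstruction."""
    return [
        ("x_p", z_e_pos + z_c_pos),   # z^p: own emotion + own content
        ("x_p", z_e_pos + z_c_neg),   # counterfactual: content taken from x^n
        ("x_n", z_e_neg + z_c_neg),   # z^n: own emotion + own content
        ("x_n", z_e_neg + z_c_pos),   # counterfactual: content taken from x^p
    ]


# Toy vectors: emotion parts differ, content parts are close to each other.
pairs = decoder_inputs(
    z_e_pos=[1.0], z_c_pos=[0.2, 0.3],
    z_e_neg=[-1.0], z_c_neg=[0.2, 0.4],
)
```

Swapping only the content half while keeping the decoding target fixed is what pushes the two content representations of a pair to be interchangeable.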

In order to ensure the basic ability of text generation, we retain the AE constraint, the reconstruction loss $L_{rec}$. When reconstructing the input pair separately, the representation $z^{p}$ obtained by concatenating $z^{p}_{e}$ and $z^{p}_{c}$ is used as the input of the decoder to reconstruct the input text $x^{p}$. The same procedure is applied to obtain the corresponding negative text $x^{n}$. The reconstruction loss is defined as:

$$
\begin{aligned}
L_{rec}(\theta,\phi) ={} & -\sum_{i=1}^{N} \underset{p_{\theta}\left(\widetilde{z}^{p}_{e}, z^{p}_{c} \mid x^{p}\right)}{\mathbb{E}} \left[\log q_{\phi}\left(x^{p} \mid \widetilde{z}^{p}_{e}, z^{p}_{c}\right)\right] \\
& -\sum_{i=1}^{N} \underset{p_{\theta}\left(\widetilde{z}^{n}_{e}, z^{n}_{c} \mid x^{n}\right)}{\mathbb{E}} \left[\log q_{\phi}\left(x^{n} \mid \widetilde{z}^{n}_{e}, z^{n}_{c}\right)\right],
\end{aligned}
\tag{1}
$$

where $\theta$ and $\phi$ are the parameters of the model. The reconstruction loss improves the quality of the decoded text and forces the text representation to retain content information together with emotion. To disentangle the emotional representation from the content representation, we employ an emotional auxiliary constraint $\mathcal{L}_{emo} = L_{e} + L_{n} + L_{r}$, which comprises the emotion classification constraint $L_{e}$, the emotion adversarial constraint $L_{n}$, and the label emotion constraint $L_{r}$.

The sentiment representations $z^{p}_{e}$ and $z^{n}_{e}$ and the content representations $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the classifier $C$ separately. The predictions for $z^{p}_{e}$ and $z^{n}_{e}$ should match the corresponding emotion labels $y^{p}_{e}$ and $y^{n}_{e}$, which is enforced with a cross-entropy loss:

$$\begin{aligned}
L_{e}(\theta) = &-\mathbb{E}_{p_{\theta}\left(z^{p}_{e}\right)} \sum_{i=1}^{M} y^{p}_{e} \log\left(p\left(\hat{y}^{p}_{e} \mid z^{p}_{e}\right)\right) \\
&-\mathbb{E}_{p_{\theta}\left(z^{n}_{e}\right)} \sum_{i=1}^{M} y^{n}_{e} \log\left(p\left(\hat{y}^{n}_{e} \mid z^{n}_{e}\right)\right).
\end{aligned} \tag{2}$$

Inspired by Pergola et al. (2021), rather than merely requiring that the content representations cannot be classified correctly, we assume that the content representations $z^{p}_{c}$ and $z^{n}_{c}$ are sentiment-neutral and should not exhibit any category bias during sentiment classification. Therefore, when $z^{p}_{c}$ and $z^{n}_{c}$ are fed into the sentiment classifier, the resulting sentiment classification distribution should be uniform, which is expressed as an expected KL divergence loss:

$$\begin{aligned}
L_{n}(\theta) = &-\mathbb{E}_{p_{\theta}\left(z^{p}_{c}\right)} \left[\mathbb{D}_{KL}\left(\mathcal{U}(0, M) \,\|\, p\left(\hat{y}^{p}_{c} \mid z^{p}_{c}\right)\right)\right] \\
&-\mathbb{E}_{p_{\theta}\left(z^{n}_{c}\right)} \left[\mathbb{D}_{KL}\left(\mathcal{U}(0, M) \,\|\, p\left(\hat{y}^{n}_{c} \mid z^{n}_{c}\right)\right)\right],
\end{aligned} \tag{3}$$

where $M$ is the total number of sentiment classes. The loss above is the expected KL divergence with respect to the uniform distribution $\mathcal{U}(0, M)$. Since an additional learnable label representation set $Z^{r} = \{z^{r}_{1}, \cdots, z^{r}_{M}\}$ is used to replace the emotion representations $z^{p}_{e}$ and $z^{n}_{e}$, $Z^{r}$ also needs to contain emotional information, which is enforced by a similar emotion classification loss:

$$L_{r} = -\sum_{i=1}^{M} y^{r}_{i} \log\left(p\left(\hat{y}^{r}_{i} \mid z^{r}_{i}\right)\right). \tag{4}$$

To further introduce the relational knowledge hidden in pairs of data, we add a distance loss $\mathcal{L}_{dis}$ and a counterfactual reconstruction loss $\mathcal{L}_{cf}$. The distance loss is based on the prior knowledge that the input text pair expresses opposite emotions but shares similar content. The distance between the representations is constrained based on their similarity:

$$L_{dis} = 2 + \mathrm{sim}\left(z^{p}_{e}, z^{n}_{e}\right) - \mathrm{sim}\left(z^{p}_{c}, z^{n}_{c}\right), \tag{5}$$

where $\mathrm{sim}(\cdot)$ denotes the cosine similarity function. Likewise, since the text pair $x^{p}$ and $x^{n}$ contain the same content information, swapping their content representations should still allow the corresponding text to be decoded successfully. Thus the counterfactual reconstruction loss is:

$$\begin{aligned}
L_{cf}(\theta, \phi) = &-\sum_{i=1}^{N} \underset{p_{\theta}\left(\widetilde{z}^{p}_{e}, z^{p}_{c} \mid x^{p}\right) p_{\theta}\left(\widetilde{z}^{n}_{e}, z^{n}_{c} \mid x^{n}\right)}{\mathbb{E}} \left[\log q_{\phi}\left(x^{p} \mid \widetilde{z}^{p}_{e}, z^{n}_{c}\right)\right] \\
&-\sum_{i=1}^{N} \underset{p_{\theta}\left(\widetilde{z}^{n}_{e}, z^{n}_{c} \mid x^{n}\right) p_{\theta}\left(\widetilde{z}^{p}_{e}, z^{p}_{c} \mid x^{p}\right)}{\mathbb{E}} \left[\log q_{\phi}\left(x^{n} \mid \widetilde{z}^{n}_{e}, z^{p}_{c}\right)\right].
\end{aligned} \tag{6}$$

Our final objective function is:

$$\mathcal{L} = L_{rec} + \alpha \mathcal{L}_{emo} + \beta \mathcal{L}_{dis} + \gamma \mathcal{L}_{cf}, \tag{7}$$

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters that control the strength of the constraints.
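
To make the distance loss (Eq. 5) and the weighted objective (Eq. 7) concrete, the following is a minimal sketch in plain Python. The function names and the toy vectors are illustrative assumptions, not the authors' implementation; in practice the representations would come from the Dis-AE encoder and the individual loss terms from the model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distance_loss(z_e_pos, z_e_neg, z_c_pos, z_c_neg) -> float:
    """Eq. 5: 2 + sim(z_e^p, z_e^n) - sim(z_c^p, z_c^n)."""
    return 2.0 + cosine_similarity(z_e_pos, z_e_neg) - cosine_similarity(z_c_pos, z_c_neg)


def total_loss(l_rec, l_emo, l_dis, l_cf, alpha, beta, gamma) -> float:
    """Eq. 7: reconstruction loss plus weighted auxiliary losses."""
    return l_rec + alpha * l_emo + beta * l_dis + gamma * l_cf


# Toy (hypothetical) vectors: opposite sentiment, identical content.
z_e_pos, z_e_neg = [1.0, 0.0], [-1.0, 0.0]
z_c = [0.5, 0.5]
z_c_flipped = [-0.5, -0.5]

best_case = distance_loss(z_e_pos, z_e_neg, z_c, z_c)           # close to 0
worst_case = distance_loss(z_e_pos, z_e_pos, z_c, z_c_flipped)  # close to 4
```

Because cosine similarity lies in [-1, 1], the constant 2 keeps the loss non-negative: it approaches 0 when the sentiment representations point in opposite directions and the content representations coincide.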

3.2.2 Data Reproduction

After training, data reproduction can be performed by selecting parent samples from the training set and combining their representations with the disentanglement model Dis-AE. Specifically, when negative reviews for a specific product are needed, positive reviews for that product are selected as parents, along with any negative reviews. The parent samples are fed into Dis-AE to obtain their sentiment representations and content representations separately. By combining the content representation of a positive review with the sentiment representation of a negative review, we obtain the child representation. Decoding the child representation yields a negative sample. This data reproduction approach ensures that the content and sentiment of the generated text are controllable, while the diversity of parent combinations also meets the demand for large-scale data augmentation.
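
The parent-combination step can be sketched as follows. This is only an illustration under stated assumptions: `Disentangled`, `make_child`, `reproduce`, and `toy_decode` are hypothetical names, and the vectors stand in for the representations that Dis-AE would actually produce.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass(frozen=True)
class Disentangled:
    """Sentiment and content parts of one encoded review (hypothetical container)."""
    sentiment: Vector
    content: Vector


def make_child(content_parent: Disentangled, sentiment_parent: Disentangled) -> Disentangled:
    """Keep the content of one parent and the sentiment of the other."""
    return Disentangled(sentiment=sentiment_parent.sentiment, content=content_parent.content)


def reproduce(
    positive_parents: Sequence[Disentangled],
    negative_parents: Sequence[Disentangled],
    decode: Callable[[Disentangled], str],
) -> List[str]:
    """Decode every (positive content, negative sentiment) combination."""
    return [decode(make_child(pos, neg)) for pos, neg in product(positive_parents, negative_parents)]


def toy_decode(child: Disentangled) -> str:
    """Stand-in for the Dis-AE decoder; it only prints the two parts."""
    return f"content={child.content} sentiment={child.sentiment}"


positives = [Disentangled([1.0], [0.1]), Disentangled([1.0], [0.2]), Disentangled([1.0], [0.3])]
negatives = [Disentangled([-1.0], [0.7]), Disentangled([-0.8], [0.9])]
children = reproduce(positives, negatives, toy_decode)
```

With 3 positive and 2 negative parents the sketch yields 6 children, which illustrates why the number of candidate samples grows with the number of parent combinations.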

Because the small model has limited generation ability, the generated text may be unreadable or carry the wrong sentiment polarity. Therefore, we add a data filtering step based on perplexity and sentiment classification to ensure the quality of the generated text.
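
A minimal sketch of such a filter is given below. The scoring functions are passed in as callables because the concrete models are external; the stub scores, the function name `filter_generated`, and the label string `"negative"` are assumptions made for illustration. The threshold of 125 is the value reported in Section 4.3.

```python
from typing import Callable, Dict, Iterable, List


def filter_generated(
    texts: Iterable[str],
    perplexity_fn: Callable[[str], float],
    sentiment_fn: Callable[[str], str],
    max_ppl: float,
    target_label: str = "negative",
) -> List[str]:
    """Keep texts that are fluent enough and carry the target sentiment."""
    kept = []
    for text in texts:
        fluent = perplexity_fn(text) < max_ppl
        on_target = sentiment_fn(text) == target_label
        if fluent and on_target:
            kept.append(text)
    return kept


# Stub scores standing in for a language model and a sentiment classifier.
fake_ppl: Dict[str, float] = {
    "the battery died after two days": 48.0,
    "battery the the died died two": 910.0,
    "works great, love it": 35.0,
}
fake_sentiment: Dict[str, str] = {
    "the battery died after two days": "negative",
    "battery the the died died two": "negative",
    "works great, love it": "positive",
}

kept = filter_generated(fake_ppl, fake_ppl.get, fake_sentiment.get, max_ppl=125.0)
```

Only the first stub sentence survives: the second exceeds the perplexity threshold and the third has the wrong polarity.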

4 Experiments

4.1 Datasets

We performed experiments on two opinion summarization benchmarks, the Amazon dataset (Bražinskas et al., 2020) and Yelp (Chu and Liu, 2019). Both datasets include review ratings on a 1–5 scale, which we used as sentiment labels. Besides training reviews, these two datasets also contain gold-standard summaries for 200 and 60 sampled objects for evaluation.

However, extreme sentiment bias also exists in the evaluation data. Therefore, we extracted 800 positive and 800 negative products from the training data of both datasets, using half for validation and the other half for testing. Each product consists of 7 or 8 reviews, all rated 5 for positive or 1 for negative sentiment. Because the reviews of each product share a consistent sentiment polarity, we used them to assess the ability of summarization models to produce summaries with different sentiment polarities for positive (POS) and negative (NEG) products.

| (%) | Amazon Pos Rev | Amazon Pos Sen | Amazon Pos Dif | Amazon Neg Rev | Amazon Neg Sen | Amazon Neg Dif | Yelp Pos Rev | Yelp Pos Sen | Yelp Pos Dif | Yelp Neg Rev | Yelp Neg Sen | Yelp Neg Dif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wassos(T) | 93.25 | 88.97 | - | 20.63 | 19.84 | - | 98.25 | 91.51 | - | 43.5 | 47.25 | - |
| Wassos(O) | 93.5 | 92.49 | - | 7.13 | 10.31 | - | 79.25 | 78.93 | - | 59.25 | 53.28 | - |
| TRACE(a) | 91.63 | 82.29 | - | 24.38 | 29.61 | - | 100 | 94.53 | - | 68.5 | 57.08 | - |
| TRACE | 89.25 | 80.94 | - | 40.5 | 38.82 | - | 99.5 | 97.44 | - | 8.5 | 10.92 | - |
| Copycat | **93.75** | **84.69** | - | 16.25 | 16.40 | - | **97.75** | **88.43** | - | 47.75 | 41.15 | - |
| +GPT | 60.95 | 57.00 | -30.3 | 70.63 | 55.09 | +46.5 | 95.00 | 76.71 | **-7.2** | 78.13 | 63.96 | +26.6 |
| +LASS | 61.13 | 64.01 | **-26.7** | **76.75** | **58.34** | **+51.2** | 93.38 | 76 | -8.4 | **86.5** | **64.02** | **+30.8** |
| Coop(a) | 81.75 | 76.05 | - | 46.88 | 41.39 | - | 99.875 | 93.23 | - | 34 | 39.31 | - |
| +GPT | **94.38** | 88.26 | **+12.4** | **90.88** | **79.68** | **+41.1** | 99.63 | 94.98 | +0.7 | 77.50 | **73.48** | +38.8 |
| +LASS | 92.75 | **86.42** | +10.7 | 89.38 | 74.14 | +37.6 | **99.75** | **97.26** | **+2.0** | **79.25** | 72.37 | **+39.2** |
| Coop | 82.75 | 76.97 | - | 58 | 47.64 | - | 99 | 92.38 | - | 51.5 | 47.55 | - |
| +GPT | 90.63 | 81.40 | +6.2 | **93** | **76.48** | **+31.9** | **100** | **95.52** | **+2.1** | **93.38** | **80.86** | **+37.6** |
| +LASS | **90.75** | **82.38** | **+6.7** | 84.88 | 68.89 | +24.1 | 99.75 | 95.41 | +1.9 | 90.25 | 77.13 | +34.2 |

Table 2: Sentiment accuracy results on Amazon and Yelp. The bold scores denote the best scores. On Amazon, the amounts of data enhanced via GPT for Copycat, Coop(a), and Coop are 450k, 360k, and 360k respectively, while on Yelp, they are 630k, 450, and 540k respectively. The synthetic data used to train the Dis-AE model in LASS is 200k, consistent across all models and datasets.

| | Amazon R1 | Amazon R2 | Amazon RL | Yelp R1 | Yelp R2 | Yelp RL |
|---|---|---|---|---|---|---|
| Wassos(T) | 29.7 | 6.5 | 20.0 | 30.8 | 5.9 | 18.3 |
| Wassos(O) | 32.5 | 7.2 | 21.8 | 26.6 | 4.5 | 16.4 |
| TRACE(a) | 33.7 | 6.3 | 20.5 | 32.6 | 6.6 | 20.0 |
| TRACE | 36.0 | 7.2 | 20.8 | 33.9 | 6.8 | 19.7 |
| Copycat | 31.9 | **6.1** | **20.4** | 29.3 | 5.4 | 17.7 |
| +GPT | **32.3** | 5.9 | 19.7 | **30.0** | 5.6 | 18.8 |
| +LASS | 31.8 | 5.8 | 19.5 | 29.4 | **6.0** | **19.2** |
| Coop(a) | 32.1 | 5.1 | 18.1 | 30.6 | 5.9 | 18.8 |
| +GPT | 32.2 | **7.1** | 20.2 | **31.6** | 6.4 | 19.5 |
| +LASS | **32.9** | 6.3 | **20.4** | 31.6 | **6.9** | **19.6** |
| Coop | 35.7 | 6.2 | 19.8 | **34.5** | **6.9** | **19.6** |
| +GPT | 35.6 | 6.4 | 20.6 | 34.0 | 6.8 | 19.5 |
| +LASS | **36.2** | **7.0** | **21.4** | 33.8 | 6.9 | 19.4 |

Table 3: ROUGE scores on Amazon and Yelp. The bold scores denote the best scores.

4.2 Evaluation Metrics and Baselines

We evaluate summarization systems with the standard ROUGE-1, ROUGE-2, and ROUGE-L metrics (Lin, 2004). We also report sentiment precision for positive and negative products at the sentence level (Sen) and review level (Rev), computed with the sentiment analysis model from Stanza (Qi et al., 2020). All ratings are normalized to scores between 0 and 1. More details are in the Appendix. The term "Dif" represents the average change in sentiment accuracy at both the review and sentence levels after data augmentation using GPT or LASS.
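
As a worked example of the "Dif" column, the sketch below averages the review-level and sentence-level changes between a base model and its augmented variant. The function name `dif` is an assumption; the numbers are taken from Table 2.

```python
def dif(base_rev: float, base_sen: float, aug_rev: float, aug_sen: float) -> float:
    """Average change in Rev and Sen scores after data augmentation."""
    return ((aug_rev - base_rev) + (aug_sen - base_sen)) / 2


# Coop(a) -> Coop(a)+GPT, Amazon, positive products (Table 2).
coop_a_gpt_pos = round(dif(81.75, 76.05, 94.38, 88.26), 1)   # 12.4

# Copycat -> Copycat+LASS, Amazon, negative products (Table 2).
copycat_lass_neg = round(dif(16.25, 16.40, 76.75, 58.34), 1)  # 51.2
```

Both values match the corresponding entries reported in Table 2 (+12.4 and +51.2).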

Following prior work (Iso et al., 2021; Song et al., 2022), we compare with Copycat (Bražinskas et al., 2020), Coop (Iso et al., 2021), Wassos (Song et al., 2022) and TRACE (Zhang and Zhou, 2023). (a), (O), and (T) represent different clustering strategies for the model. A detailed introduction is given in the Appendix. Considering the sensitivity of the counter-templates in TRACE to training data, we experimented with data augmentation methods based on ChatGPT and LASS on three models: Coop, Coop(a), and Copycat.

4.3 Implementation Details

In this work, we employ the ChatGPT platform (https://chat.openai.com/chat) to generate pairwise emotional counterfactuals within a crafted prompt setting. For the prompt optimization, $m=40$, $n=10$, $\delta=80\%$ and $\varepsilon=10\%$. The final prompts include 5 pairs of examples for the Amazon dataset and 7 pairs for Yelp. Specifically, we extract the samples with a sentiment score of 5 from the training data.

Figure 3: Experimental results for Coop and Copycat with different sizes of synthesized data from GPT or LASS on Amazon and Yelp. The horizontal axis represents the amount of synthetic data added, measured in increments of 10k. For example, 9 corresponds to 90k.

For the disentanglement model Dis-AE, we used the Adam optimizer (Kingma and Ba, 2015) with a linear scheduler, whose initial learning rate is set to $5 \times 10^{-4}$. For beam search during generation, the beam size is set to 4 and the maximum number of tokens to 70. The amount of training data used is 200k, according to the analysis in Section 4.5. Additionally, based on the PPL testing conducted on the training set, we set the PPL threshold to 125. Only generated samples with a PPL below 125 that are classified as negative by the review-level sentiment classifier from Stanza (Qi et al., 2020) are retained. To prevent an imbalance among the multiple constraints from undermining the text generation capability, we mimic KL annealing (Li et al., 2019; Iso et al., 2021) and gradually increase $\alpha$, $\beta$, and $\gamma$ from 0 during training. The upper limit for the weight of the sentiment loss $\alpha$ is set to 5, while $\beta$ and $\gamma$ are both limited to 1. All experiments were conducted on NVIDIA GeForce RTX 3090 or NVIDIA Tesla V100 GPUs.
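
The gradual increase of the loss weights can be illustrated with a simple schedule. The text does not specify the shape of the schedule, so the linear ramp and the `warmup_steps` parameter below are assumptions; only the starting value of 0 and the upper limits (5 for $\alpha$, 1 for $\beta$ and $\gamma$) come from the description above.

```python
from typing import Dict


def annealed_weight(step: int, warmup_steps: int, upper_limit: float) -> float:
    """Linearly ramp a weight from 0 to upper_limit, then hold it constant."""
    if warmup_steps <= 0:
        return upper_limit
    return upper_limit * min(1.0, step / warmup_steps)


def loss_weights(step: int, warmup_steps: int = 1000) -> Dict[str, float]:
    """Weights for the auxiliary losses at a given training step."""
    return {
        "alpha": annealed_weight(step, warmup_steps, 5.0),  # sentiment loss weight
        "beta": annealed_weight(step, warmup_steps, 1.0),   # distance loss weight
        "gamma": annealed_weight(step, warmup_steps, 1.0),  # counterfactual loss weight
    }


start = loss_weights(0)
halfway = loss_weights(500)
after_warmup = loss_weights(2000)
```

Early in training the objective is dominated by the reconstruction loss, so the decoder learns to generate text before the disentanglement constraints reach full strength.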

4.4 Results

According to Table 2, synthesized data from both LASS and GPT significantly enhance the model's performance in nearly all sentiment accuracy measures, whether at the review or sentence level. The exception is the Copycat model: augmenting it with negative sentiment data improves negative sentiment accuracy but harms positive sentiment accuracy. For this model, however, LASS improves negative sentiment accuracy more than GPT and maintains a higher positive sentiment accuracy on the Amazon dataset.

This kind of exception may be attributed to the combined influence of data augmentation methods, summarization models, and datasets. From the perspective of summarization models, the overall performance of the Copycat model is inferior to that of Coop(a) and Coop in terms of both sentiment accuracy and ROUGE scores. For positive sentiment accuracy, model performance on the Yelp dataset is, on the whole, better than that on Amazon. This may be because the Yelp data mainly consists of restaurant reviews, making it easier for models to learn expressions of positivity and negativity compared to the diverse product types in the Amazon data.

The ROUGE scores in Table 3 show that none of the methods exhibits a substantial performance decrease after data augmentation using GPT or LASS. This suggests that the data augmentation methods are applicable across different models and do not degrade the performance of the models on the original task. It also indicates that both the GPT and LASS methods generate highly readable data, and even with a large-scale addition to the training data, they do not disrupt the training of summarization tasks.

| Num (k) | PPL ↓ | R1 | R2 | RL |
|---|---|---|---|---|
| 50 | 540.25 | 53.61 | 16.80 | 34.74 |
| 100 | 1314.50 | 60.57 | 23.31 | 42.65 |
| 150 | 788.96 | 56.25 | 34.08 | 50.92 |
| 200 | 360.13 | 59.55 | 38.85 | 54.86 |
| 250 | 403.57 | 59.29 | 39.36 | 54.44 |

Table 4: Experimental results for Dis-AE with different sizes of training data on Amazon.

4.5 Analysis

To investigate the impact of different synthetic data on summarization models, we analyzed the sentiment accuracies of different summarization models using varying amounts of augmentation data from GPT or LASS, as shown in Figure 3. Overall, adding negative reviews can improve the negative sentiment accuracy of summaries, although it may affect the ability to generate positive summaries to some extent. For Coop, the positive accuracy on Amazon shows some instability as the data volume increases. Meanwhile, Copycat's positive accuracy experiences a significant decline, suggesting that Copycat may not handle sentiment information well in summaries and tends to generate neutral text with mixed positive and negative sentiments.

Additionally, we explored the amount of data required for training Dis-AE. Evaluating whether the quality of the generated text meets the training requirements of summarization requires a lot of downstream experiments. To confirm the data requirements more efficiently, we employ two metrics: perplexity (PPL) and counterfactual reconstruction ROUGE score. The counterfactual reconstruction ROUGE score is similar to the counterfactual reconstruction loss $L_{cf}$, calculating the ROUGE score of reconstructed text after exchanging paired counterfactual samples with target text. PPL relies on GPT-2 to compute the degree of text fluency (https://huggingface.co/docs/transformers/perplexity).

As shown in Table 4, the quality of generation improves steadily as the data volume increases, with instabilities observed after reaching 200k. The PPL for 50k is lower than that for 100k because samples shorter than 10 characters are not included in the PPL calculation, as PPL becomes erratic for excessively short texts.
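
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from per-token log-probabilities and skips samples shorter than 10 characters, as described above. Averaging per-sample perplexities over the corpus, as well as the function names, are assumptions made for illustration; in practice the log-probabilities would come from GPT-2.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) of the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_corpus_perplexity(
    samples: Sequence[Tuple[str, List[float]]], min_chars: int = 10
) -> float:
    """Average per-sample perplexity, ignoring texts shorter than min_chars."""
    scores = [perplexity(lp) for text, lp in samples if len(text) >= min_chars]
    if not scores:
        return float("nan")
    return sum(scores) / len(scores)


# A model assigning probability 0.5 to every token has perplexity 2.
uniform_half = perplexity([math.log(0.5)] * 4)

samples = [
    ("bad", [math.log(0.01)]),                               # too short, skipped
    ("the strap broke quickly", [math.log(0.5)] * 4),        # kept
]
corpus_ppl = mean_corpus_perplexity(samples)
```

The short sample would contribute a perplexity of about 100 on its own, which shows how a few very short texts can dominate the average if they are not excluded.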

5 Limitation

Overall, while debiasing through data augmentation can generalize across different models, its effectiveness is also limited by the performance and characteristics of each model. For example, in the current scenario, the Copycat model experienced significant degradation in positive sentiment accuracy after using augmented data on the Amazon dataset. For another model, TRACE, changes in data distribution significantly affect the quality of the summaries, as observed in our preliminary experiments. This may be attributed to one of its parameters, the counter-template, being sensitive to the training data. Additionally, determining the minimum data required for Dis-AE training is a critical issue. The current approach, based on perplexity and counterfactual reconstruction metrics, only indirectly reflects the quality of generated counterfactual texts. We will continue to explore the training data requirement for Dis-AE in future work.

6 Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and helpful suggestions. The authors acknowledge financial support from the National Natural Science Foundation of China (62176053). This research work is also supported by the Big Data Computing Center of Southeast University. YH was supported by a Turing AI Fellowship (EP/V020579/1, EP/V020579/2) funded by UK Research and Innovation.

References

  • Abaskohi et al. (2023) A. Abaskohi, S. Rothe, and Y. Yaghoobzadeh. 2023. LM-CPPF: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 670–681.
  • Amplayo et al. (2021a) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2021a. Aspect-controllable opinion summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Amplayo et al. (2021b) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2021b. Unsupervised opinion summarization with content planning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Cadene et al. (2019) Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. 2019. RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32.
  • Chu and Liu (2019) Eric Chu and Peter Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In International Conference on Machine Learning, pages 1223–1232. PMLR.
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.
  • Elsafoury et al. (2023) Fatma Elsafoury, Stamos Katsigiannis, and Naeem Ramzan. 2023. On bias and fairness in NLP: How to have a fairer text classification? arXiv e-prints, pages arXiv–2305.
  • Elsahar et al. (2021) Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. 2021. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
  • Gao et al. (2023) Jiahui Gao, Renjie Pi, Lin Yong, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, and Lingpeng Kong. 2023. Self-guided noise-free data generation for efficient zero-shot learning. In International Conference on Learning Representations (ICLR 2023).
  • Guo et al. (2022) Yue Guo, Yi Yang, and Ahmed Abbasi. 2022. Auto-debias: Debiasing masked language models with automated biased prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1012–1023.
  • Hosking et al. (2023) Tom Hosking, Hao Tang, and Mirella Lapata. 2023. Attributable and scalable opinion summarization. arXiv preprint arXiv:2305.11603.
  • Iso et al. (2021) Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, and Wang-Chiew Tan. 2021. Convex aggregation for opinion summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3885–3903.
  • Kaushik et al. (2020) Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.
  • Ke et al. (2022) Wenjun Ke, Jinhua Gao, Huawei Shen, and Xueqi Cheng. 2022. Consistsum: Unsupervised opinion summarization with the consistency of aspect, sentiment and semantic. In Proceedings of the fifteenth ACM international conference on web search and data mining, pages 467–475.
  • Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).
  • Li et al. (2019) Bohan Li, Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick, and Yiming Yang. 2019. A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3603–3614.
  • Li et al. (2023a) Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023a. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149.
  • Li et al. (2023b) Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, and Tieyun Qian. 2023b. Large language models as counterfactual generator: Strengths and weaknesses. arXiv preprint arXiv:2305.14791.
  • Li et al. (2023c) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023c. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards zero-shot language understanding. Advances in Neural Information Processing Systems, 35:462–477.
  • Pan et al. (2023) Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.
  • Parraga et al. (2022) Otávio Parraga, Martin D More, Christian M Oliveira, Nathan S Gavenski, Lucas S Kupssinskü, Adilson Medronha, Luis V Moura, Gabriel S Simões, and Rodrigo C Barros. 2022. Debiasing methods for fairer neural models in vision and language research: A survey. arXiv preprint arXiv:2211.05617.
  • Pergola et al. (2021) Gabriele Pergola, Lin Gui, and Yulan He. 2021. A disentangled adversarial neural topic model for separating opinions from plots in user reviews. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2870–2883.
  • Pruksachatkun et al. (2021) Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, and Kai-Wei Chang. 2021. Does robustness improve fairness? approaching fairness with word substitution robustness methods for text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3320–3331.
  • Qi et al. (2020) Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
  • Qian et al. (2022) Rebecca Qian, Candace Ross, Jude Fernandes, Eric Michael Smith, Douwe Kiela, and Adina Williams. 2022. Perturbation augmentation for fairer nlp. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Shah et al. (2020) Deven Santosh Shah, H Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264.
  • Song et al. (2022) Jiayu Song, Iman Munire Bilal, Adam Tsakalidis, Rob Procter, and Maria Liakata. 2022. Unsupervised opinion summarisation in the wasserstein space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8592–8607.
  • Suhara et al. (2020) Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. Opiniondigest: A simple framework for opinion summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Wang et al. (2021) Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Yan et al. (2021) Hanqi Yan, Lin Gui, Gabriele Pergola, and Yulan He. 2021. Position bias mitigation: A knowledge-aware graph model for emotion cause extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3364–3375.
  • Ye et al. (2022) Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022. Zerogen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653–11669.
  • Yoo et al. (2021) Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhang and Zhou (2023) Yanyue Zhang and Deyu Zhou. 2023. Disentangling text representation with counter-template for unsupervised opinion summarization. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6344–6357.
  • Zhu et al. (2022) Yongchun Zhu, Qiang Sheng, Juan Cao, Shuokai Li, Danding Wang, and Fuzhen Zhuang. 2022. Generalizing to the future: Mitigating entity bias in fake news detection. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2120–2125.

Appendix A Sentiment Evaluation

For positive reviews, the sentiment score is 1 for positive, 0.5 for neutral, and 0 for negative, while for the negative set, negative is scored 1. Review-level precision assigns a single score to the entire text, while sentence-level precision assigns a score to each sentence in the text and then averages the scores.
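The two scoring granularities can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function and table names are hypothetical, the sentiment labels are assumed to come from some external classifier or annotator, and the neutral and positive values for the negative set (0.5 and 0) are an assumed mirror of the positive-set scale, since the text only states that negative is scored 1 there.

```python
from statistics import mean

# Positive-set values come from the text. For the negative set, only
# "negative = 1" is stated; the other two values are an assumption.
POSITIVE_SET_SCORES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}
NEGATIVE_SET_SCORES = {"negative": 1.0, "neutral": 0.5, "positive": 0.0}


def _table(target):
    """Pick the score table for the positive or the negative set."""
    return POSITIVE_SET_SCORES if target == "positive" else NEGATIVE_SET_SCORES


def review_level_score(review_label, target="positive"):
    """Review level: one label, and hence one score, for the entire text."""
    return _table(target)[review_label]


def sentence_level_score(sentence_labels, target="positive"):
    """Sentence level: score every sentence label, then average."""
    table = _table(target)
    return mean(table[label] for label in sentence_labels)
```

For example, a text whose sentences are labeled positive, negative, neutral, and positive gets a sentence-level score of (1 + 0 + 0.5 + 1) / 4 = 0.625 on the positive set.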

Appendix B Baselines

We compare our method against the following unsupervised summarization approaches. Copycat (Bražinskas et al., 2020) captures the dependency relationship between the product and reviews by defining a hierarchical VAE. Coop (Iso et al., 2021) searches input combinations for the summary aggregation using input-output word overlap. $a$ represents the use of a simple averaging strategy, while the other represents the retrieval strategy of Coop. Wassos (Song et al., 2022) uses the Wasserstein barycenter of the semantic and syntactic distributions to obtain the summary. $O$ and $T$ represent different clustering strategies. TRACE (Zhang and Zhou, 2023) is based on text representation disentanglement with generated counter-templates. $a$ represents the use of a simple averaging strategy, while the other represents the retrieval strategy of Coop.

Appendix C Algorithm

Algorithm 1 Prompt Optimization
Require:  instruction $D$, test set $\mathcal{I}=\{x_{1},\cdots,x_{|\mathcal{I}|}\}$, example permutation $\mathcal{S}$, candidate example set $\mathcal{C}=\mathcal{I}$, time step $t=1$.
Ensure:  optimized prompt $P\leftarrow P_{t}$.
1:  repeat
2:     Randomly select a review $x_{t}$ from set $\mathcal{C}$ and obtain the example $s(x_{t},y_{t})$ manually.
3:     Insert $s(x_{t},y_{t})$ into $\mathcal{S}$ to obtain the permutation set $\{\mathcal{S}_{t}^{1},\cdots,\mathcal{S}_{t}^{|\mathcal{S}|+1}\}$, in which each permutation contains $|\mathcal{S}|+1$ examples.
4:     for $i=1$ to $|\mathcal{S}|+1$ do
5:        $P_{t}^{i}=\{D,\mathcal{S}_{t}^{i}\}$;
6:        $score_{t}^{i}\leftarrow score(\{\mathcal{I}-\mathcal{S}\}|P_{t}^{i})$;
7:     end for
8:     update permutation $\mathcal{S}$: $\mathcal{S}=\underset{\mathcal{S}_{t}^{i}}{argmax}\ score_{t}^{i}$;
9:     $\mathcal{C}=\{\}$;
10:     add $x_{i}$ into $\mathcal{C}$ if $score(x_{i}|P_{t})<0$;
11:     $t=t+1$;
12:  until $score(\{\mathcal{I}-\mathcal{S}\}|P_{t})>\delta$ or $score(\{\mathcal{I}-\mathcal{S}\}|P_{t})-score(\{\mathcal{I}-\mathcal{S}\}|P_{t-1})<\varepsilon$.
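The loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. `annotate` (the manual annotation of step 2) and `score` (the success-rate score) are hypothetical callbacks supplied by the caller. Three readings are also assumptions: the held-out set excludes every review already used as an example, step 10 is read as "the current prompt fails on $x_i$" and implemented as `score <= 0`, and the `max_steps` guard and the early exit on an empty candidate set are additions that keep the sketch from looping forever.

```python
import random


def optimize_prompt(instruction, test_set, annotate, score,
                    delta, epsilon, max_steps=20, seed=0):
    """Greedy search over example orderings, following Algorithm 1."""
    rng = random.Random(seed)
    examples = []                # current permutation S of (x, y) pairs
    candidates = list(test_set)  # candidate set C, initialised to I
    previous = None              # score under P_{t-1}; undefined at t = 1
    prompt = (instruction, tuple(examples))

    for _ in range(max_steps):   # assumption: safety bound on the loop
        if not candidates:       # assumption: stop when nothing fails
            break
        # Step 2: pick a review and annotate it manually.
        x_t = rng.choice(candidates)
        pair = (x_t, annotate(x_t))

        # Step 3: insert the new pair at every position of S.
        orderings = [examples[:i] + [pair] + examples[i:]
                     for i in range(len(examples) + 1)]

        # I - S: reviews not used as examples (assumed reading).
        used = {x for x, _ in orderings[0]}
        held_out = [x for x in test_set if x not in used]

        # Steps 4-8: keep the ordering with the highest score.
        scored = [(score(held_out, (instruction, tuple(o))), o)
                  for o in orderings]
        current, examples = max(scored, key=lambda item: item[0])
        prompt = (instruction, tuple(examples))

        # Steps 9-10: rebuild C from reviews the prompt fails on.
        candidates = [x for x in held_out if score([x], prompt) <= 0]

        # Step 12: stop on a high score or a small improvement.
        if current > delta:
            break
        if previous is not None and current - previous < epsilon:
            break
        previous = current

    return prompt
```

The returned value is a pair of the instruction and the selected example ordering, mirroring $P_t=\{D,\mathcal{S}\}$; reviews must be hashable (for instance, plain strings) for the held-out filtering to work.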

The success rate of the LLM, $score(S|P_{t})$, is a score evaluated on dataset $S=\{x_{1},\cdots,x_{k}\}$ under prompt $P_{t}$, which is defined as:

$$score(S|P_{t})=\sum_{i=1}^{|S|}HumanEval(LLM(x_{i},P_{t})), \tag{8}$$

where $LLM(x_{i},P_{t})$ is the LLM's output given input $x_{i}$ and prompt $P_{t}$. $HumanEval$ is a score given by human evaluation, whose value belongs to $\{0,1\}$: 1 indicates conformity to normative standards, and 0 indicates issues in reasonableness or sentiment polarity after generation.
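As a concrete reading of Eq. (8), the score is a count of the generated outputs that pass the binary human check. The Python sketch below is illustrative only: `llm` and `human_eval` are hypothetical callables standing in for the language model call and the human judgement, neither of which is specified as code in the paper.

```python
def score(dataset, prompt, llm, human_eval):
    """Eq. (8): sum of binary human judgements over the dataset."""
    total = 0
    for x in dataset:
        output = llm(x, prompt)          # LLM(x_i, P_t)
        judgement = human_eval(output)   # HumanEval(...) in {0, 1}
        if judgement not in (0, 1):
            raise ValueError("human_eval must return 0 or 1")
        total += judgement
    return total
```

Because Eq. (8) is a sum rather than a mean, the value grows with $|S|$; dividing by `len(dataset)` would give a rate in $[0,1]$, but that normalization is not part of the equation as stated.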

Appendix D Prompt

Here is the foundational prompt employed to obtain annotated validation datasets for prompt optimization:

Your task is to generate a counterfactual that retains internal coherence and avoids unnecessary changes.

Example: Really good movie. Maybe the best I’ve ever seen. Alien invasion, a la The Blob, with crazy good acting. Meteorite turns beautiful woman into a host body for nasty tongue. Engaging plot, great tongue. Absurd comedy worth watching. Maybe don’t wash your hair or take out the trash but take time out to watch this movie.

Counterfactual: Really bad movie. Maybe the worst I’ve ever seen. Alien invasion, a la The Blob, without the acting. Meteorite turns beautiful woman into a host body for nasty tongue. Bad plot, bad fake tongue. Absurd comedy worth missing. Wash your hair or take out the trash.

Example: I rated this a 5. The dubbing was as good as I have seen. The plot - wow. I’m not sure which made the movie more great. Jet Li is definitely a great martial artist, as good as Jackie Chan.

Counterfactual: I rated this a 3. The dubbing was as bad as I have seen. The plot - yuck. I’m not sure which ruined the movie more. Jet Li is definitely a great martial artist, but I’ll stick to Jackie Chan movies until somebody tells me Jet’s English is up to par.

Example: Greenaway seems to have a habit of trying hard to entertain his viewers. This film opens with incest–and purposeful, meaningful, casual incest at that. That’s Greenaway’s focus. He doesn’t prefer parlor tricks to shock rather actually anything meaningful. Technical skill isn’t enough. He’s a bit perverse for the sake of perversity but it works out well.

Counterfactual: Greenaway seems to have a habit of trying deliberately to disgust his viewers. This film opens with incest–and purposeless, meaningless, casual incest at that. That’s Greenaway’s big problem. He prefers parlor tricks to shock over actually doing anything meaningful. Technical skill isn’t enough. He’s just a bit perverse for the sake of perversity.

Example: This is one of the most awesome movies ever. Shaq better do more movies. This movie just gave me a good bit of life and I will always remember that. I will never make fun of this movie until I die, and then even after! It is just so wonderful and even funny. MST3000 would have a blast with this one.

Counterfactual: This is one of the most god-awful movies ever. Shaq better just stick to basketball. This movie took away apart of my life I will never have back. I will make fun of this movie until I die, and then some. It is so horrible it is not even funny. MST3000 would have a blast with this one.

Example: There’s something wonderful about the fact that a movie made in 1934 can be head and shoulders above every Tarzan movie that followed it, including the bloated and boring 1980s piece Greystoke. Once the viewer gets past the first three scenes, which are admittedly dull, Tarzan and his Mate takes off like a shot, offering non-stop action, humor, and romance. Maureen O’Sullivan is charming and beautiful as Jane and walks off with the movie. Weismuller is solid as well. Highly recommended.

Counterfactual: There’s something awful about the fact that a movie made in 1934 can be head and shoulders below every Tarzan movie that followed it, including the bloated and boring 1980s piece Greystoke. Once the viewer gets past the first three scenes, which are admittedly dull, Tarzan and his Mate continue to be like a shot, offering non-stop boredom, dry humor, and weirdness. Maureen O’Sullivan is mean and ugly as Jane and walks off with the movie. Weismuller is rude as well. Not recommended.
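A prompt of this shape can be assembled programmatically from the instruction and a list of example pairs. The sketch below is a hedged illustration: `build_prompt` is a hypothetical helper, and appending the review to be rewritten as a final "Example:" followed by an open "Counterfactual:" is an assumption about how the query is attached, since the listing above shows only the instruction and the demonstrations.

```python
def build_prompt(instruction, examples, source_review):
    """Join the instruction, demonstration pairs, and the query review."""
    parts = [instruction]
    for original, counterfactual in examples:
        parts.append(f"Example: {original}")
        parts.append(f"Counterfactual: {counterfactual}")
    # Assumed query layout: the model continues after the open label.
    parts.append(f"Example: {source_review}")
    parts.append("Counterfactual:")
    return "\n\n".join(parts)
```

Since the order of `examples` is preserved in the output, the same helper can render each candidate ordering of the demonstrations when different permutations are compared.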

D.1 Added Examples After Prompt Optimization

In Prompt Optimization, we annotated $k_{1}$ examples from the Amazon dataset and $k_{2}$ examples from the Yelp dataset to improve performance in counterfactual generation, where $k_{1}=5$ and $k_{2}=7$.

Here are the annotated examples from the Amazon dataset:

Example: I tried connecting my iPhone 4S to my 2012 Ford Focus using a standard 3.5mm audio cable, but it sounded awful and noisy. Instead, I purchased this cable and now the audio going into my car sounds perfect! This is the best $3-5 I could have spent to improve my car audio.

Counterfactual: I tried connecting my iPhone 4S to my 2012 Ford Focus using a standard 3.5mm audio cable, but it sounded awful and noisy. Instead, I purchased this cable and now the audio going into my car still sounds awful! This is the worst $3-5 I could have spent to improve my car audio.

Example: I ordered this for my 3 yr old for Halloween. He loved it!! The candy catcher in the front is really neat, but probably need to take a pail or something else along also because it can get to be heavy if they get a lot of candy. I was very pleased with the way it fit and everything.

Counterfactual: I ordered this for my 3 yr old for Halloween. He prefer another one!! The candy catcher in the front is really small, but probably need to take a pail or something else along also because it can get to be heavy if they get a lot of candy. I was concerned about the way it fit and everything.

Example: I loved this steamer when I got it, and it has remained a very stable item to use. I feel confident taking it out of the microwave when hot because it has never dumped hot food all over me.

Counterfactual: I disliked this steamer when I got it, and it has remained a very unstable item to use. I feel hesitant taking it out of the microwave when hot because it has frequently spilled hot food all over me.

Example: Purse looks great. The bag is cute and flashy but the size is smaller than expected overall. The stones and straps are not very durable and break or fall off easily.

Counterfactual: The purse looks awful. The bag is unattractive and plain but the size is just the expected overall. The stones and straps are just durable and break or fall off not easily.

Example: The tank fit very well and was comfortable to wear. The material was thicker than I expected, and I felt it was a great value for the price. I’ve bought similar quality tanks for $10 at a local store.

Counterfactual: The tank didn’t fit well at all and it was quite uncomfortable to wear. The material was much thinner than I expected, and I felt it was not a good value for the price. I’ve bought similar quality tanks for less than $10 at a local store.

Here are the annotated examples from the Yelp dataset:

Example: Nothing special here. The music is too loud, the drinks too pricey, and the servers to shapely for the clothing they are wearing. Not that there are many options around job.com arena to choose from, sadly this is probably the best.

Counterfactual: A special place here. The music is just the right volume, the drinks are reasonably priced, and the servers are dressed decently. There are many good options around job.com arena to choose from, luckily this is probably the best.

Example: My wife and I had dinner and wine here during their last week open. The food and wine was fantastic as always. It is unfortunate that Twisted Rose closed its doors. They will be missed.

Counterfactual: My wife and I had dinner and wine here during their last week open. The food and wine was terrible as always. It is fortunate that Twisted Rose closed its doors. They will not be missed.

Example: Pro: Brightly lit, open late Con: Waaay overpriced unless you typically drive in the mud and need lots of car washes for a monthly fee.

Counterfactual: Con: Dimly lit, open early Pro: Surprisingly affordable unless you typically drive in the mud and need lots of car washes for a monthly fee.

Example: One hour wait for mediocre food. But at least the place pumps uber loud music so everyone had to scream to be heard.

Counterfactual: No wait for delicious food. The place plays music at the right volume so everyone could have to talk without any need to raise their voices.

Example: Excellent and fresh ingredients, make this a must go to for tasty sushi. Staff is unfriendly, but restaurant is spacious.

Counterfactual: Mediocre and stale ingredients, make this a place to avoid for tasty sushi. Although the staff is friendly, the restaurant is cramped.

Example: Nice place. Quick and easy. Had the eggs and corned beef hash special - which was great. Would come back to try more. Coffee was not good - especially with so many good coffee options in the Strip.

Counterfactual: Awful place. Slow and complicated. Had the eggs and corned beef hash not special - which was terrible. Would never come back to try more. Coffee was surprisingly good - especially with so many bad coffee options in the Strip.

Example: It’s pretty much better than you expect for the money. Nothing to complain in terms of food and in comparison to barbarians it is more affordable… And they even have pickled vegetables for appetizers :)

Counterfactual: It’s pretty much worse than you expect for the money. Plenty to complain about in terms of food and in comparison to barbarians it is much more expensive… And they don’t even have pickled vegetables for appetizers :(

| Source Text | Counterfactual (Basic Prompt) | Counterfactual (Optimized Prompt) |
| --- | --- | --- |
| Great place to eat. Food always tastes fresh. Frequently visit ray road location. Ice machine always seems to be working. Very sanitary to scoop ice with a clean scooper provided. | Great place to eat. Food always taste fresh. Frequently visit ray road location. Ice machine always seems to be broke. Kind of unsanitary to scoop ice with scooper laying in the ice. | Terrible place to eat. Food always tastes stale. Rarely visit ray road location. Ice machine never seems to be working. Very unsanitary to scoop ice with a dirty scooper provided. |
| I’ve been here before…amazing food …and the consistency is always on point…service was exceptional…love this place cause it is close to the house…Jose’s bandana must be giving him superpowers when he’s cooking!! | I’ve been here before…good food but the consistency needs improvement…service was mediocre still like this place cause it is close to the house… maybe Jose’s bandana is covering his eyes when he’s cooking!! | I’ve been here before…terrible food …and the inconsistency is always a problem…service was terrible…hate this place cause it is far from the house… Jose’s bandana must be giving him bad luck when he’s cooking!! |
| It’s great. Floor was clean and our waiter seemed knowledgeable. Food was excellent and definitely worth writing home about. | It’s ok. Floor was dirty and our waiter seemed almost clueless. Food was good but not "write home about" | It’s terrible. The floor was dirty and our waiter seemed clueless. The food was awful and definitely not worth mentioning. |

Table 5: Some of the counterfactuals generated under the Basic Prompt and Optimized Prompt settings. Red text marks negative content, and blue text marks positive content.