License: arXiv.org perpetual non-exclusive license
arXiv:2403.05265v2 [cs.AI] 14 Mar 2024

MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts

Zinan Zeng1    Sen Ye 1    Zijian Cai1    Heng Wang1
Yuhan Liu1   Haokai Zhang1    Minnan Luo 1
1Xi’an Jiaotong University
{2194214554, ys2003, 2205114706, wh2213210554, lyh6560, zhanghaokai}@stu.xjtu.edu.cn
[email protected]
*Corresponding author: Minnan Luo, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China.
Abstract

Online movie review websites are valuable for information and discussion about movies. However, the large number of spoiler reviews detracts from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews’ text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user’s information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta features from the user-movie network, the review’s textual content, and the review’s metadata, respectively. To handle genre-specific spoilers, we then adopt a Mixture-of-Experts architecture to process information in the three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE’s superiority in robustness and generalization.

1 Introduction

Movie websites such as IMDb and Rotten Tomatoes have served as popular social platforms facilitating commentary, discussion, and recommendation about movies Cao et al. (2019). However, a substantial number of reviews on these websites reveal critical plot details in advance, known as spoilers. Spoilers diminish the suspense and surprise of the movie and may evoke negative emotions in the users Loewenstein (1994). Therefore, an effective spoiler detection method is needed to protect users’ experience.

Figure 1: The information of a spoiler review from multiple sources. The text-based detection method struggles to identify whether this review is a spoiler. However, we can identify the review to be a spoiler by jointly considering the reviewer’s historical preference and the review’s metadata. The red font indicates the information which helps determine whether the review contains spoilers.

Existing spoiler detection methods mainly focus on the textual content. Chang et al. (2018) encode review sentences and movie genres together to detect spoilers. Wan et al. (2019) incorporate Hierarchical Attention Network Yang et al. (2016) and introduce user bias and item bias. Chang et al. (2021) exploit syntax-aware graph neural networks to model dependency relations in context words. Wang et al. (2023) take into account external movie knowledge and user interactions to promote effective spoiler detection.

However, the approaches proposed so far still have limitations. Firstly, solely relying on the textual content is inadequate for robust spoiler detection Wang et al. (2023). We argue that integrating multiple information sources (metadata, user profile, movie synopsis, etc.) is necessary for reliable spoiler detection. For instance, as shown in Figure 1, it is challenging to discern whether this review contains spoilers solely based on its textual content. However, this review can be correctly identified as a spoiler by analyzing the reviewer’s historical reviews and building a user profile for this reviewer. In addition, the vote count in the metadata also suggests that the review is a potential spoiler. Secondly, the spoiler language tends to be genre-specific, as people’s focus varies depending on the genre of movies, resulting in distinct characteristics in their reviews. Specifically, for science fiction films, individuals tend to focus on the quality of special effects. In the case of action movies, the fight scenes become the primary highlight. On the other hand, for suspense movies, the plot takes precedence. Consequently, spoilers in reviews vary significantly across domains. Existing methods fail to differentiate these reviews with varying styles, posing challenges in adapting to the increasingly diverse landscape of spoiler reviews.

To address these challenges, we propose MMoE (Multi-modal Mixture-of-Experts), which leverages multi-modal information and domain-aware Mixture-of-Experts. Specifically, we first train multiple encoders for different types of information using a series of pretext tasks. Next, we use these encoders to obtain the features of reviews from the graph view, text view, and meta view. We then adopt Mixture-of-Experts (MoE) to assign the information from different aspects to certain domains. Finally, we use a transformer encoder to combine the information from all three perspectives. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further extensive experiments also validate our design choices.

Figure 2: MMoE: a multi-modal mixture-of-experts framework that jointly leverages the review’s metadata, text, and graph features for robust and generalizable spoiler detection. Metadata, text, and graph information are first processed by modal-specific encoders and then fed into Mixture-of-Experts layers. The user profile extraction module is employed to analyze the reviewer’s historical preference and learn an embedding for the user. Finally, an expert fusion layer is adopted to integrate the three information sources and classify spoilers.

2 Related Work

Spoiler detection aims to automatically detect spoiler reviews in television Boyd-Graber et al. (2013), books Wan et al. (2019), and movies Wang et al. (2023), thereby protecting users’ experiences. Earlier methods usually design handcrafted features and apply a traditional classifier. Guo and Ramakrishnan (2010) use bag-of-words embeddings and an LDA model Blei et al. (2003) to detect spoilers in movie comments. Boyd-Graber et al. (2013) combine lexical features with metadata features and use an SVM model Cortes and Vapnik (1995) as the classifier. Recently, deep-learning-based detection methods have become dominant. Chang et al. (2018) propose a model with a genre-aware attention mechanism. However, they do not take into account fine-grained movie text information. Wan et al. (2019) develop SpoilerNet, which uses HAN (Hierarchical Attention Network) Yang et al. (2016) to learn sentence embeddings and then applies a GRU Cho et al. (2014) on top of it. SpoilerNet also considers user bias and item bias. However, it simply models them as learnable vectors. Chang et al. (2021) use a bi-directional LSTM Hochreiter and Schmidhuber (1997) to extract word features and feed the embeddings into a graph neural network to pass and aggregate messages on the dependency graph. However, it is worth noting that the authors only incorporate the movie’s genre information at the final pooling stage. These methods basically use RNN-based networks (such as LSTM and GRU) as text encoders, and review contents are the primary or even the only reference information. Wang et al. (2023) first introduce the user network and external movie knowledge into the spoiler detection task and validate their effectiveness. However, their approach falls short of adequately leveraging user information and adopts a simplistic encoding strategy for the text, relying solely on average pooling.

Given the limitations of the above work, we develop a comprehensive framework which leverages multi-modal information and the domain-aware Mixture-of-Experts for robust and generalizable spoiler detection. Our method MMoE establishes a new state-of-the-art in spoiler detection.

3 Methodology

The overall architecture of MMoE is illustrated in Figure 2. Specifically, we first encode the review’s meta, text, and graph information to obtain comprehensive representations from three perspectives. We also propose a user profile extraction module which learns from the reviewer’s historical reviews and analyzes the reviewer’s preference. To deal with genre-specific spoilers, we then adopt the Mixture-of-Experts (MoE) architecture Jacobs et al. (1991); Shazeer et al. (2017) to process features in different modalities. MoE is able to assign reviews with different characteristics to different experts for robust classification Liu et al. (2023). To facilitate information interaction, we finally use an expert fusion layer to integrate the information from the three perspectives and classify whether the review is a spoiler.

3.1 Modal-specific Feature Encoder

Metadata Encoder. The metadata associated with spoiler reviews tends to differ from that of regular reviews. Consequently, we gather the review metadata as auxiliary information for classification. Details of metadata are illustrated in Appendix A. Once this numerical information is collected, we employ a two-layer MLP as the meta encoder.
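As a rough illustration of the meta encoder, the sketch below applies a two-layer MLP to a numerical metadata vector. The layer sizes, the ReLU between the two layers, the random initialisation, and the example feature values are all assumptions made for the illustration; the text above only states that a two-layer MLP is used.

```python
import random

def linear(x, W, b):
    """Affine map; W is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(x):
    return [max(v, 0.0) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def meta_encoder(meta, params):
    """Two-layer MLP: linear -> ReLU -> linear (the activation is an assumption)."""
    (W1, b1), (W2, b2) = params
    return linear(relu(linear(meta, W1, b1)), W2, b2)

rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 6, rng)]
# Hypothetical, already-normalised metadata values for a single review.
meta_features = [0.3, 1.0, 0.0, 0.7]
meta_embedding = meta_encoder(meta_features, params)
```

In practice the weights would be learned jointly with the rest of the network rather than left at their initial values.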

Text Encoder. The textual content plays a crucial role in spoiler detection. To obtain high-quality embeddings, we employ RoBERTa Liu et al. (2019) as our text encoder. Initially, we fine-tune RoBERTa through a binary classification task using the textual content of reviews, which ensures that the model is specifically tailored for our spoiler detection task. Subsequently, we utilize the fine-tuned RoBERTa to encode the review content and transform the encoded embedding with a single-layer MLP.

Graph Encoder. To model the complex relations and interactions between users, reviews, and movies, we employ a graph neural network to update each review’s feature using the features of the corresponding user and movie. We first construct a directed graph consisting of the following three types of nodes and three types of edges:
N0: User.
N1: Movie.
N2: Review.
E1: Movie-Review. We connect a review node with a movie node if the review is about the movie.
E2: User-Review. We connect a review node with a user node if the user posts the review.
E3: Review-User. We use this type of edge to enable message passing between reviews.
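The graph construction above could be sketched as follows. The toy records, the identifier strings, and the edge directions (source node first) are assumptions for illustration; the node and edge types follow the list above.

```python
from collections import defaultdict

# Hypothetical toy records: (review_id, user_id, movie_id).
reviews = [("r1", "u1", "m1"), ("r2", "u1", "m2"), ("r3", "u2", "m1")]

edges = {"movie-review": [], "user-review": [], "review-user": []}
for review_id, user_id, movie_id in reviews:
    edges["movie-review"].append((movie_id, review_id))  # E1: movie -> review
    edges["user-review"].append((user_id, review_id))    # E2: user -> review
    edges["review-user"].append((review_id, user_id))    # E3: review -> user

# In-neighbour lists N(i): every node that points to node i.
in_neighbors = defaultdict(list)
for edge_list in edges.values():
    for src, dst in edge_list:
        in_neighbors[dst].append(src)
```

With these directions, a review receives messages from its movie and its author, and the review-to-user edges let information from one review reach the same user's other reviews after two hops.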

For movie and review nodes, we encode their synopsis and review content respectively by the fine-tuned RoBERTa as the input feature. For user nodes, we design a user profile extraction module (Section 3.2) to extract their profiles as the initial feature. Initial node features are transformed by a linear layer followed by a ReLU activation, i.e.,

$$\mathbb{g}^{(0)}_{i} = \max(\mathbb{W}_{\textit{in}} \cdot [\mathbb{t}_{i}, \mathbb{m}_{i}] + \mathbb{b}_{\textit{in}}, 0),$$

where $\mathbb{m}_{i}$, $\mathbb{t}_{i}$ and $\mathbb{g}^{(0)}_{i}$ denote the metadata features, the text features, and the initial graph embedding of node $i$. $[\cdot,\cdot]$ denotes the concatenation operation. $\mathbb{W}_{\textit{in}}$ and $\mathbb{b}_{\textit{in}}$ are parameters of the linear layer. We then use Graph Attention Network (GAT) Velickovic et al. (2017) as the graph encoder to obtain the embedding of reviews from the graph modality, i.e.,

$$
\begin{aligned}
\mathbb{g}_{i}^{(l+1)} &= \alpha_{i,i}\mathbb{\Theta}_{s}\mathbb{g}_{i}^{(l)} + \sum_{j \in N(i)} \alpha_{i,j}\mathbb{\Theta}_{t}\mathbb{g}_{j}^{(l)}, \\
\alpha_{i,j} &= \frac{\exp\left(f\left(\mathbb{a}_{s}^{T}\mathbb{\Theta}_{s}\mathbb{g}_{i}^{(l)} + \mathbb{a}_{t}^{T}\mathbb{\Theta}_{t}\mathbb{g}_{j}^{(l)}\right)\right)}{\sum_{k \in N(i) \cup \{i\}} \exp\left(f\left(\mathbb{a}_{s}^{T}\mathbb{\Theta}_{s}\mathbb{g}_{i}^{(l)} + \mathbb{a}_{t}^{T}\mathbb{\Theta}_{t}\mathbb{g}_{k}^{(l)}\right)\right)},
\end{aligned}
$$

where $f$ denotes the Leaky ReLU activation function, $\mathbb{g}_{i}^{(l)}$ is the embedding of node $i$ in layer $l$, and $N(i)$ is the set of neighbors of node $i$. In the directed graph, $N(i)$ denotes all nodes which point to node $i$. $\alpha_{i,j}$ is the attention score between node $i$ and node $j$. $\mathbb{\Theta}_{s} \in \mathbb{R}^{d_{\textit{in}} \times d_{\textit{out}}}$, $\mathbb{\Theta}_{t} \in \mathbb{R}^{d_{\textit{in}} \times d_{\textit{out}}}$, $\mathbb{a}_{s} \in \mathbb{R}^{d_{\textit{in}}}$, and $\mathbb{a}_{t} \in \mathbb{R}^{d_{\textit{in}}}$ are learnable parameters. $d_{\textit{in}}$ and $d_{\textit{out}}$ are the dimensions of the input and output vectors, respectively.

We add a ReLU activation function between consecutive GAT layers. After $L$ layers of GAT, we obtain the review embeddings from the graph view.
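To make the attention update concrete, here is a minimal pure-Python sketch of a single GAT layer followed by the inter-layer ReLU. It is not the authors' implementation: the function names, the Leaky ReLU slope of 0.2, and the toy graph are assumptions, and the matrix shapes are simply chosen so that the products in the formulas are defined.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(g, in_neighbors, theta_s, theta_t, a_s, a_t):
    """One layer; g maps node -> embedding, in_neighbors maps node -> N(i)."""
    out = {}
    for i, g_i in g.items():
        src_term = dot(a_s, matvec(theta_s, g_i))
        candidates = list(in_neighbors.get(i, [])) + [i]  # N(i) ∪ {i}
        scores = [leaky_relu(src_term + dot(a_t, matvec(theta_t, g[k])))
                  for k in candidates]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alphas = [w / total for w in weights]
        # The self term uses theta_s, neighbour terms use theta_t.
        h = [alphas[-1] * v for v in matvec(theta_s, g_i)]
        for alpha, k in zip(alphas[:-1], candidates[:-1]):
            h = [hv + alpha * v for hv, v in zip(h, matvec(theta_t, g[k]))]
        out[i] = h
    return out

# Toy graph: a user "u" and a movie "m" point to review "r"; "r" points back to "u".
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
g0 = {"u": [1.0, 0.0], "m": [0.0, 1.0], "r": [1.0, 1.0]}
in_neighbors = {"r": ["u", "m"], "u": ["r"]}
g1 = gat_layer(g0, in_neighbors, identity, identity, zeros, zeros)
# ReLU between consecutive GAT layers.
g1 = {node: [max(v, 0.0) for v in h] for node, h in g1.items()}
```

With zero attention vectors every candidate gets the same score, so the layer reduces to a plain average over a node and its in-neighbours, which makes the toy output easy to check by hand.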

3.2 User Profile Extraction Module

Users tend to have stable habits: some rarely post spoiler reviews, while others post them frequently. The proportion of spoiler reviews per user is reported in Appendix A, which illustrates this bias in detail. Therefore, capturing user preferences through their profiles can significantly aid spoiler detection. While users’ self-descriptions are a direct source of profile information, unfortunately most users do not provide descriptions on film websites, so the initial information of user nodes is often missing in the graph. To address this, we learn a user profile embedding with a user profile extraction module, which takes the user’s historical reviews as input and outputs a summary embedding indicating the user’s preference.

To be specific, we concatenate the raw semantic features of users and the semantic features of their reviews into a sequence, i.e.,

$$\mathbb{s}_{i} = [\mathbb{t}_{i}^{\textit{raw}}, \mathbb{t}_{i_{1}}, \mathbb{t}_{i_{2}}, \cdots, \mathbb{t}_{i_{n}}]$$

where $\mathbb{t}_{i}^{\textit{raw}}$ is the raw text feature of the $i$-th user’s description encoded by RoBERTa, and $\mathbb{t}_{i_{1}}, \mathbb{t}_{i_{2}}, \cdots, \mathbb{t}_{i_{n}}$ are the text features of the first, second, $\cdots$, and last review of user $i$. $\mathbb{s}_{i}$ is the input sequence of the module. Since the number of reviews per user can vary, we employ the “maximum length” strategy. Sequences shorter than the maximum length are padded with zero vectors, while sequences longer than the maximum length are truncated to ensure uniform length.
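The “maximum length” strategy can be sketched as below. The function name, the returned padding mask, and the choice to count the raw profile feature toward the maximum length are assumptions for illustration.

```python
def build_user_sequence(raw_feature, review_features, max_len):
    """Return [t_raw, t_1, ..., t_n] padded or truncated to max_len vectors.

    Also returns a mask (1 = real vector, 0 = padding); the mask is a
    hypothetical helper, not something described in the text.
    """
    dim = len(raw_feature)
    sequence = [raw_feature] + list(review_features)
    sequence = sequence[:max_len]      # truncate long histories
    mask = [1] * len(sequence)
    while len(sequence) < max_len:     # pad short histories with zero vectors
        sequence.append([0.0] * dim)
        mask.append(0)
    return sequence, mask

short_seq, short_mask = build_user_sequence([1.0, 1.0], [[2.0, 2.0]], max_len=4)
long_seq, long_mask = build_user_sequence([1.0], [[2.0], [3.0], [4.0]], max_len=2)
```

A mask of this kind is commonly passed to the attention layers so that padded positions are ignored.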

After obtaining the input sequence, we use a transformer encoder Vaswani et al. (2017) to get the output sequence. The encoder summarizes the user’s historical reviews and utilizes self-attention mechanisms to learn a comprehensive profile embedding that reflects the user’s preference. We pre-train the encoder by attaching a classification head after each review embedding, i.e.,

$$
\begin{aligned}
\mathbb{s}^{\prime}_{i} &= \mathrm{TRM}(\mathbb{s}_{i}), \\
\hat{\mathbb{p}}_{i} &= \mathrm{softmax}(\mathbb{W}_{u} \cdot \mathbb{s}^{\prime}_{i} + \mathbb{b}_{u}),
\end{aligned}
$$

where $\mathbb{s}^{\prime}_{i}$ is the output sequence and $\hat{\mathbb{p}}_{i}$ is the predicted output. We only compute the loss for the reviews within the training set.
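Restricting the pre-training loss to training-set reviews amounts to a masked cross-entropy over the review positions of each sequence. The sketch below assumes the classification head has already produced two-class logits per review position; the function names and the averaging over kept positions are illustrative choices.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(position_logits, labels, in_train):
    """Average cross-entropy over the review positions flagged as training reviews."""
    losses = []
    for logits, label, use in zip(position_logits, labels, in_train):
        if use:
            losses.append(-math.log(softmax(logits)[label]))
    return sum(losses) / len(losses) if losses else 0.0

# Three reviews of one user; the second belongs to the validation or test split.
logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
labels = [1, 1, 1]          # 1 = spoiler, 0 = non-spoiler (hypothetical encoding)
in_train = [True, False, True]
loss = masked_cross_entropy(logits, labels, in_train)
```

The second position would incur a large loss if it were included, but it is skipped, so only the two training reviews contribute.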

After pre-training, we use the encoder to perform forward propagation on all sequences and extract the first embedding in the sequence (corresponding to the position of the user’s raw profile feature in the input) as the user’s profile feature, denoted as $\mathbb{t}_{i}$. The embedding will then be fixed in the model by

$$\mathbb{t}_{i} = \mathbb{s}^{\prime}_{i}[0].$$

3.3 Domain-Aware MoE Layer

Inspired by the successful applications of Mixture-of-Experts in NLP and bot detection Shazeer et al. (2017); Fedus et al. (2022); Liu et al. (2023), we adopt MoE to divide and conquer the information in the three modalities. Since spoiler reviews exhibit distinct characteristics across movie genres, we leverage the MoE framework to activate different experts for reviews belonging to different domains. We calculate the weight $G_{j}$ of each expert $E_{j}$ in the same way as Shazeer et al. (2017). Each expert $E_{j}$ is a 2-layer MLP, i.e.,

$$\mathbb{z}^{\textit{mod}}_{i} = \sum_{j=1}^{n} G_{j}(\mathbb{x}^{\textit{mod}}_{i}) E_{j}(\mathbb{x}^{\textit{mod}}_{i}),$$

where $\mathbb{x}^{\textit{mod}}_{i}$ is the input embedding of review $i$, $\mathbb{z}^{\textit{mod}}$ is the output feature, and $\textit{mod} \in \{m, t, g\}$.
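The weighted sum over experts can be illustrated with a simplified gate. The sketch below uses a deterministic top-k softmax gate; the gating of Shazeer et al. (2017) additionally adds trainable noise to the gate logits, which is omitted here. The helper names, the bias-free experts, the value of k, and the toy weights are assumptions for illustration.

```python
import math

def top_k_gate(x, W_gate, k):
    """Softmax over the k largest gate logits; all other experts get weight 0."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W_gate]
    keep = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    top = max(logits[j] for j in keep)
    exps = {j: math.exp(logits[j] - top) for j in keep}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(logits))]

def make_expert(W1, W2):
    """Two-layer MLP expert without biases (kept minimal for the sketch)."""
    def expert(x):
        hidden = [max(sum(w * v for w, v in zip(row, x)), 0.0) for row in W1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return expert

def moe_layer(x, W_gate, experts, k):
    """z = sum_j G_j(x) * E_j(x), evaluating only experts with non-zero weight."""
    gates = top_k_gate(x, W_gate, k)
    z = None
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue
        y = expert(x)
        z = [gate * v for v in y] if z is None else [a + gate * v for a, v in zip(z, y)]
    return z, gates

identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    make_expert(identity, identity),
    make_expert(identity, [[2.0, 0.0], [0.0, 2.0]]),
    make_expert(identity, [[3.0, 0.0], [0.0, 3.0]]),
]
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
z, gates = moe_layer([1.0, 1.0], W_gate, experts, k=2)
```

In MMoE one such layer is applied per modality, so the same routine would be called separately on the meta, text, and graph embeddings of a review.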

Table 1: Accuracy, AUC, and binary F1-score of MMoE and other baselines on the two datasets. We repeat all experiments five times and report the average performance with standard deviation. Bold indicates the best performance, underline the second best. MMoE significantly outperforms the previous state-of-the-art method on two benchmarks on all metrics.
| Model | Kaggle F1 | Kaggle AUC | Kaggle Acc | LCS F1 | LCS AUC | LCS Acc |
|---|---|---|---|---|---|---|
| BERT Devlin et al. (2018) | 44.02 (±1.09) | 63.46 (±0.46) | 77.78 (±0.09) | 46.14 (±2.84) | 65.55 (±1.36) | 79.96 (±0.38) |
| RoBERTa Liu et al. (2019) | 50.93 (±0.76) | 66.94 (±0.40) | 79.12 (±0.10) | 47.72 (±0.44) | 65.55 (±0.22) | 80.16 (±0.03) |
| BART Lewis et al. (2019) | 46.89 (±1.55) | 64.88 (±0.71) | 78.47 (±0.06) | 48.18 (±1.22) | 65.79 (±0.62) | 80.14 (±0.07) |
| DeBERTa He et al. (2021) | 49.94 (±1.13) | 66.42 (±0.59) | 79.08 (±0.09) | 47.38 (±2.22) | 65.42 (±1.08) | 80.13 (±0.08) |
| GCN Kipf and Welling (2016) | 59.22 (±1.18) | 71.61 (±0.74) | 82.08 (±0.26) | 62.12 (±1.18) | 73.72 (±0.89) | 83.92 (±0.23) |
| R-GCN Schlichtkrull et al. (2018) | 63.07 (±0.81) | 74.09 (±0.60) | 82.96 (±0.09) | 66.00 (±0.99) | 76.18 (±0.72) | 85.19 (±0.21) |
| GAT Velickovic et al. (2017) | 60.98 (±0.09) | 72.72 (±0.06) | 82.43 (±0.01) | 65.73 (±0.12) | 75.92 (±0.13) | 85.18 (±0.02) |
| SimpleHGN Lv et al. (2021) | 60.12 (±1.04) | 71.61 (±0.74) | 82.08 (±0.26) | 63.79 (±0.88) | 74.64 (±0.64) | 84.66 (±1.61) |
| DNSD Chang et al. (2018) | 46.33 (±2.37) | 64.50 (±1.11) | 78.44 (±0.12) | 44.69 (±1.63) | 64.10 (±0.74) | 79.76 (±0.08) |
| SpoilerNet Wan et al. (2019) | 57.19 (±0.66) | 70.64 (±0.44) | 79.85 (±0.12) | 62.86 (±0.38) | 74.62 (±0.09) | 83.23 (±1.63) |
| MVSD Wang et al. (2023) | 65.08 (±0.69) | 75.42 (±0.56) | 83.59 (±0.11) | 69.22 (±0.61) | 78.26 (±0.63) | 86.37 (±0.08) |
| MMoE (Ours) | 71.24 (±0.08) | 79.61 (±0.09) | 86.00 (±0.04) | 75.04 (±0.06) | 82.23 (±0.04) | 88.58 (±0.02) |

3.4 Expert Fusion Layer

After obtaining the review’s representations processed by domain-aware experts in three modalities, we further combine the representations in three modalities by a multi-head transformer encoder to facilitate modality interaction, i.e.,

$$
\begin{aligned}
\mathbb{u}_{i} &= [\mathbb{z}^{m}_{i}, \mathbb{z}^{t}_{i}, \mathbb{z}^{g}_{i}], \\
\mathbb{v}_{i} &= \mathrm{TRM}(\mathbb{u}_{i}),
\end{aligned}
$$

where $\mathbb{z}^{m}_{i}$, $\mathbb{z}^{t}_{i}$, and $\mathbb{z}^{g}_{i}$ are the features from the meta view, text view, and graph view, respectively. $\mathbb{u}_{i}$ represents the concatenated sequence and $\mathbb{v}_{i}$ denotes the output sequence of the transformer encoder. We finally flatten $\mathbb{v}_{i}$ and apply a linear output layer to classify, i.e.,

$$\hat{\mathbb{y}}_{i} = \mathbb{W}_{o} \cdot \mathrm{flatten}(\mathbb{v}_{i}) + \mathbb{b}_{o}.$$
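The data flow of the fusion layer (stack three modality vectors as a length-3 sequence, let them attend to each other, flatten, classify) can be sketched as below. This is a heavy simplification: a single-head self-attention with identity query, key, and value projections stands in for the multi-head transformer encoder, with no feed-forward block, residual connection, or layer normalisation. All names and toy values are assumptions.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

def fuse_and_classify(z_m, z_t, z_g, W_o, b_o):
    u = [z_m, z_t, z_g]                       # sequence of three modality tokens
    v = self_attention(u)                     # simplified stand-in for TRM(u)
    flat = [x for token in v for x in token]  # flatten(v)
    return [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(W_o, b_o)]

# Toy check: identical tokens are left unchanged by the attention step.
W_o = [[1.0] * 6, [0.0] * 6]
b_o = [0.0, 0.5]
logits = fuse_and_classify([1.0, 2.0], [1.0, 2.0], [1.0, 2.0], W_o, b_o)
```

Treating each modality as one token lets every view weight the other two before classification, which is the interaction the fusion layer is meant to provide.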

3.5 Learning and Optimization

We optimize the network by cross-entropy loss with $L_{2}$ regularization and balancing loss. The total loss function is as follows:

$$
\mathit{Loss} = -\sum \mathbb{y}_{i} \log \hat{\mathbb{y}}_{i} + \lambda \sum \theta^{2} + w \sum_{\mathit{mod} \in \{m, t, g\}} BL(\mathbb{x}^{\mathit{mod}}_{i}),
$$

where $\hat{\mathbb{y}}_{i}$ and $\mathbb{y}_{i}$ are the prediction for the $i$-th review and its corresponding ground truth, respectively. $\theta$ denotes all trainable model parameters, and $\lambda$ and $w$ are hyperparameters that balance the three terms. The balancing loss is $BL(\mathbb{x}) = CV(\sum_{i} G(\mathbb{x}_{i}))^{2}$, where $CV$ denotes the coefficient of variation and $G(\mathbb{x}_{i})$ denotes the calculated weight of each expert. Following Shazeer et al. (2017), this term encourages each expert to receive a balanced sample of reviews.
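To make the three terms concrete, here is a small pure-Python sketch of the loss. It is illustrative only: the function names are hypothetical, labels are assumed to be class indices with predictions given as probability vectors, and the coefficient of variation is assumed to use the population variance.

```python
import math


def cv_squared(values):
    """Squared coefficient of variation: population variance / mean^2."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / mean ** 2


def balancing_loss(gates):
    """gates[i][e] is the weight that G(x_i) assigns to expert e."""
    num_experts = len(gates[0])
    # Sum the gate weights over the batch to get one total per expert.
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    return cv_squared(importance)


def total_loss(y_true, y_prob, params, gates_per_modality, lam, w):
    """Cross-entropy + lam * sum(theta^2) + w * sum of balancing losses."""
    ce = -sum(math.log(p[y]) for y, p in zip(y_true, y_prob))
    l2 = sum(t * t for t in params)
    bl = sum(balancing_loss(g) for g in gates_per_modality.values())
    return ce + lam * l2 + w * bl
```

When every expert receives the same total gate weight the balancing term is zero, and it grows as the routing collapses onto a single expert.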

4 Experiment

4.1 Experiment Settings

Dataset. We evaluate our method MMoE on the LCS dataset Wang et al. (2023) and the Kaggle IMDB Spoiler dataset Misra (2019). We follow the same dataset split as Wang et al. (2023). Details of the datasets can be found in Appendix A.

Baselines. We use the same baselines as in Wang et al. (2023). Specifically, we explore three kinds of approaches: PLM (Pre-trained Language Model)-based methods, GNN (Graph Neural Network)-based methods, and task-specific methods. For PLM-based methods, we evaluate BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), BART Lewis et al. (2019), and DeBERTa He et al. (2021). For GNN-based methods, we evaluate GCN Kipf and Welling (2016), R-GCN Schlichtkrull et al. (2018), GAT Velickovic et al. (2017), and Simple-HGN Lv et al. (2021). For task-specific methods, we evaluate DNSD Chang et al. (2018), SpoilerNet Wan et al. (2019), and MVSD Wang et al. (2023). Details of the baselines can be found in Appendix D.

Implementation Details. We use PyTorch Paszke et al. (2019), PyTorch Geometric Fey and Lenssen (2019), scikit-learn Pedregosa et al. (2011), and Transformers Wolf et al. (2020) to implement MMoE. The hyperparameter settings and architecture parameters are shown in Appendix B. We conduct our experiments on a cluster with 4 Tesla V100 GPUs with 32 GB memory, 16 CPU cores, and 377 GB CPU memory.

Figure 3: MMoE performance when randomly removing edges in the graph, setting elements of text features to zero, and setting elements of meta features to zero. Panels: (a) removing rate of graph edges; (b) removing rate of text features; (c) removing rate of meta features. Performance slowly declines with the gradual ablations, indicating the robustness of our method.

4.2 Overall Performances

We evaluate our proposed MMoE and other baseline methods on the two datasets. The results presented in Table 1 demonstrate that:

  • MMoE achieves state-of-the-art on both datasets, outperforming all other methods by at least 8.41% in F1-score, 5.07% in AUC, and 2.56% in accuracy. This illustrates that MMoE is not only more accurate but also much more robust than former approaches.

  • GNN-based methods significantly outperform other types of baselines. This confirms our view that using text information alone is not enough in spoiler detection. Social network information from movies and users is also very important.

  • For task-specific baselines, SpoilerNet Wan et al. (2019) outperforms DNSD Chang et al. (2018) with user bias. MVSD Wang et al. (2023), which introduces graph neural networks to handle user interactions, undoubtedly performs best. MMoE further reinforces user bias and thus achieves much better results.

Figure 4: (a) The attention score between modalities; (b) the attention score of edges in GNN. We first investigate the contribution of information from the three views (graph, text, and meta). We then delve into the graph neural network to find out which nodes the review nodes mainly receive information from.
Figure 5: T-SNE visualization of reviews’ graph, text, and meta features. Reviews of the same expert are represented in the same color. The reviews are clearly divided into domains based on their embedding.

4.3 Robustness Study

We verify the robustness of the model by randomly perturbing the input to simulate information that is missing in real-world settings. Specifically, for graph view information, we randomly remove some of the edges in the graph; for text view and meta view information, we randomly set some of the elements to zero. The results in Figure 3 show that, with the help of information from other modalities, our model still makes the correct prediction most of the time even when some of the information is missing. This supports our view that multi-source information not only improves the prediction accuracy of the model but also enhances its robustness.
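The two perturbations can be sketched as below. This is an assumed procedure, since the text does not say whether an exact fraction is removed or each element is dropped independently; the sketch uses independent Bernoulli draws, and the function names are hypothetical.

```python
import random


def drop_edges(edges, rate, rng):
    """Keep each edge independently with probability 1 - rate."""
    return [e for e in edges if rng.random() >= rate]


def zero_elements(features, rate, rng):
    """Set each feature element to zero independently with probability rate."""
    return [[0.0 if rng.random() < rate else x for x in row] for row in features]


# Usage: perturb a toy graph and a toy feature matrix with a fixed seed.
rng = random.Random(0)
toy_edges = [(i, i + 1) for i in range(10)]
toy_feats = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
perturbed_edges = drop_edges(toy_edges, rate=0.3, rng=rng)
perturbed_feats = zero_elements(toy_feats, rate=0.3, rng=rng)
```

Sweeping `rate` from 0 to 1 and re-evaluating the trained model at each value would give curves of the kind shown in Figure 3.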

4.4 Multi-Modal Study

To further investigate the contribution of information from each modality, we calculate the attention score between features of different views. Specifically, we extract the attention scores of each layer in the final expert fusion transformer and average them across layers. By then averaging over all samples, we obtain the heat map shown in Figure 4(a). Graph view features contribute the most, with an average attention score of 0.4127. For graph view features, we expect review nodes to receive sufficient information from user nodes and movie nodes, so we also extract the average attention scores corresponding to different types of edges in each GAT layer (Figure 4(b)). "Self", "user", and "movie" represent the attention scores between review nodes and themselves, their corresponding user nodes, and their corresponding movie nodes in each layer, respectively. Users’ information is clearly the most helpful, which also demonstrates the importance and effectiveness of our user profile extraction module.
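The averaging procedure for the modality heat map can be sketched as follows. The nested-list layout and the function name are assumptions, and the per-layer matrices are assumed to be already averaged over attention heads.

```python
def mean_attention(attn):
    """attn[s][l][i][j]: attention from view i to view j for sample s, layer l.

    Averages over layers within each sample, then over samples, and
    returns a single views-by-views matrix.
    """
    n_views = len(attn[0][0])
    result = [[0.0] * n_views for _ in range(n_views)]
    for sample in attn:
        for layer in sample:
            for i in range(n_views):
                for j in range(n_views):
                    result[i][j] += layer[i][j] / (len(sample) * len(attn))
    return result
```

With three views (meta, text, graph) the result is a 3 x 3 matrix whose rows each sum to 1 whenever the input attention rows do.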

4.5 Review Domain Study

We posit that due to significant variations in review styles across different types of movies, it is essential to categorize them into distinct domains and assign them to appropriate experts using the Mixture-of-Experts (MoE) approach. To validate our hypothesis, we employ T-SNE visualization Van der Maaten and Hinton (2008) to depict the domain assignments of reviews. We extract review representations from the MoE’s output for the graph, text, and meta modalities and present them in Figure 5. The visualization clearly illustrates that reviews are distinctly segregated into different domains within each modality, which demonstrates the effectiveness of the MoE in categorizing reviews based on their representations.

Table 2: Ablation study concerning pretext task, user bias, multi-view data, MoE structure, and fusion methods.

| Category | Setting | F1 | AUC | Acc |
|---|---|---|---|---|
| Fine-tuning | w/o fine-tuning | 67.45 | 76.99 | 84.48 |
| User profile | w/o user profile | 68.82 | 77.76 | 85.24 |
| Multi-view | w/o graph view | 58.29 | 71.09 | 81.55 |
| | w/o text view | 70.69 | 79.19 | 85.76 |
| | w/o meta view | 70.00 | 78.59 | 85.64 |
| | replace GAT with R-GCN | 70.34 | 79.03 | 85.51 |
| MoE | w/o MoE | 70.99 | 79.35 | 85.96 |
| | replace MoE with MLP | 71.09 | 79.43 | 85.97 |
| | 8 experts | 70.96 | 79.40 | 85.84 |
| | 4 experts | 70.93 | 79.29 | 85.94 |
| Fusion | concatenate | 70.02 | 78.48 | 85.82 |
| | mean-pooling | 69.84 | 78.31 | 85.82 |
| | max-pooling | 70.65 | 78.99 | 85.96 |
| Ours | MMoE | 71.24 | 79.61 | 86.00 |

Table 3: Examples of the performance of two baselines and MMoE. Underlined parts indicate the plots. "Key Information" indicates the most helpful information from other sources when detecting spoilers.

| Review Text | Key Information | Label | GAT | RoBERTa | MMoE |
|---|---|---|---|---|---|
| A loser called Brian is born on the same night as Jesus of Nazareth. He lives a parallel life with Jesus of Nazareth. He joins ’People’s Front of Judea’, a Jewish revolutionary party, against Romans and is confused as a messiah by (…) | Movie Synopsis: Brian Cohen is born in a stable a few doors down from the one in which Jesus is born, (…) His desire for Judith and hatred for the Romans lead him to join the People’s Front of Judea (…) | True | False | False | True |
| zzzzz. i fell asleep toward the end of this dull, lackluster Hollywood product, so i don’t know if it’s fair to review it but i don’t feel like going back and watching the ending because i really don’t care what happens to (…) | User Profile: Historical reviews: Reviews 1: Warning: Spoilers; Reviews 2: Warning: Spoilers | True | False | False | True |
| With all due respect to the original Star Wars (which is the greatest movie of all time), this is a spectacular movie, that long after you see it, you still find yourself wondering about details. (…) | | False | True | True | False |
| Hacksaw Ridge is an unflinching, violent assault on your senses with action sequences and people being blown apart, shot in the head, losing limbs etc etc etc which reminded me of the brutal opening scenes in (…) | | False | True | True | False |

4.6 Ablation Study

In order to investigate the effects of different parts of our model on performance, we conduct a series of ablation experiments on the Kaggle dataset. We report the binary F1-Score, AUC, and accuracy of the ablation study in Table 2.
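For reference, the three reported metrics can be computed as in the following pure-Python sketch. The function names are illustrative, and the paper does not state which implementation it used; scikit-learn's `f1_score`, `roc_auc_score`, and `accuracy_score` compute the same quantities.

```python
def binary_f1(y_true, y_pred):
    """F1 of the positive (spoiler) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted label matches the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Binary F1 looks only at the spoiler class, so it is more sensitive than accuracy when spoilers are the minority class, which is consistent with the larger F1 spreads in Table 2.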

Fine-tuning Strategy Study. We remove the fine-tuning step. Model performance drops significantly across all metrics, which indicates that the encoding quality of the language model is very important for spoiler detection.

User Profile Study. We remove the additional user profile in our model to examine its contribution. The results show that all aspects of the model performance are reduced after removing the user profile, especially F1 and AUC.

Multi-view Study. We examine the contribution of information from different perspectives to the final result by removing information from each modality. The graph view information is the most important, which further demonstrates the significance of external information in spoiler detection. We also replace the GAT layer with other layers to observe the effects of different graph convolution operators. Interestingly, R-GCN, which is the best performer in GNN-based baselines, underperforms GAT when applied in our model. In addition, the removal of meta or text view information also has a considerable impact on the final performance, indicating the importance of the multi-view framework.

MoE Study. To investigate the contribution of MoE, we analyze how performance changes when removing the entire MoE layer, replacing MoE with an MLP, and changing the number of experts. The results show that the MoE layer enables the model to make more accurate and robust predictions, which suggests that dividing reviews into different domains is helpful. We further change the number of experts to explore its impact. We use 2 experts as default, then increase the number of experts to 4 and 8. The model performance decreases in both settings, indicating that the number of experts needs to be appropriate.

Fusion Strategy Study. Finally, we study the effect of the information fusion method on performance. The results show that our self-attention-based transformer fusion method performs best in all aspects. In addition, the performance of the max-pooling method is significantly better than that of concatenation and mean-pooling.

4.7 Case Study

We conduct qualitative analysis to explore the effect of multiple source information. We select some representative cases as shown in Table 3. In the first case, the underlined part reveals the main plot of the movie. However, baseline models mainly focus on the review content itself and don’t realize that it contains spoilers. With the help of information from the movie synopsis, MMoE is able to discriminate that the review is a spoiler. As for the second case, it is actually hard to identify whether the review contains spoilers. Yet through the user profile extraction module we designed, we find that the user often posts spoiler reviews. Therefore, a positive label is assigned to the sample.

5 Conclusion

We propose MMoE, a state-of-the-art spoiler detection framework which jointly leverages features from multiple modalities and adopts a domain-aware Mixture-of-Experts to handle genre-specific spoiler languages. Extensive experiments illustrate that MMoE achieves the best result among existing methods, highlighting the advantages of multi-modal information, domain-aware MoE, and user profile modeling.

Limitations and Future Work

We have considered using large language models (LLMs) to profile users based on their historical comments by generating more interpretive text features of users. However, due to the large number of users in the dataset, both calling an LLM through an API and running an open-source LLM locally take a long time, which is one of the most difficult problems. In addition, the user descriptions generated by the LLM are not necessarily appropriate for our task. Nevertheless, we still believe that there is considerable potential for using LLMs for data augmentation. We can also look beyond user descriptions: many movies lack a plot synopsis, and using LLMs to generate synopses for these movies is also promising. The application of LLMs may be a key factor in subsequent breakthroughs.

Ethical Statements

Although MMoE has achieved excellent results, it still needs to be carefully applied in practice. Firstly, there is still room for improvement in the performance of MMoE. We think it’s better suited as a pre-screening tool that needs to be combined with human experts to make final decisions. Secondly, the language model encodes social bias and offensive language in the dataset Li et al. (2022); Nadeem et al. (2020). In addition, the user profile extraction module we introduced may exacerbate this bias. We look forward to further work to detect and mitigate social bias in the spoiler detection task.

References

  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
  • Boyd-Graber et al. (2013) Jordan Boyd-Graber, Kimberly Glasgow, and Jackie Sauter Zajac. 2013. Spoiler alert: Machine learning approaches to detect social media posts with revelatory information. Proceedings of the American Society for Information Science and Technology, 50(1):1–9.
  • Cao et al. (2019) Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The world wide web conference, pages 151–161.
  • Chang et al. (2018) Buru Chang, Hyunjae Kim, Raehyun Kim, Deahan Kim, and Jaewoo Kang. 2018. A deep neural spoiler detection model using a genre-aware attention mechanism. In Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part I 22, pages 183–195. Springer.
  • Chang et al. (2021) Buru Chang, Inggeol Lee, Hyunjae Kim, and Jaewoo Kang. 2021. "killing me" is not a spoiler: Spoiler detection model using graph neural networks with dependency relation-aware attention mechanism. arXiv preprint arXiv:2101.05972.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20:273–297.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270.
  • Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428.
  • Guo and Ramakrishnan (2010) Sheng Guo and Naren Ramakrishnan. 2010. Finding the storyteller: automatic spoiler tagging using linguistic cues. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 412–420.
  • He et al. (2021) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jacobs et al. (1991) Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79–87.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Li et al. (2022) Yizhi Li, Ge Zhang, Bohao Yang, Chenghua Lin, Shi Wang, Anton Ragni, and Jie Fu. 2022. Herb: Measuring hierarchical regional bias in pre-trained language models. arXiv preprint arXiv:2211.02882.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2023) Yuhan Liu, Zhaoxuan Tan, Heng Wang, Shangbin Feng, Qinghua Zheng, and Minnan Luo. 2023. Botmoe: Twitter bot detection with community-aware mixtures of modal-specific experts. arXiv preprint arXiv:2304.06280.
  • Loewenstein (1994) George Loewenstein. 1994. The psychology of curiosity: A review and reinterpretation. Psychological bulletin, 116(1):75.
  • Lv et al. (2021) Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. 2021. Are we really making much progress? revisiting, benchmarking and refining heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 1150–1160.
  • Misra (2019) Rishabh Misra. 2019. Imdb spoiler dataset.
  • Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15, pages 593–607. Springer.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
  • Wan et al. (2019) Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian McAuley. 2019. Fine-grained spoiler detection from large-scale review corpora. arXiv preprint arXiv:1905.13416.
  • Wang et al. (2023) Heng Wang, Wenqian Zhang, Yuyang Bai, Zhaoxuan Tan, Shangbin Feng, Qinghua Zheng, and Minnan Luo. 2023. Detecting spoilers in movie reviews with external movie knowledge and user networks. arXiv preprint arXiv:2304.11411.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489.

Appendix A Data Details

Figure 6: (a) The spoiler proportion of reviews with different ratings; (b) The spoiler proportion of reviews posted in different years; (c) The spoiler proportion of reviews in different lengths.
Figure 7: The proportion of spoiler reviews per user in 2 datasets, LCS and Kaggle. Spoiler review percentage intervals are divided every 10 percent.

Table 4 and Table 5 show the metadata details of LCS and Kaggle datasets, respectively.

We further investigate the correlation between spoilers and review ratings, publication time, and length, as depicted in Figure 6. Notable patterns emerge from our investigation:

  • Spoiler reviews are often poorly rated. Highly rated reviews often reveal little about the plot.

  • The proportion of spoiler reviews is low in the early and most recent years. A large number of reviews from around 2009-2016 contain spoilers.

  • Longer reviews are more likely to include spoilers: the spoiler proportion increases with review length.

We also show the proportion of spoiler reviews per user in the two datasets in Figure 7. Most users are concentrated at both ends: they either barely publish spoiler reviews or publish them frequently, and thus show a clear tendency.
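The per-user statistic behind Figure 7 can be sketched as follows. The input layout (pairs of user id and spoiler flag) and the function name are assumptions made for illustration; the 10 percent interval width follows the figure caption.

```python
from collections import defaultdict


def spoiler_ratio_histogram(reviews, bins=10):
    """reviews: iterable of (user_id, is_spoiler) pairs.

    Returns the number of users whose spoiler-review ratio falls into
    each of `bins` equal-width intervals over [0, 1].
    """
    totals = defaultdict(int)
    spoilers = defaultdict(int)
    for user, is_spoiler in reviews:
        totals[user] += 1
        spoilers[user] += int(is_spoiler)
    counts = [0] * bins
    for user, n in totals.items():
        ratio = spoilers[user] / n
        # A ratio of exactly 1.0 is placed in the last interval.
        counts[min(int(ratio * bins), bins - 1)] += 1
    return counts
```

A histogram with most of its mass in the first and last intervals corresponds to the "both ends" pattern described above.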

Table 4: Details of metadata contained in LCS.

| Entity Name | Metadata |
|---|---|
| User | badge count, review count, description length |
| Movie | year, isAdult, runtime, rating, vote count, synopsis length |
| Review | time, helpful vote count, total vote count, point, content length |

Table 5: Details of metadata contained in Kaggle.

| Entity Name | Metadata |
|---|---|
| Movie | year, runtime, rating, synopsis length |
| Review | time, point, content length |

Table 6: Hyperparameter settings of MMoE.

| Hyperparameter | Language Model | User Transformer | Backbone Network |
|---|---|---|---|
| optimizer | AdamW | AdamW | AdamW |
| learning rate | 1e-5 | 1e-5 | 1e-4 |
| lr scheduler | WarmUpLinear | Exponential | Exponential |
| warm up/gamma | 0.1 | 0.9 | 0.95 |
| weight decay | 1e-3 | 1e-5 | 1e-4 |
| epochs | 1 | 20 | 60 |
| dropout | 0.1 | 0.1 | 0.2 |
| w | - | - | 1e-2 |

Table 7: Model architecture parameters of MMoE on LCS dataset.

| Parameters | Value |
|---|---|
| language model MLP hidden dim | 3072 |
| language model MLP out dim | 768 |
| User Transformer dim | 768 |
| User Transformer feedforward dim | 3072 |
| User Transformer number of heads | 12 |
| User Transformer layers | 12 |
| User Transformer max length | 16 |
| meta dim | 6 |
| meta MLP hidden dim | 768 |
| meta MLP out dim | 256 |
| text projection dim | 256 |
| GNN input dim | 774 |
| GNN hidden dim | 512 |
| GNN out dim | 256 |
| GNN layers | 2 |
| number of experts | 2 |
| k | 1 |
| MoE MLP hidden dim | 1024 |
| MoE MLP out dim | 256 |
| Fusion Transformer dim | 256 |
| Fusion Transformer feedforward dim | 1024 |
| Fusion Transformer number of heads | 4 |
| Fusion Transformer layers | 4 |

Table 8: Model architecture parameters of MMoE on Kaggle dataset.

| Parameters | Value |
|---|---|
| language model MLP hidden dim | 3072 |
| language model MLP out dim | 768 |
| User Transformer dim | 768 |
| User Transformer feedforward dim | 3072 |
| User Transformer number of heads | 12 |
| User Transformer layers | 12 |
| User Transformer max length | 4 |
| meta dim | 4 |
| meta MLP hidden dim | 768 |
| meta MLP out dim | 256 |
| text projection dim | 256 |
| GNN input dim | 772 |
| GNN hidden dim | 512 |
| GNN out dim | 256 |
| GNN layers | 2 |
| number of experts | 4 |
| k | 1 |
| MoE MLP hidden dim | 1024 |
| MoE MLP out dim | 256 |
| Fusion Transformer dim | 256 |
| Fusion Transformer feedforward dim | 1024 |
| Fusion Transformer number of heads | 4 |
| Fusion Transformer layers | 4 |

Appendix B Hyperparameters

Table 6 lists the hyperparameter settings used in the experiments. Table 7 and Table 8 give the detailed model architecture parameters for easy reproduction.

Appendix C Experiment Details

  • We use NeighborLoader from the PyTorch Geometric library to sample review nodes in the graph. We set the maximum number of neighbors to 200 and sample the 2-hop subgraph.

  • We pad the metadata to the same dimension with -1.

  • The Kaggle dataset doesn’t provide the description of users. This situation further highlights the value of our user profile extraction module because it extracts user profiles from reviews. For GNN-based methods, we use zero vectors as the user’s initial embedding. For our method MMoE, we set the first token of the sequence as learnable parameters, which is similar to the CLS token of BERT Devlin et al. (2018).
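The metadata padding mentioned above can be sketched as follows. Right-padding is an assumption (the text only states that vectors are padded to the same dimension with -1), and the function name and example values are hypothetical.

```python
def pad_metadata(values, dim, pad_value=-1.0):
    """Right-pad a metadata vector to a fixed length with pad_value."""
    if len(values) > dim:
        raise ValueError("metadata vector longer than target dimension")
    return list(values) + [pad_value] * (dim - len(values))


# Usage: bring a 3-field vector up to a meta dim of 6 (the LCS value in Table 7).
user_meta = pad_metadata([12.0, 340.0, 57.0], dim=6)
```

Padding every entity type to one width lets user, movie, and review nodes share a single metadata encoder input size.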

Appendix D Baseline Details

We compare MMoE with PLM-based methods, GNN-based methods, and task-specific methods to ensure a holistic evaluation. For pre-trained language models, we pass the review text to the model, average all tokens, and adopt two linear projection layers to classify. For GNN-based methods, we pass the review text to RoBERTa and average all tokens to get the initial node feature. We briefly describe each baseline method below.

  • BERT Devlin et al. (2018) is a pre-trained language model trained on a large natural language corpus with masked language modeling and next sentence prediction tasks.

  • RoBERTa Liu et al. (2019) improves on BERT by removing the next sentence prediction task and improving the masking strategy.

  • BART Lewis et al. (2019) is a pre-trained language model that improves upon traditional autoregressive models by incorporating bidirectional encoding and denoising objectives.

  • DeBERTa He et al. (2021) is an advanced language model that enhances BERT by introducing disentangled attention and enhanced mask decoder.

  • GCN Kipf and Welling (2016) is a basic graph neural network that effectively captures and propagates information across graph-structured data by performing convolutions on the graph’s nodes and their neighboring nodes.

  • R-GCN Schlichtkrull et al. (2018) is an extension of GCN that specifically handles multi-relational graphs by incorporating relation-specific weights.

  • GAT Velickovic et al. (2017) is a graph neural network that utilizes attention mechanisms to assign importance weights to neighboring nodes dynamically.

  • Simple-HGN Lv et al. (2021) is a graph neural network model designed for heterogeneous graphs, which effectively integrates multiple types of nodes and edges by employing a shared embedding space and adaptive aggregation strategies.

  • DNSD Chang et al. (2018) is a spoiler detection method using a CNN-based genre-aware attention mechanism.

  • SpoilerNet Wan et al. (2019) incorporates the hierarchical attention network (HAN) Yang et al. (2016) and the gated recurrent unit (GRU) Cho et al. (2014) with item and user bias terms for spoiler detection.

  • MVSD Wang et al. (2023) utilizes external movie knowledge and user networks to detect spoilers.