Anime Popularity Prediction Before Huge Investments: A Multimodal Approach Using Deep Learning

Jesús Armenta-Segura, Grigori Sidorov
Instituto Politécnico Nacional (IPN),
Centro de Investigación en Computación (CIC),
Mexico City, Mexico
{jarmentas2022, sidorov}@cic.ipn.mx

Abstract

In the Japanese anime industry, predicting whether an upcoming product will be popular is crucial. This paper presents a dataset and methods for predicting anime popularity using a multimodal text-image dataset constructed exclusively from freely available internet sources. The dataset was built following rigorous standards based on real-life investment experiences. A deep neural network architecture leveraging GPT-2 and ResNet-50 to embed the data was employed to investigate the correlation between the multimodal text-image input and a popularity score, revealing relevant strengths and weaknesses in the dataset. To measure the accuracy of the model, mean squared error (MSE) was used, obtaining a best result of 0.011 when considering all inputs and the full version of the deep neural network, compared to the benchmark MSE of 0.412 obtained with traditional TF-IDF and PILtotensor vectorizations. This is the first proposal to address this task with multimodal datasets, revealing the substantial benefit of incorporating image information, even when a relatively small model (ResNet-50) was used to embed it.

Keywords: Anime · Entertainment · Regression · Multimodal · Computer Vision · Natural Language Processing · Popularity Prediction

Introduction

One of the most crucial aspects of the Japanese animation industry, or anime industry, is the release of successful and profitable products. Several techniques can be employed to achieve this, such as cultivating a fan base, where companies foment the popularity of the product among a certain demographic through marketing campaigns. This practice ensures a base of devoted customers, significantly reducing the sales risk. Consequently, successful popularity prediction systems can help investors make better decisions and can help animation houses avoid financial catastrophes while creating relevant, profitable and successful franchises.

When addressing this task, it becomes clear that there are no straightforward methods to predict the popularity of an upcoming anime. For instance, Toei Animation's 1994 film Dragon Ball Z: Bio-Broly lost more than ¥9 billion [1, 2], despite being part of the globally acclaimed Dragon Ball Z franchise and featuring Broly, one of its most popular antagonists. Conversely, MAPPA Studios' 2018 anime Kimetsu No Yaiba, based on a relatively unknown manga not among the top 50 best-selling mangas in the Oricon 2018 ranking [3], achieved unprecedented success with its 2020 film Kimetsu no Yaiba: Mugen Train, setting a worldwide Guinness Record at the box office [4], even amidst the challenges posed by the Covid-19 pandemic. These two examples show that the phenomenon is not yet understood well enough to design deterministic approaches for efficient and accurate popularity prediction.

Another relevant constraint is the limited set of features available to a system that must predict popularity before huge monetary investments are made. To begin with, animation is expensive by nature: even a small project such as the pilot of the indie animation HEATHENS is projected to cost $39,456 USD (or $60,000 AUD) according to its crowdfunding webpage [5]. Consequently, investors sometimes must make decisions even before a script is written, as is common in other entertainment industries like Hollywood, where decisions are based on a concise four-line description of the future movie plot [6]. In the case of animation, these four-liners might be accompanied by brief sketches of the possible main characters along with a short description of them [7]. As a consequence, any system developed to predict popularity before huge investments must rely only on this limited set of features. An example of such sketches is depicted in Figure 1.

Figure 1: Visual representations of the anime character Faye Valentine, from Cowboy Bebop, during early stages of development. a) Her portrait in MyAnimeList. b) Her character sketch designs [8] with several poses, angles and facial emotions. Animators require more than this level of detail for further reference [7], but such material is not accessible during early stages of development.

In order to address all these quirks and constraints around the problem, this work proposes a statistical analysis based on artificial intelligence (AI) methods, specifically machine and deep learning models. In recent times, AI has become highly relevant in tackling the problem of discovering how a complex set of variables is related to a given dependent variable [9, 10, 11, 12, 13, 14, 15]. With respect to the popularity prediction task, one of the closest related AI problems is the development of recommender systems, which consist of methods that suggest the best possible products to potential users or customers [16]. When such systems are profile-based, they can determine the probability of a product being a suitable option for a consumer given their demographics or, in other words, how successful the product might be for that particular user or demographic. Although tackling the popularity prediction task through (adapted) recommender systems may be tempting, they focus on a large set of demographics, which leads to significant sparsity in their datasets [17]. Moreover, these datasets include features that are only accessible once the product is released, such as Producers, Duration, or Casting. Such features have shown a strong correlation [14, 18], but are only available in the context of a recommender system for a streaming service. The popularity prediction task therefore needs a slightly different approach from the beginning to the end of the problem statement.

As a consequence, the first challenge for the popularity prediction task is the design of a suitable dataset. Across the internet, several anime databases containing equivalent information are available: plot synopses as free substitutes for the four-liners, and in-anime snapshots with fan-made descriptions as free substitutes for the main character sketches. The most relevant of these databases are AnimeNewsNetwork's encyclopedia [19], which has more than 28,000 entries but detailed user rating statistics for only a small subset of them; AniList [20], with more than 15,000 samples and detailed statistics about user ratings but strict policies against web scraping; and MyAnimeList [21], the oldest and most active platform (according to www.similarweb.com), encompassing more than 25,000 entries with detailed information about user ratings for a considerable subset of them. Moreover, MyAnimeList (MAL) does not have explicit policies against web crawling, at least as of January 2024. Hence, MAL is the prudent choice for gathering freely available data.

Once the dataset is obtained, the next challenge is the selection of a suitable AI model for the system. There have been several previous proposals in related entertainment industries, such as movies [22] and books [10], and even in the animation industry itself [23]. All these works have in common the use of relatively small models of low-to-medium complexity, and they all treat the task as a classification problem, with artificially crafted dependent variables that simplify it. From these experiences it is possible to distill the need for large and complex models, as shown in [23], where the authors employed traditional classifiers over anime plot summaries and demonstrated the huge underlying complexity hidden in this kind of data, or as implicitly shown in [22], where the authors employed a two-branched neural network with ELMo and a BiLSTM or CNN to predict binary movie success from plot summaries, and then improved their results by leveraging BERT-based models in [24].

In alignment with this distilled knowledge, this work opted for a deep neural network approach, leveraging GPT-2 [25] for text analysis and ResNet-50 [26] for image processing. Additionally, in order to obtain the most realistic and natural results possible, the artificially crafted classifications, which are useful for baseline purposes but do not fit reality well, are set aside in favor of a regression perspective. Furthermore, this work also presents a benchmark for this model using traditional vectorizations, such as TF-IDF for texts and PILtotensor for images.

In the remaining sections of the paper, all described procedures are examined in detail. The construction of the corpus is explained, alongside the experimental setup of a three-input deep neural network used to benchmark it. To evaluate the model's performance, mean squared error (MSE) is employed to measure the prediction error. Additionally, Pearson, Spearman, and Kendall's Tau correlation coefficients are used to study the impact of the processed features on the score. The best results were obtained when considering all inputs, yielding an MSE of 0.011, a Spearman correlation coefficient of 0.431, a Pearson correlation coefficient of 0.436, and a Kendall's Tau correlation coefficient of 0.297.

Background

The Anime Corpus

For each anime in the MAL database, a Python script scraped its title, plot summary, weighted average score and all of its main character names, descriptions and portraits, as shown in Figure 1a). All samples without this information were discarded. The scraping process started on December 28, 2023, at 0:03 UTC, and finished on January 3, 2024, at 14:31 UTC. The script was implemented using the BeautifulSoup4 Python library [27]. The final result was 11,873 animes with 21,329 main characters.

Once the data was obtained, a cleaning process was performed. First, all characters with no useful description, such as the text "No description available", were removed, as well as characters with no portrait. Then, all animes with no score, synopsis, title or associated main characters were also removed, including samples with plot summaries of fewer than 20 words. This process reduced the scraped data to 7,784 animes and 14,682 characters. The characteristics of the final dataset and its statistics are presented in Table 1 and Figure 2 with respect to the synopses, and in Table 2 and Figure 2 with respect to the main characters.

Table 1: Statistics for the corpus. Wordcount refers to synopsis.

| Score | Samples | Max. Words | Min. Words | Avg. Words |
|---|---|---|---|---|
| 1-2 | 2 | 68 | 55 | 61.50 |
| 2-3 | 6 | 228 | 25 | 112.50 |
| 3-4 | 11 | 166 | 24 | 80.72 |
| 4-5 | 80 | 244 | 25 | 78.88 |
| 5-6 | 925 | 329 | 24 | 83.42 |
| 6-7 | 3,131 | 397 | 24 | 97.35 |
| 7-8 | 3,021 | 581 | 24 | 121.30 |
| 8-9 | 596 | 340 | 29 | 149.68 |
| 9-10 | 12 | 189 | 133 | 157.25 |
| Total | 7,784 | 581 | 24 | 104.73 |

Table 2: Statistics for the characters.

| Total Characters | Max. Words | Min. Words | Avg. Words |
|---|---|---|---|
| 14,682 | 3,551 | 4 | 121.34 |

Figure 2: Mean synopsis wordcount (left) and mean character frequency (right) across the dataset. In both plots the X-axis represents the floor score. The Y-axis is the mean number of words in the synopsis per score (left) and the mean number of main characters per score (right).

The MAL Weighted Average Score as Golden Label

The most notable MAL statistic for measuring popularity is the weighted average score, calculated from all user ratings of each anime. The score [28] works in a straightforward way: a user of the database can rate an anime on a 0-to-10 scale, based on their particular opinion about it. To avoid bots, the system requires the user to watch at least one fifth of the anime before scoring it, which can be done by manually marking the episodes as already seen. From this, it is possible to propose a naive measure of liking as the mean of all ratings:

\[ S = \frac{\text{Sum of all user scores of the anime}}{v}. \tag{1} \]

Here $v$ is the total number of voters. However, this measure can be biased, since it does not consider the statistical relevance of the population who scored the anime. An example of a biased score is a hypothetical upcoming anime that receives all of its first ratings from its producers, who might be inclined to give generous scores regardless of the true reach of their product. To tackle this bias, MAL weights $S$ in terms of how many people have watched and scored the anime. MAL defines a statistical bound $m = 50$ and the weight of $S$ as:

\[ s = \frac{v}{v+m}. \tag{2} \]

Hence, as more people score the anime, $s$ tends to 1 and the relevance of $S$ increases. When $v = 1$, the weight reaches its minimum nonzero value $1/(m+1)$, which can also be read as the statistical importance of a single rating.

MAL also weights the general importance of its whole community by calculating, twice per day, the following default score, based on all ratings across the database:

\[ C = \frac{\text{Sum of all valid scores in the database}}{\text{Total amount of valid scores in the database}}. \tag{3} \]

The value of $C$ when the data scraping finished (January 3, 2024) was 6.605. This default score represents a very coarse rating for an upcoming anime that nobody has watched yet or whose fandom is statistically insignificant. Its weight is defined as follows:

\[ c = \frac{m}{v+m}. \tag{4} \]

As more people score the anime, $C$ loses relevance in the score. In other words, as $s$, the weight of the scores given by the users, grows, $c$ tends to 0.

Finally, the weighted average score of an anime is defined as follows:

\[ W = \left(\frac{v}{v+m}\right) S + \left(\frac{m}{v+m}\right) C. \tag{5} \]

This construction makes the MAL score a robust metric for measuring popularity. Since MAL gathers members from all around the world and is the most visited anime portal on the internet, any output generated by a method trained with this dataset should be interpreted as popularity across (a huge part of) the internet. However, since this score does not consider the demographic statistics of the voters, it may not be a suitable measure for a profile-based recommender system.
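As an illustration of Equations (1) to (5), the following minimal sketch computes the weighted average score for a list of user ratings. The function name and the two toy rating lists are hypothetical and only serve the example; the bound m = 50 and the default score C = 6.605 are the values reported above.

```python
def mal_weighted_score(user_scores, community_mean, m=50):
    """Weighted average W = v/(v+m) * S + m/(v+m) * C, following Eq. (5)."""
    v = len(user_scores)
    if v == 0:
        # With no voters the weight of S is zero, so W reduces to C.
        return community_mean
    s_mean = sum(user_scores) / v      # S, Eq. (1)
    weight_s = v / (v + m)             # s, Eq. (2)
    weight_c = m / (v + m)             # c, Eq. (4)
    return weight_s * s_mean + weight_c * community_mean


C = 6.605  # default score on January 3, 2024, as reported in the text

# Toy inputs (not real MAL data): five perfect ratings vs. five thousand.
few_voters = mal_weighted_score([10] * 5, C)
many_voters = mal_weighted_score([10] * 5000, C)
print(round(few_voters, 3), round(many_voters, 3))
```

With only five voters the result stays close to C, while with five thousand voters it approaches the plain mean S, which is the behaviour the weighting is meant to produce.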

The Deep Neural Network

As stated in the introduction, a three-input deep neural network was designed to solve the regression task of predicting the MAL Weighted Average Score given the synopsis and the main character portraits and descriptions. Table 3 shows an example of the full input for one anime. For each input, a sequential set of layers was employed:

Synopsis (with wordcount):
Robin Sena is a powerful craft user drafted into the STNJ, a group of specialized hunters that fight deadly beings known as Witches. Though her fire power is great, she's got a lot to learn about her powers and working with her cool and aloof partner, Amon. But the truth about the Witches and herself will leave Robin on an entirely new path that she never expected! (66 words)

Main Characters:

| Name (with MAL ID) | Description (with wordcount) |
|---|---|
| Robin Sena (299) | Robin Sena is a soft-spoken 15-year-old girl with ... (144 words) |
| Amon (300) | Amon is a Hunter and is also Robin's partner. (412 words) |
| Michael Lee (301) | Michael is a hacker and the technical support ... (133 words) |
| Haruto Sakaki (302) | Haruto Sakaki is an 18-year-old Hunter working with the ... (167 words) |
| Miho Karasuma (303) | The second in command. Miho is a 19-year-old hunter ... (73 words) |
| Yurika Doujima (304) | Doujima is portrayed as carefree, lazy, vain, and ... (205 words) |

Total wordcount: 1,134

Concatenated Portraits: [image]

Table 3: Example of the input for the anime Witch Hunter Robin, with ID 7 and score 7.25.

  • Synopsis: The GPT-2 pretrained model. Output shape: 768.

  • Main Character descriptions: The GPT-2 pretrained model. Output shape: 768.

  • Main Character portraits: The ResNet-50 pretrained model, flattened at the end. Output shape: 7 × 7 (49 values once flattened, as used in Table 4).

Once embedded with their corresponding sequential layers, the Main Character inputs were concatenated and passed to a Multilayer Perceptron (MLP) whose specifications are given in Table 4 and Figure 5. This MLP takes the concatenation of both Main Character inputs and processes it into a unified 768-dimensional embedding.

| Layer Name | Type | Input Shape | Output Shape | Act. Funct. | Connects with |
|---|---|---|---|---|---|
| First | Dropout (0.1) | 768 + 49 | 768 + 49 | | Second |
| Second | Linear | 768 + 49 | 768 | TanH | Third |
| Third | Dropout (0.1) | 768 | 768 | | Fourth |
| Fourth | Linear | 768 | 768 | TanH | (output layer) |

Table 4: Specifications of the MLP for Main Characters embeddings.

Finally, the main character output is concatenated with the synopsis GPT-2 embeddings and passed through a larger MLP (specifications in Table 6), designed to gradually reduce the dimension until it becomes a single value suitable for regression.
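The dimensions of the Main Character MLP can be checked with a few lines of plain Python. The sketch below transcribes the rows of Table 4, verifies that each layer's input width matches the previous layer's output width, and counts the trainable parameters of the linear layers (weights plus biases). The helper names are hypothetical, and the count assumes standard fully connected layers with a bias term, which the paper does not state explicitly.

```python
# (layer name, layer type, input width, output width), transcribed from Table 4.
CHARACTER_MLP = [
    ("First", "Dropout", 768 + 49, 768 + 49),
    ("Second", "Linear", 768 + 49, 768),
    ("Third", "Dropout", 768, 768),
    ("Fourth", "Linear", 768, 768),
]


def check_chain(layers):
    """Return (input width, output width) after checking consecutive layers fit."""
    for previous, current in zip(layers, layers[1:]):
        if previous[3] != current[2]:
            raise ValueError(f"{previous[0]} -> {current[0]}: width mismatch")
    return layers[0][2], layers[-1][3]


def count_parameters(layers):
    """Weights plus biases of every Linear layer (assumes bias=True)."""
    return sum(i * o + o for _, kind, i, o in layers if kind == "Linear")


in_width, out_width = check_chain(CHARACTER_MLP)
n_params = count_parameters(CHARACTER_MLP)

# Width of the vector passed to the larger MLP: synopsis (768) + characters (768).
fused_width = 768 + out_width
print(in_width, out_width, n_params, fused_width)
```

The fused width of 1,536 agrees with the 768 + 768 input of the first layer of the larger MLP.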

Benchmark methods

In order to evaluate whether the results from the deep neural network are good, the dataset was also benchmarked with traditional machine learning methods, as in [23]. For this purpose, all texts were vectorized with the Term Frequency-Inverse Document Frequency (TF-IDF) measure using the Scikit-Learn implementation [29], while images were converted to tensors with the PILtotensor method from the Python Imaging Library [20]. The resulting outputs were truncated to size 750 and then concatenated into a 2250-dimensional tensor, which was then fed to a simple MLP described in Table 5.

| Layer Name | Type | Input Shape | Output Shape | Act. Funct. | Connects with |
|---|---|---|---|---|---|
| First | Linear | 750 + 750 + 750 | 1000 | TanH | Second |
| Second | Linear | 1000 | 500 | TanH | Third |
| Third | Linear | 500 | 250 | TanH | Fourth |
| Fourth | Linear | 250 | 100 | TanH | LAST |
| LAST | Linear | 100 | 1 | SoftMax | (logits) |

Table 5: Specifications of the MLP for the traditional methods.
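To make the text side of the benchmark concrete, the following is a minimal, standard-library sketch of TF-IDF vectorization followed by truncation to a fixed size. It is not the Scikit-Learn implementation used in the paper: the whitespace tokenizer, the smoothed inverse document frequency, the L2 normalization and the zero-padding of short vectors are all assumptions made for the example, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents, size):
    """TF-IDF rows with a smoothed IDF, L2-normalized, truncated or padded to `size`."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        row = [x / norm for x in row]
        # Keep the first `size` entries; pad with zeros if the vocabulary is smaller.
        vectors.append((row + [0.0] * size)[:size])
    return vectors


# Toy documents, not taken from the corpus.
docs = ["witch hunter robin", "robin hunts witches"]
vectors = tfidf_vectors(docs, size=4)
padded = tfidf_vectors(["a b"], size=5)
print(len(vectors), len(vectors[0]))
```

In the paper's setting, the three truncated vectors (size 750 each) would then be concatenated into the 2250-dimensional input of the MLP in Table 5.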

Experiments

In order to perform the experiments, the dataset was split into training and test sets in a proportion of roughly 81:19 (81.5% and 18.5% respectively). The reason behind this specific ratio is that several animes can share main characters. For instance, Table 7 depicts seven out of the fifty-four animes in which the main character Monkey D. Luffy (ID 40) appears, along with the appearances of four of his nakamas (companions): Roronoa Zoro (ID 62), Nami (ID 723), Usopp (ID 724) and Vinsmoke Sanji (ID 305). As a consequence, it is important to ensure that all animes with shared main characters belong to the same set, so that all test samples correspond to totally unseen data.

| Layer Name | Type | Input Shape | Output Shape | Act. Funct. | Connects with |
|---|---|---|---|---|---|
| First | Dropout (0.1) | 768 + 768 | 768 + 768 | | Second |
| Second | Linear | 768 + 768 | 768 | TanH | Third |
| Third | Linear | 768 | 384 | TanH | Fourth |
| Fourth | Linear | 384 | 192 | TanH | Fifth |
| Fifth | Linear | 192 | 96 | TanH | Sixth |
| Sixth | Linear | 96 | 48 | ReLU | Seventh |
| Seventh | Linear | 48 | 24 | ReLU | Eighth |
| Eighth | Linear | 24 | 12 | ReLU | Ninth |
| Ninth | Linear | 12 | 6 | ReLU | Tenth |
| Tenth | Linear | 6 | 3 | ReLU | LAST |
| LAST | Linear | 3 | 1 | SoftMax | (logits) |

Table 6: Specifications of the MLP for classification.

| Anime (with ID) | Luffy | Zoro | Nami | Usopp | Sanji |
|---|---|---|---|---|---|
| One Piece (21) | X | X | X | X | X |
| One Piece Film: Red (50410) | X | X | X | X | X |
| One Piece Movie 14: Stampede (38234) | X | X | X | X | X |
| One Piece: Taose! Kaizoku Ganzack (466) | X | X | X | - | - |
| One Piece 3D2Y: … (25161) | X | - | - | - | - |
| One Piece: Romance Dawn Story (5252) | X | - | - | - | - |
| One Piece: Cry Heart (22661) | X | - | - | - | - |

Table 7: An example of main characters who appear, as main characters, across several animes. The table shows seven animes (out of 54) in which at least one crewmate of Monkey D. Luffy (joined during the East Blue saga) appears.

Train-Test splitting

In order to make the custom split, all animes were grouped into clusters according to their shared main characters: if two animes shared a character, they were placed in the same cluster. It is worth noting that a cluster may contain two animes with no shared characters, as long as a chain of animes that share characters connects them. For that reason, the following recursive algorithm was employed to generate the clusters, obtaining 4,089 clusters (out of 7,784 samples):

  • For each anime, get all the other animes that share a main character with it. Assign a number (cluster name) to all of them.

  • Make a second pass across the whole dataset. This time, if two animes share a cluster name, assign a new cluster name to them.

  • Repeat this procedure until every anime is associated with a single cluster name.
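The clusters described above are the connected components of a graph whose nodes are animes and whose edges join animes that share a main character. The sketch below computes them with a union-find (disjoint-set) structure, which is a standard alternative to the repeated relabelling passes and not necessarily the authors' implementation. The function name and the input format (a dictionary from anime ID to character IDs) are assumptions; the example IDs are taken from Tables 3 and 7, with abbreviated character lists.

```python
def cluster_by_shared_characters(anime_characters):
    """Group animes that are linked, directly or through a chain, by shared characters."""
    parent = {anime: anime for anime in anime_characters}

    def find(anime):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[anime] != anime:
            parent[anime] = parent[parent[anime]]
            anime = parent[anime]
        return anime

    first_anime_with = {}
    for anime, characters in anime_characters.items():
        for character in characters:
            if character in first_anime_with:
                # Same character seen before: merge the two clusters.
                parent[find(anime)] = find(first_anime_with[character])
            else:
                first_anime_with[character] = anime

    clusters = {}
    for anime in anime_characters:
        clusters.setdefault(find(anime), []).append(anime)
    return list(clusters.values())


example = {
    21: [40, 62, 723, 724, 305],   # One Piece
    466: [40, 62, 723],            # One Piece: Taose! Kaizoku Ganzack
    5252: [40],                    # One Piece: Romance Dawn Story
    7: [299, 300],                 # Witch Hunter Robin (two of its characters)
}
clusters = cluster_by_shared_characters(example)

# Hypothetical chain: A and C share no character but are linked through B.
chain = cluster_by_shared_characters({"A": [1, 2], "B": [2, 3], "C": [3, 4]})
print(clusters, chain)
```

The chain example shows why a single pass over pairs of animes is not enough: A and C end up in the same cluster only because B connects them.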

Finally, the train-test split was performed randomly, but each cluster was completely contained within a single split. The training set has 6,345 samples (81.5%) while the test set has 1,439 samples (18.5%). Despite the random splitting, both sets have very similar statistics, as shown in Table 8 and Figure 3 for the synopsis wordcount, and in Table 9 and Figure 4 for the character wordcount.
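A cluster-level split can be sketched as follows. The paper only states that the split was random and that no cluster was divided; the greedy filling of the test set up to a target fraction, the function name and the toy clusters are assumptions made for this example.

```python
import random


def split_by_cluster(clusters, test_fraction=0.185, seed=42):
    """Randomly assign whole clusters to train or test, never splitting a cluster."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    total = sum(len(cluster) for cluster in shuffled)

    train, test = [], []
    for cluster in shuffled:
        # Add the cluster to the test set only while the target size is not exceeded.
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Toy clusters of anime IDs (hypothetical), ten samples in total.
toy_clusters = [[1, 2, 3], [4], [5, 6], [7], [8], [9], [10]]
train_ids, test_ids = split_by_cluster(toy_clusters, test_fraction=0.3)
print(len(train_ids), len(test_ids))
```

Because whole clusters are assigned at once, the achieved test fraction can only approximate the target, which is consistent with the 81.5% and 18.5% proportions reported above.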

Table 8: Statistics for the training and test set.

Training Set

| Score | Samples | Max. Words | Min. Words | Avg. Words |
|---|---|---|---|---|
| 1-2 | 1 (0.01%) | 55 | 55 | 55 |
| 2-3 | 5 (0.07%) | 228 | 25 | 124.6 |
| 3-4 | 9 (0.14%) | 166 | 24 | 82.22 |
| 4-5 | 64 (1%) | 244 | 25 | 75.53 |
| 5-6 | 749 (11.87%) | 329 | 24 | 81.4 |
| 6-7 | 2,539 (40%) | 384 | 24 | 97.54 |
| 7-8 | 2,487 (39.19%) | 581 | 25 | 121.14 |
| 8-9 | 483 (7.6%) | 340 | 29 | 151.38 |
| 9-10 | 8 (0.12%) | 188 | 134 | 157.75 |
| Total | 6,345 (100%) | 581 | 24 | 105.17 |

Test Set

| Score | Samples | Max. Words | Min. Words | Avg. Words |
|---|---|---|---|---|
| 1-2 | 1 (0.07%) | 68 | 68 | 68 |
| 2-3 | 1 (0.07%) | 52 | 52 | 52 |
| 3-4 | 2 (0.14%) | 87 | 61 | 74 |
| 4-5 | 16 (1.11%) | 202 | 27 | 92.31 |
| 5-6 | 176 (12.23%) | 254 | 26 | 92.03 |
| 6-7 | 592 (41.13%) | 397 | 25 | 96.56 |
| 7-8 | 534 (37.10%) | 277 | 24 | 122.07 |
| 8-9 | 113 (7.88%) | 234 | 35 | 142.39 |
| 9-10 | 4 (0.27%) | 189 | 133 | 156.25 |
| Total | 1,439 (100%) | 397 | 24 | 99.51 |

Table 9: Statistics for the characters.

| Split | Total Characters | Max. Words | Min. Words | Avg. Words |
|---|---|---|---|---|
| Train Set | 14,168 | 3,551 | 4 | 120.51 |
| Test Set | 3,288 | 2,274 | 4 | 124.84 |
| Total | 14,682 | 3,551 | 4 | 121.34 |

Figure 3: Mean Synopsis Wordcount in the training (left) and test (right) set. X-axis represents the floor score. Y-axis is the mean words on synopsis per score.

Figure 4: Mean Character Wordcount in the training (left) and test (right) set. X-axis represents the floor score. Y-axis is the mean characters per score.

Experimental Setup

All neural networks were implemented with PyTorch 1.10.1 [30] and run on an NVIDIA Quadro RTX 6000 GPU with 46 GB of VRAM. The three-input deep neural network used the Huggingface models [31, 32]. Five experiments were conducted: the benchmark with the traditional vectorizations (Trad) and the three-input deep neural network with all inputs (Full), with only synopsis (GPT-2 + MLP), only portraits (ResNet-50 + MLP), and only descriptions (GPT-2 + MLP). For synopsis and descriptions (text inputs), the large MLP depicted in Table 6 was modified by removing the First layer. For portraits (image inputs), the small MLP depicted in Table 4 served as the base, with two additional linear layers with ReLU activation functions and output shapes of 384 and 1 respectively, added for regression purposes. Finally, for the traditional methods, the MLP depicted in Table 5 was used. All experiments used the same hyperparameters, depicted in Table 10, except for the number of epochs: the Trad experiment was trained for 30 epochs instead of the 5 used in the other experiments. Additionally, the scores were scaled to range between 0 and 1, with 0 assigned to the minimum score and 1 to the maximum.
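The score scaling can be illustrated with a short min-max sketch. It assumes that the extremes are the dataset-wide minimum and maximum scores, 1.86 and 9.06, which are the values reported in the Discussion; whether the authors computed them over the whole dataset or only over the training set is not stated. The function names are hypothetical.

```python
# Dataset-wide extremes reported in the Discussion (assumed to define the scaling).
SCORE_MIN, SCORE_MAX = 1.86, 9.06


def scale_score(score):
    """Map a MAL weighted average score to the [0, 1] range used for training."""
    return (score - SCORE_MIN) / (SCORE_MAX - SCORE_MIN)


def unscale_score(scaled):
    """Map a scaled prediction back to the original MAL score range."""
    return scaled * (SCORE_MAX - SCORE_MIN) + SCORE_MIN


# Witch Hunter Robin (Table 3) has a score of 7.25.
scaled_example = scale_score(7.25)
print(round(scaled_example, 4), round(unscale_score(scaled_example), 2))
```

The inverse mapping is what allows a prediction of the network to be read again as a score on the MAL scale.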

Figure 5: Full Three-input Deep Neural Network

Results

Results are depicted in Table 11. Learning curves for each experiment can be found in Figure 6. As anticipated in the Introduction, the best model was Full, with the best metrics and the best fit in the learning curve, showing neither overfitting nor underfitting after the first epoch. All non-traditional experiments outperformed the benchmark MSE, and the image-only model was the only one that did not surpass the benchmark correlation coefficients.

| Parameter | Value |
|---|---|
| Seed | 42 |
| Synopsis Pretrained Model | 'GPT2' |
| Char. Desc. Pretrained Model | 'GPT2' |
| Image Pretrained Model | 'microsoft/resnet-50' |
| Synopsis Tokenizer Max. Length | 128 (GPT2tokenizer) |
| Char. Desc. Tokenizer Max. Length | 256 (GPT2tokenizer) |
| Image Processors Parameters | Default (AutoImageProcessor) |
| Optimizer | Adam with weight decay (AdamW) |
| Opt. Learning Rate | 5e-2 |
| Opt. Epsilon param | 1e-8 |
| Loss Function | Mean Squared Error (MSEloss) |
| Batch Size | 16 |
| Epochs | 5 (30 for traditional vectorizations) |

Table 10: All employed parameters in the experiments.

Popularity Prediction

| Architecture | Input | MSE | Spearman | Pearson | Kendall's Tau |
|---|---|---|---|---|---|
| Full | All | 0.011 | **0.431** | **0.436** | 0.297 |
| GPT-2+MLP | Syn. | 0.012 | **0.338** | **0.328** | 0.230 |
| GPT-2+MLP | Char. (Desc.) | 0.012 | **0.307** | **0.341** | 0.210 |
| ResNet-50+MLP | Char. (Img.) | 0.028 | 0.096 | 0.121 | 0.065 |
| Trad (benchmark) | All | 0.412 | 0.195 | 0.183 | 0.130 |

Table 11: All experiments, sorted by mean squared error (MSE): the lower the value, the better the result. Moderate correlations are highlighted in bold.

Discussion

Recall that the MSE is the mean of the squared differences between the real and the predicted scores. Hence, a value closer to 0 is better, as long as the learning curve shows neither overfitting nor underfitting.
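To give the reported MSE values an intuitive reading, the sketch below converts an MSE on scaled scores into a root mean squared error expressed in MAL points. This is only an interpretation aid: it assumes that the reported MSE was computed on the min-max scaled scores and that the scaling range is 9.06 - 1.86 = 7.2, the difference between the maximum and minimum scores of the dataset. The function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between real and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Assumed scaling range: maximum minus minimum score of the dataset.
SCORE_RANGE = 9.06 - 1.86


def rmse_in_mal_points(scaled_mse):
    """Root mean squared error rescaled to the original MAL score range."""
    return math.sqrt(scaled_mse) * SCORE_RANGE


full_rmse = rmse_in_mal_points(0.011)   # Full model, Table 11
trad_rmse = rmse_in_mal_points(0.412)   # Trad benchmark, Table 11
toy_mse = mse([0.0, 1.0], [0.0, 0.5])   # toy values, not from the paper
print(round(full_rmse, 2), round(trad_rmse, 2), toy_mse)
```

Under these assumptions, the Full model is off by roughly three quarters of a MAL point on average, whereas the benchmark is off by more than four points.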

As shown in Figure 6, all models that did not use the full inputs exhibited underfitting, with the image-only model (the smallest one) being the most critical case. Recall that underfitting happens when the model is not big enough to capture the complexity of the dataset. Hence, these learning curves confirm the hypothesis stated in the introduction about the need for larger, deeper and more complex models. This is also promising with regard to the use of LLMs such as Llama2 [33] or GPT4 [34].

Correlations

According to [35], the correlation values can be interpreted with the following rule of thumb: 0 to 0.29 indicates a small correlation, 0.30 to 0.49 indicates a moderate correlation and 0.50 to 1 indicates a strong correlation. At a finer grain, the authors of [36] used another rule of thumb for the Spearman correlation coefficient: 0 to 0.19: very weak; 0.2 to 0.39: weak; 0.4 to 0.59: moderate; 0.6 to 0.79: strong; 0.8 to 1: very strong. In this work, the following hybrid rule of thumb is considered:

  • 0 to 0.19 for a very weak correlation.

  • 0.2 to 0.29 for a weak correlation.

  • 0.3 to 0.49 for a moderate correlation.

  • 0.5 to 1 for a strong correlation.

All the inputs together demonstrated a moderate correlation with the MAL score. Images showed the weakest correlation, further supporting the need for larger models. Not surprisingly, the text-only inputs also showed a moderate correlation.
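The hybrid rule of thumb can be written as a small lookup function. Two details are assumptions of this sketch and not of the paper: the thresholds are placed at 0.2, 0.3 and 0.5, so that values falling between the listed ranges (for example 0.195) are assigned to the lower band, and the absolute value is used so that negative coefficients are classified by magnitude.

```python
def correlation_strength(coefficient):
    """Label a correlation coefficient with the hybrid rule of thumb."""
    magnitude = abs(coefficient)
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    return "strong"


# Spearman and Kendall's Tau values taken from Table 11.
labels = {
    "Full (Spearman)": correlation_strength(0.431),
    "Full (Kendall)": correlation_strength(0.297),
    "Images (Spearman)": correlation_strength(0.096),
    "Trad (Spearman)": correlation_strength(0.195),
}
for name, label in labels.items():
    print(name, "->", label)
```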

Figure 6: Learning curves for all experiments. From top-left to bottom-right: All inputs (good fit), Syn+MLP (slightly underfit), Char+MLP (slightly underfit), Img+MLP (notable underfit) and traditional methods (catastrophic underfit, as expected due to the size and complexity of the dataset).

A fine-grained analysis of each metric shows that the moderate Pearson and Spearman correlations obtained with text-based inputs suggest that the latent space generated by the larger MLP was moderately successful in linearizing the problem, indicating that it may be possible to linearize it completely with richer and more complex models. Adding images significantly strengthens this linear relationship (adding about 0.1 points), showing that the visual information of the main characters may be crucial for popularity. However, it is noteworthy that images by themselves had a very weak correlation, underscoring the essential role of text descriptions. Spearman, which behaves similarly, also adds ranking information.
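For reference, the three coefficients can be computed with the standard library alone. The sketch below is a simplified version that assumes no tied values: Spearman is computed as the Pearson correlation of the ranks, and Kendall's Tau is the tau-a variant, which counts concordant and discordant pairs. It is not the implementation used in the paper, and the two score lists are made-up toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """Rank positions starting at 1 (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy scaled scores (hypothetical): one pair of predictions is in the wrong order.
real = [0.1, 0.4, 0.5, 0.7, 0.9]
predicted = [0.2, 0.3, 0.6, 0.5, 0.8]
rho = spearman(real, predicted)
tau = kendall_tau(real, predicted)
print(round(pearson(real, predicted), 3), round(rho, 3), round(tau, 3))
```

Even in this small example Kendall's Tau is lower than Spearman for the same ordering error, the same pattern seen in every row of Table 11.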

Finally, the Kendall’s Tau correlation helps to better understand the dataset as a whole. It is noteworthy that the minimum score in the dataset is 1.86 while the maximum is 9.06. The significantly lower results for this metric suggest that these outliers have a considerable impact during the training process. The statistical difference between these outliers is evident in Figure 2, as well as in Table 1, where the two samples with scores between 1 and 2 have 55–68 words, while the twelve samples with scores between 9 and 10 have 133–189 words. These findings encourage the use of class weights in the logits to address such score outliers and alleviate related problems. However, this measure might be artificial and could lead to unexpected biases, as it is a product of a weakness in the dataset, so it was not implemented in this work.
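
Although weighting was not implemented in this work, one possible reading of the idea can be sketched as per-sample weights in a regression loss. Everything below is an assumption made for illustration: the function names are hypothetical, scores are grouped into integer bins, and each sample is weighted by the inverse frequency of its bin so that rare extreme scores contribute more to a weighted MSE.

```python
from collections import Counter


def score_bin(score: float) -> int:
    """Integer bin of a score, e.g. 1.86 -> 1 and 9.06 -> 9 (assumed binning)."""
    return int(score)


def inverse_frequency_weights(scores):
    """Weight each sample by the inverse frequency of its score bin.

    The weights are rescaled so that they average to 1 over the dataset.
    """
    counts = Counter(score_bin(s) for s in scores)
    raw = [1.0 / counts[score_bin(s)] for s in scores]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mse(targets, predictions, weights):
    """Mean squared error where each squared residual is scaled by its weight."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(targets, predictions, weights))
    return total / sum(weights)
```

With such weights, a sample from a sparsely populated bin receives a larger weight than samples from a crowded bin, which is precisely the kind of artificial correction whose possible biases are noted above.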

A Memory Constraint with Transformer-based Models

Although the correlation metrics showed that main character descriptions have a considerable correlation with the labels, the proposed GPT-2 model presents the second most pronounced underfitting, only behind the main character portraits. Table 10 shows that the maximum length for the tokenizer of main character descriptions is 256. However, some of these descriptions, when concatenated, can surpass 1,000 words, as shown in the example in Table 3. Consequently, this underfitting can be explained by the loss of crucial information.

Due to the memory limitations of transformer-based models, it is not trivial to increase the tokenizer size. Therefore, further work should explore more memory-efficient methods capable of capturing all information with similar learning capabilities (e.g., Mamba [37]) or find other alternatives to include all main character information simultaneously (e.g., generating mean tensors and assessing how much relevant information is truly lost from doing that).
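
The mean-tensor alternative mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of this work: all function names are hypothetical, each character description is assumed to be already tokenized into integer ids, and `toy_embed` is a stand-in for a real language-model embedding (it is not GPT-2).

```python
from typing import Callable, List, Sequence

Vector = List[float]


def truncate(tokens: Sequence[int], max_length: int = 256) -> List[int]:
    """Keep at most max_length tokens, as a fixed-size tokenizer window would."""
    return list(tokens[:max_length])


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all vectors must have the same size")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(size)]


def embed_characters(
    descriptions: Sequence[Sequence[int]],
    embed: Callable[[List[int]], Vector],
    max_length: int = 256,
) -> Vector:
    """Embed each character description in its own window, then average."""
    return mean_vector([embed(truncate(tokens, max_length)) for tokens in descriptions])


def toy_embed(tokens: List[int]) -> Vector:
    """Hypothetical stand-in for a language-model embedding (not GPT-2)."""
    return [float(len(tokens)), float(sum(tokens))]
```

In this arrangement every character gets a full 256-token window instead of sharing one window after concatenation; how much information the averaging step itself discards remains the open question raised above.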

Conclusions

In this paper, one of the most robust free datasets for anime popularity prediction is introduced, relying solely on freely available internet data. To explore this dataset and propose a model for solving the task, a deep neural network leveraging GPT-2 and ResNet-50 was developed. The best MSE achieved was 0.011, which is significantly lower than the benchmark MSE of 0.412.

Moderate correlations were achieved with the Spearman and Pearson metrics, while a low correlation with Kendall’s Tau revealed important quirks in the dataset, such as the marginal representation of anime with extreme scores. These issues should be taken into account when designing future models to solve this task. They also align with the claims in [17] that recommender systems can be significantly enhanced by solving the sparsity problem in data, which is also present in non-free datasets.

The correlation coefficients also demonstrated that adding character portraits significantly enhances text-based inputs, even when embedded with a relatively small model such as ResNet-50.

Further work can be summarized as follows:

  • The memory problem with Main Character Descriptions: Truncating main character descriptions to 256 tokens significantly impacts the results due to the information lost, which is reflected in the underfitting. This problem, closely related to the memory constraint of transformer-based models, can be approached in two ways:

    • To experiment with other language models with lower memory requirements, e.g., n-gram vectorizations or selective state spaces (e.g., Mamba).

    • To adjust the input to embed each main character individually. These individual tensors can then be processed in a myriad of ways, such as averaging them. The information lost after such processes is an important topic to be explored.

  • Larger models for processing the inputs: GPT-2 can be upgraded to GPT-4 or Llama2, and ResNet-50 can be replaced by its larger sibling ResNet-152, a very deep pretrained CNN, or by a vision transformer such as Image-GPT. The moderate Pearson and Spearman coefficients indicate that leveraging larger models is likely to benefit the results significantly.

Acknowledgments

The work was done with partial support from the Mexican Government through the grant A1-S-47854 of CONACYT, Mexico, and grants 20241816, 20241819, and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

References

  • [1] Box office of Dragon Ball Z: Bio-Broly. https://en.wikipedia.org/wiki/Dragon_Ball_Z:_Bio-Broly.
  • [2] Budget of Dragon Ball Z: Bio-Broly. https://jump.fandom.com/wiki/Dragon_Ball_Z:_Bio-Broly.
  • [3] Oricon 2018 ranking. Available at: oricon.co.jp/confidence/special/52166/7/.
  • [4] Guinness record of Kimetsu no Yaiba. Available at: guinnessworldrecords.com/news/2023/5/fantasy-anime-demon-slayer-makes-history-with-incredible-box-office-record-750723/.
  • [5] HEATHENS’ Kickstarter (snapshot taken at May 9 2024). Available at: web.archive.org/web/20240508180948/www.kickstarter.com/projects/heathens/heathens-indie-animated-pilot.
  • [6] Syd Field. Screenplay: The Foundations of Screenwriting. A Delta book. Delta Trade Paperbacks, 2005.
  • [7] Hideo Watanabe. Directing at anime/animation studios: Techniques and methods. Archiving Movements: Short Essays on Anime and Visual Media Materials V.2, page 4, 2020.
  • [8] Faye Valentine’s original sketches. Available at: https://www.youtube.com/watch?v=CaXqcLNvnLk.
  • [9] Jesus Armenta-Segura, César Jesús Núnez-Prado, Grigori Olegovich Sidorov, Alexander Gelbukh, and Rodrigo Francisco Román-Godínez. Ometeotl@ multimodal hate speech event detection 2023: Hate speech and text-image correlation detection in real life memes using pre-trained bert models over text. In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, pages 53–59, 2023.
  • [10] Suraj Maharjan, Sudipta Kar, Manuel Montes, Fabio A. González, and Thamar Solorio. Letting emotions flow: Success prediction by modeling the flow of emotions in books. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 259–265, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
  • [11] Joyeta Sharma and Abu Nowshed Chy. Exploiting web snippets for multi-label anime genre prediction. In 2021 IEEE 18th India Council International Conference (INDICON), pages 1–6, 2021.
  • [12] Satyendra Kumar Sharma, Swapnajit Chakraborti, and Tanaya Jha. Analysis of book sales prediction at amazon marketplace in india: a machine learning approach. Information Systems and e-Business Management, 17:261–284, 2019.
  • [13] Ruddy Théodose and Jean-Christophe Burie. Kangaiset: A dataset for visual emotion recognition on manga. In International Conference on Document Analysis and Recognition, pages 120–134. Springer, 2023.
  • [14] Xindi Wang, Burcu Yucesoy, Onur Varol, Tina Eliassi-Rad, and Albert-László Barabási. Success in books: predicting book sales before publication. EPJ Data Science, 8(1):1–20, 2019.
  • [15] Xing Wang, Shouhua Zhang, and Ivan Smetannikov. Fiction popularity prediction based on emotion analysis. In Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System, pages 169–175, 2020.
  • [16] Francesco Ricci, Lior Rokach, and Bracha Shapira. Recommender Systems: Techniques, Applications, and Challenges, pages 1–35. Springer US, New York, NY, 2022.
  • [17] Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Jundong Li, and Zi Huang. Self-supervised learning for recommender systems: A survey. IEEE Transactions on Knowledge and Data Engineering, 36(1):335–355, 2024.
  • [18] Sonu Airen and Jitendra Agrawal. Movie recommender system using parameter tuning of user and movie neighbourhood via co-clustering. Procedia Computer Science, 218:1176–1183, 2023. International Conference on Machine Learning and Data Engineering.
  • [19] Webpage of Anilist. Available at: https://anilist.co/.
  • [20] Python Imaging Library (Pillow’s Fork). https://python-pillow.org/.
  • [21] Webpage of My Anime List. Available at: https://myanimelist.net/.
  • [22] You Jin Kim, Yun Gyung Cheong, and Jung Hoon Lee. Prediction of a movie’s success from plot summaries using deep learning models. In Francis Ferraro, Ting-Hao ‘Kenneth’ Huang, Stephanie M. Lukin, and Margaret Mitchell, editors, Proceedings of the Second Workshop on Storytelling, pages 127–135, Florence, Italy, August 2019. Association for Computational Linguistics.
  • [23] Jesús Armenta-Segura and Grigori Sidorov. Anime Success Prediction Based on Synopsis Using Traditional Classifiers. In Proceedings of Congreso Mexicano de Inteligencia Artificial, COMIA, 2023.
  • [24] Jung-Hoon Lee, You-Jin Kim, and Yun-Gyung Cheong. Predicting quality and popularity of a movie from plot summary and character description using contextualized word embeddings. In 2020 IEEE Conference on Games (CoG), pages 214–220, 2020.
  • [25] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [26] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015.
  • [27] Leonard Richardson. Beautiful Soup 4, 2007. Documentation available at: https://crummy.com/software/BeautifulSoup/.
  • [28] MAL’s weighted average score defined. Available at: https://myanimelist.net/info.php?go=topanime.
  • [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • [31] Huggingface implementation of GPT-2. Available at: https://huggingface.co/gpt2.
  • [32] Huggingface implementation of ResNET-50. Available at: https://huggingface.co/microsoft/resnet-50.
  • [33] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
  • [34] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [35] Jacob Cohen. Statistical power analysis. Current directions in psychological science, 1(3), 1992.
  • [36] Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif M. Mohammad. What makes sentences semantically related? a textual relatedness dataset and empirical study. ArXiv, abs/2110.04845, 2021.
  • [37] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.