Anime Popularity Prediction Before Huge Investments: A Multimodal Approach Using Deep Learning

Jesús Armenta-Segura, Grigori Sidorov
Instituto Politécnico Nacional (IPN),
Centro de Investigación en Computación (CIC),
Mexico City, Mexico
{jarmentas2022, sidorov}@cic.ipn.mx

Abstract

In the Japanese anime industry, predicting whether an upcoming product will be popular is crucial. This paper presents a dataset and methods for predicting anime popularity using a multimodal text-image dataset constructed exclusively from freely available internet sources. The dataset was built following rigorous standards based on real-life investment experiences. A deep neural network architecture leveraging GPT-2 and ResNet-50 to embed the data was employed to investigate the correlation between the multimodal text-image input and a popularity score, revealing relevant strengths and weaknesses in the dataset. Model accuracy was measured with the mean squared error (MSE): the best result, $0.011$, was obtained when considering all inputs and the full version of the deep neural network, compared to a benchmark MSE of $0.412$ obtained with traditional TF-IDF and PILtotensor vectorizations. This is the first proposal to address this task with multimodal datasets, and it reveals the substantial benefit of incorporating image information, even when a relatively small model (ResNet-50) is used to embed it.

Keywords Anime $\cdot$ Entertainment $\cdot$ Regression $\cdot$ Multimodal $\cdot$ Computer Vision $\cdot$ Natural Language Processing $\cdot$ Popularity Prediction

Introduction

One of the most crucial aspects of the Japanese animation industry, or anime industry, is the release of successful and profitable products. To achieve this, several techniques can be employed, such as cultivating a fan base, where companies foster the popularity of the product among a certain demographic through marketing campaigns. This practice ensures a base of devoted customers, significantly reducing the sales risk. As a consequence, the development of successful popularity prediction systems can help investors make better decisions and can help animation houses avoid financial catastrophes while creating relevant, profitable and successful franchises.

When addressing this task, it becomes clear that there are no straightforward methods to predict the popularity of an upcoming anime. For instance, Toei Animation’s 1994 film Dragon Ball Z: Bio-Broly lost more than $\yen 9$ billion [1, 2], despite being part of the globally acclaimed Dragon Ball Z franchise and featuring Broly, one of its most popular antagonists. Conversely, MAPPA Studios’ 2018 anime Kimetsu No Yaiba, based on a relatively unknown manga not among the top 50 best-selling mangas in the Oricon 2018 ranking [3], achieved unprecedented success with its 2020 film Kimetsu no Yaiba: Mugen Train, setting a worldwide Guinness Record at the box office [4], even amidst the challenges posed by the Covid-19 pandemic. These two examples show that the phenomenon is not yet understood well enough to design deterministic approaches for efficient and accurate popularity prediction.

Another relevant constraint is the limited set of features accessible to a system that must predict popularity before huge monetary investments are made. To begin with, animation is expensive by nature: even a small project such as the pilot of the indie animation HEATHENS is projected to cost $\$39,456$ USD (or $\$60,000$ AU) according to its crowdfunding webpage [5]. Consequently, investors sometimes must make decisions even before a script is written, as is common in other entertainment industries like Hollywood, where decisions are based on a concise four-line description of the future movie plot [6]. In the case of animation, these four-liners might be accompanied by brief sketches of the possible main characters along with a short description of them [7]. As a consequence, any system developed to predict popularity before huge investments must rely only on this limited set of features. Figure 1 depicts an example of such sketches.

Figure 1: Visual representations of the anime character Faye Valentine, from Cowboy Bebop, during early stages of development. a) Her portrait in MyAnimeList. b) Her character sketch designs [8] with several poses, angles and facial emotions. Animators require more than this level of detail for further reference [7], although it is not accessible during early stages of development.

In order to address all these quirks and constraints, this work proposes a statistical analysis founded on artificial intelligence (AI) methods, specifically machine learning and deep learning models. In recent times, AI has become highly relevant for discovering how a complex set of variables relates to a given dependent variable [9, 10, 11, 12, 13, 14, 15]. With respect to the popularity prediction task, one of the closest related AI problems is the development of recommender systems, which are methods that suggest the best possible products to potential users or customers [16]. When such systems are profile-based, they can determine the probability of a product being a suitable option for a consumer given their demographics or, in other words, how successful the product might be for that particular user or demographic. Although tackling the popularity prediction task through (adapted) recommender systems may be tempting, they focus on a large set of demographics, which leads to significant sparsity in their datasets [17]. Moreover, these datasets include features only accessible once the product is released, such as Producers, Duration, or Casting. Such features have shown a strong correlation with success [14, 18], but are only available in the context of a recommender system for a streaming service, so the problem requires a different approach from its statement through to its solution.

As a consequence, the first challenge for the popularity prediction task is the design of a suitable dataset. Across the internet, several anime databases containing equivalent information are available: plot synopses as free substitutes for the four-liners, and in-anime snapshots with fan-made descriptions as free substitutes for the main character sketches. The most relevant of these databases are AnimeNewsNetwork’s encyclopedia [19], which has more than $28,000$ entries but detailed user rating statistics for only a small subset of them; AniList [20], with more than $15,000$ samples and detailed statistics about user ratings but strict policies against web scraping; and MyAnimeList [21], the oldest and most active platform (according to www.similarweb.com), encompassing more than $25,000$ entries with detailed information about user ratings for a considerable subset of them. Moreover, MyAnimeList (MAL) did not have explicit policies against web crawling, at least as of January 2024. Hence, MAL is the prudent choice for gathering freely available data.

Once the dataset is obtained, the next challenge is selecting a suitable AI model for the system. There have been several proposals in related entertainment industries, such as movies [22] or books [10], and even in the animation industry itself [23]. All these works employ relatively small models of low-to-medium complexity and treat the task as a classification problem, with artificially crafted dependent variables that simplify it. From these experiences it is possible to infer the need for large and complex models, as shown in [23], where the authors employed traditional classifiers over anime plot summaries and demonstrated the huge underlying complexity hidden in this kind of data, or as implicitly shown in [22], where the authors employed a two-branched neural network with ELMo and BiLSTM or CNN to predict binary movie success from plot summaries, and then improved their results by leveraging BERT-based models in [24].

In line with these findings, this work opted for a deep neural network approach, leveraging GPT-2 [25] for text analysis and ResNet-50 [26] for image processing. Additionally, in order to obtain the most realistic results possible, the artificially crafted classifications, useful for baseline purposes but a poor fit to reality, are set aside in favor of a regression perspective. Furthermore, this work also presents a benchmark for this model using traditional vectorizations, such as TF-IDF for texts and PILtotensor for images.

In the remaining sections of the paper, all described procedures are examined in detail. The construction of the corpus is explained, alongside the experimental setup of a three-input deep neural network used to benchmark it. To evaluate the model’s performance, mean squared error (MSE) is employed to measure the prediction error. Additionally, Pearson, Spearman, and Kendall’s Tau correlation coefficients are used to study the impact of the processed features on the score. The best results were obtained when considering all inputs, yielding an MSE of $0.011$, a Spearman correlation coefficient of $0.431$, a Pearson correlation coefficient of $0.436$, and a Kendall’s Tau correlation coefficient of $0.297$.

Background

The Anime Corpus

For each anime in the MAL database, a Python script scraped its title, plot summary, weighted average score and all of its main character names, descriptions and portraits, as shown in Figure 1a). All samples without this information were discarded. The scraping process started on December 28, 2023, at 0:03 UTC, and finished on January 3, 2024, at 14:31 UTC. The script was implemented using the BeautifulSoup4 Python library [27]. The final result was $11,873$ animes with $21,329$ main characters.

Once the data was obtained, a cleaning process was performed. First, all characters with no useful description, such as the text $No\ description\ available$, were removed, as well as characters with no portrait. Then, all animes with no score, synopsis, title or associated main characters were also removed, along with samples whose plot summaries had fewer than $20$ words. This process reduced the scraped data to $7,784$ animes and $14,682$ characters. The characteristics of the final dataset and its statistics are presented in Table 1 and Figure 2 with respect to the synopses, and in Table 2 and Figure 2 with respect to the main characters.

Table 1: Statistics for the corpus. Wordcount refers to synopsis.

Score	Samples	Max. Words	Min. Words	Avg. Words
$1-2$	$2$	$68$	$55$	$61.50$
$2-3$	$6$	$228$	$25$	$112.50$
$3-4$	$11$	$166$	$24$	$80.72$
$4-5$	$80$	$244$	$25$	$78.88$
$5-6$	$925$	$329$	$24$	$83.42$
$6-7$	$3,131$	$397$	$24$	$97.35$
$7-8$	$3,021$	$581$	$24$	$121.30$
$8-9$	$596$	$340$	$29$	$149.68$
$9-10$	$12$	$189$	$133$	$157.25$
Total	$7,784$	$581$	$24$	$104.73$

Table 2: Statistics for the characters.

Total Characters	Max. Words	Min. Words	Avg. Words
$14,682$	$3,551$	$4$	$121.34$

The MAL Weighted Average Score as Golden Label

The most notable MAL statistic for measuring popularity is the weighted average score, calculated from all user ratings of each anime. This score [28] works in a straightforward way: a user of the database can rate an anime on a scale from $0$ to $10$, based on their personal opinion of it. To avoid bots, the system requires the user to watch at least a fifth of the anime before scoring it, which can be done by manually marking the episodes as already seen. From this, it is possible to propose a naive measure of liking as the mean of all ratings:

$$S=\frac{\textrm{Sum of all user scores of the anime}}{v}, \tag{1}$$

where $v$ is the total number of voters. However, this measure can be biased, since it does not consider the statistical relevance of the population who scored the anime. An example of a biased score is a hypothetical upcoming anime that receives all of its first ratings from its producers, who might be interested in giving generous scores regardless of the true merit of their product. To tackle this bias, MAL weights $S$ in terms of how many people have watched and scored the anime. They set a statistical bound $m=50$ and define the weight of $S$ as:

$$s=\left(\frac{v}{v+m}\right). \tag{2}$$

Hence, as more people score the anime, $s$ tends to $1$ and the relevance of $S$ increases. When $v=1$, the weight reaches its minimum nonzero value $1/(m+1)$, which is also the average statistical importance of a single rating.

MAL also weights the general importance of its whole community by calculating, twice per day, the following default score, based on all ratings across the database:

$$C=\frac{\textrm{Sum of all valid scores in the database}}{\textrm{Total number of valid scores in the database}}. \tag{3}$$

The value of $C$ when the data scraping finished (Jan 3, 2024) was $6.605$. This default score represents a very coarse rating for an upcoming anime that nobody has watched yet or whose fandom is statistically insignificant. Its weight is defined as follows:

$$c=\left(\frac{m}{v+m}\right). \tag{4}$$

As more people score the anime, $C$ loses relevance in the score. In other words, as $s$, the weight of the scores given by the users, grows, $c$ tends to $0$.

Finally, the weighted average score of an anime is defined as follows:

$$W=\left(\frac{v}{v+m}\right)S+\left(\frac{m}{v+m}\right)C. \tag{5}$$

This shows the high quality of the MAL metrics for measuring popularity. Considering that MAL gathers members from all around the world and is the most visited anime portal on the internet, any output generated by a method trained with this dataset should be interpreted as popularity across (a huge part of) the internet. However, since this score does not consider the demographic statistics of the voters, it may not be a suitable measure for a profile-based recommender system.
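The weighting in Equations 1 to 5 can be illustrated with a minimal Python sketch. The function name `weighted_score` and the toy vote lists are hypothetical; only $m=50$ and $C=6.605$ come from the text, and this is not MAL's actual implementation.

```python
def weighted_score(user_scores, community_mean, m=50):
    """Weighted average W = v/(v+m) * S + m/(v+m) * C (Equation 5)."""
    v = len(user_scores)
    if v == 0:
        # With no votes the weight of S is 0, so W reduces to C.
        return community_mean
    s_mean = sum(user_scores) / v   # Equation 1: naive mean S
    weight_s = v / (v + m)          # Equation 2: weight of S
    weight_c = m / (v + m)          # Equation 4: weight of C
    return weight_s * s_mean + weight_c * community_mean


C = 6.605  # value of C reported in the text for Jan 3, 2024

# Hypothetical votes: five perfect ratings versus five thousand.
few_votes = weighted_score([10] * 5, C)
many_votes = weighted_score([10] * 5000, C)
print(round(few_votes, 3), round(many_votes, 3))
```

With only five voters the result stays close to $C$, while with five thousand voters it approaches the naive mean $S$, which is the behaviour described for the producers example above.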

The Deep Neural Network

As stated in the introduction, a three-input deep neural network was designed to solve the regression task of predicting the MAL Weighted Average Score given the synopsis and the main character portraits and descriptions. Table 3 depicts an example of the full input of an anime. For each input, a sequential set of layers was employed:

Table 3: Example of the input for the anime Witch Hunter Robin with ID $7$ and Score $7.25$.

Synopsis (with wordcount)
Name (with MAL ID)	Description (with wordcount)
Robin Sena (299)	Robin Sena is a soft-spoken 15-year-old girl with $\dots$ (144 words)
Amon (300)	Amon is a Hunter and is also Robin’s partner. (412 words)
Michael Lee (301)	Michael is a hacker and the technical support $\dots$ (133 words)
Haruto Sakaki (302)	Haruto Sakaki is an 18-year-old Hunter working with the $\dots$ (167 words)
Miho Karasuma (303)	The second in command. Miho is a 19-year-old hunter $\dots$ (73 words)
Yurika Doujima (304)	Doujima is portrayed as as carefree, lazy, vain, and $\dots$ (205 words)
Total wordcount: 1,134
Concatenated Portraits

• Synopsis: The GPT-2 pretrained model. Output shape: $768$.
• Main Character descriptions: The GPT-2 pretrained model. Output shape: $768$.
• Main Character portraits: The ResNET-50 pretrained model, flattened at the end. Output shape: $7\times 7$.

Once embedded with their corresponding sequential layers, the Main Character inputs were concatenated and passed to a Multilayer Perceptron (MLP) whose specifications are depicted in Table 4 and Figure 5. This MLP is designed to merge both Main Character inputs and process them into a unified $768$-dimensional embedding.

Layer Name	Type	Input Shape	Output Shape	Act. Funct.	Connects with
First	Dropout $(0.1)$	$768+49$	$768+49$	–	Second
Second	Linear	$768+49$	$768$	TanH	Third
Third	Dropout $(0.1)$	$768$	$768$	–	Fourth
Fourth	Linear	$768$	$768$	TanH	(output layer)

Table 4: Specifications of the MLP for Main Characters embeddings.

Finally, the main character output is concatenated with the synopsis GPT-2 embeddings and passed through a larger MLP (specifications in Table 6), designed to gradually reduce the dimension until it becomes a single value suitable for regression.
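As a bookkeeping aid, the layer sizes listed in Tables 4 and 6 can be checked for consistency and their trainable parameters counted. The sketch below is plain Python rather than the authors' PyTorch code; the helper names are hypothetical, and it assumes that every linear layer carries a bias term, which the tables do not state.

```python
# Linear layers as (input size, output size), copied from Tables 4 and 6.
# Dropout layers are omitted because they have no trainable parameters.
CHARACTER_MLP = [(768 + 49, 768), (768, 768)]  # Table 4
REGRESSION_MLP = [                             # Table 6
    (768 + 768, 768), (768, 384), (384, 192), (192, 96), (96, 48),
    (48, 24), (24, 12), (12, 6), (6, 3), (3, 1),
]


def check_chain(layers):
    """True when each layer's input size equals the previous output size."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(layers, layers[1:]))


def count_parameters(layers):
    """Weights plus biases of fully connected layers (bias is an assumption)."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)


character_params = count_parameters(CHARACTER_MLP)
regression_params = count_parameters(REGRESSION_MLP)
print(check_chain(CHARACTER_MLP), check_chain(REGRESSION_MLP))
print(character_params, regression_params)
```

The $768$-dimensional output of the character MLP plus the $768$-dimensional synopsis embedding gives the $768+768$ input of the first layer in Table 6, so the two tables fit together as described.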

Benchmark methods

In order to evaluate whether the results from the deep neural network are good, the dataset is also benchmarked with traditional machine learning methods, as in [23]. For this purpose, all texts were vectorized through the Term Frequency-Inverse Document Frequency (TF-IDF) measure with the Scikit-Learn implementation [29], while images were converted to tensors through the PILtotensor method from the Python Imaging Library [20]. The resulting outputs were truncated to size $750$ and then concatenated into a $2250$-dimensional tensor, which was then fed to a simple MLP described in Table 5.

Layer Name	Type	Input Shape	Output Shape	Act. Funct.	Connects with
First	Linear	$750+750+750$	$1000$	TanH	Second
Second	Linear	$1000$	$500$	TanH	Third
Third	Linear	$500$	$250$	TanH	Fourth
Fourth	Linear	$250$	$100$	TanH	LAST
LAST	Linear	$100$	$1$	SoftMax	(logits)

Table 5: Specifications of the MLP for the traditional methods.

Experiments

In order to perform experiments, the dataset was split into training and test sets in a proportion of approximately $81:19$. The reason behind that specific ratio is that several animes can share main characters. For instance, Table 7 depicts seven of the fifty-four animes in which the main character Monkey D. Luffy (ID 40) appears, along with the appearances of four of his nakamas (companions): Roronoa Zoro (ID 62), Nami (ID 723), Usopp (ID 724) and Vinsmoke Sanji (ID 305). As a consequence, it is important to ensure that all animes with a shared main character belong to the same set, so that all test samples correspond to totally unseen data.

Layer Name	Type	Input Shape	Output Shape	Act. Funct.	Connects with
First	Dropout $(0.1)$	$768+768$	$768+768$	–	Second
Second	Linear	$768+768$	$768$	TanH	Third
Third	Linear	$768$	$384$	TanH	Fourth
Fourth	Linear	$384$	$192$	TanH	Fifth
Fifth	Linear	$192$	$96$	TanH	Sixth
Sixth	Linear	$96$	$48$	ReLU	Seventh
Seventh	Linear	$48$	$24$	ReLU	Eighth
Eighth	Linear	$24$	$12$	ReLU	Ninth
Ninth	Linear	$12$	$6$	ReLU	Tenth
Tenth	Linear	$6$	$3$	ReLU	LAST
LAST	Linear	$3$	$1$	SoftMax	(logits)

Table 6: Specifications of the larger MLP used for regression.

Anime (with ID)	Luffy	Zoro	Nami	Usopp	Sanji
One Piece (21)	X	X	X	X	X
One Piece Film: Red (50410)	X	X	X	X	X
One Piece Movie 14: Stampede (38,234)	X	X	X	X	X
One Piece: Taose! Kaizoku Ganzack (466)	X	X	X	-	-
One Piece 3D2Y: … (25,161)	X	-	-	-	-
One Piece: Romance Dawn Story (5,252)	X	-	-	-	-
One Piece: Cry Heart (22,661)	X	-	-	-	-

Table 7: An example of main characters who appear across several animes (as main characters). The table shows seven animes (out of 54) in which at least one crewmate of Monkey D. Luffy (joined during the East Blue saga) appears.

Train-Test splitting

In order to make the custom split, all animes were grouped into clusters according to their shared main characters: if two animes shared a character, they were placed in the same cluster. It is worth noting that a cluster may contain two animes with no shared characters, as long as a chain of animes that share characters connects them. For that reason, the following iterative algorithm was employed to generate the clusters, obtaining $4,089$ clusters (out of $7,784$ samples):

• For each anime, get all the other animes that share a main character with it. Assign a number (cluster name) to all of them.
• Make a second pass over the whole dataset. This time, if two animes share a cluster name, assign a new cluster name to both of them.
• Repeat this procedure until every anime is associated with a single cluster name.
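The steps above amount to finding the connected components of a graph whose nodes are animes and whose edges link animes with a common main character. The following sketch uses a union-find structure instead of repeated relabelling passes; it is a swapped-in technique, not the authors' script. The function name is hypothetical and the toy input mixes IDs from Tables 3 and 7 with invented entries `"A"`, `"B"` and `"C"`.

```python
def cluster_by_shared_characters(anime_to_chars):
    """Group animes that are linked, directly or through a chain, by shared characters."""
    parent = {anime: anime for anime in anime_to_chars}

    def find(a):
        # Follow parent links to the cluster representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_anime_with = {}  # character -> first anime seen with that character
    for anime, chars in anime_to_chars.items():
        for char in chars:
            if char in first_anime_with:
                union(first_anime_with[char], anime)
            else:
                first_anime_with[char] = anime

    clusters = {}
    for anime in anime_to_chars:
        clusters.setdefault(find(anime), set()).add(anime)
    return list(clusters.values())


toy = {
    21: {40, 62, 723, 724, 305},  # One Piece
    466: {40, 62, 723},           # One Piece: Taose! Kaizoku Ganzack
    25161: {40},                  # One Piece 3D2Y
    7: {299, 300, 301},           # Witch Hunter Robin (subset of its characters)
    "A": {"x"},                   # invented chain: A and C share nothing directly,
    "B": {"x", "y"},              # but B connects them
    "C": {"y"},
}
clusters = cluster_by_shared_characters(toy)
print(clusters)
```

In the toy input, `"A"` and `"C"` end up in the same cluster even though they share no character, which is the chain situation described before the list.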

Finally, the train-test split was performed randomly, but each cluster was completely contained within a single split. The training set has $6,345$ samples ($81.5\%$) while the test set has $1,439$ samples ($18.5\%$). Despite the random splitting, both sets have very similar statistics, as evidenced in Table 8 and Figure 3 for the synopsis wordcount, and in Table 9 and Figure 4 for the character wordcount.
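A random split that never breaks a cluster can be sketched as follows. This is only one possible procedure under stated assumptions: the function name, the greedy filling rule, the toy clusters and the use of seed $42$ for the split are all hypothetical, since the text does not specify how the random assignment was done.

```python
import random


def split_by_cluster(clusters, test_fraction=0.185, seed=42):
    """Randomly assign whole clusters to the test set until the target size is reached."""
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    total = sum(len(cluster) for cluster in order)
    train, test = [], []
    for cluster in order:
        # A cluster goes to the test set only if it fits entirely.
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Invented clusters of sizes 1 to 4 (100 samples in total).
toy_clusters = [[f"anime_{i}_{j}" for j in range(1 + i % 4)] for i in range(40)]
train, test = split_by_cluster(toy_clusters)
print(len(train), len(test))
```

Because clusters have different sizes, the resulting proportion only approximates the target, which is consistent with the split being close to, but not exactly, a round ratio.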

Table 8: Statistics for the training and test set.

	Training Set
Score	Samples	Max. Words	Min. Words	Avg. Words
$1-2$	$1\ (0.01\%)$	$55$	$55$	$55$
$2-3$	$5\ (0.07\%)$	$228$	$25$	$124.6$
$3-4$	$9\ (0.14\%)$	$166$	$24$	$82.22$
$4-5$	$64\ (1\%)$	$244$	$25$	$75.53$
$5-6$	$749\ (11.87\%)$	$329$	$24$	$81.4$
$6-7$	$2,539\ (40\%)$	$384$	$24$	$97.54$
$7-8$	$2,487\ (39.19\%)$	$581$	$25$	$121.14$
$8-9$	$483\ (7.6\%)$	$340$	$29$	$151.38$
$9-10$	$8\ (0.12\%)$	$188$	$134$	$157.75$
Total	$6,345\ (100\%)$	$581$	$24$	$105.17$
	Test Set
$1-2$	$1\ (0.07\%)$	$68$	$68$	$68$
$2-3$	$1\ (0.07\%)$	$52$	$52$	$52$
$3-4$	$2\ (0.14\%)$	$87$	$61$	$74$
$4-5$	$16\ (1.11\%)$	$202$	$27$	$92.31$
$5-6$	$176\ (12.23\%)$	$254$	$26$	$92.03$
$6-7$	$592\ (41.13\%)$	$397$	$25$	$96.56$
$7-8$	$534\ (37.10\%)$	$277$	$24$	$122.07$
$8-9$	$113\ (7.88\%)$	$234$	$35$	$142.39$
$9-10$	$4\ (0.27\%)$	$189$	$133$	$156.25$
Total	$1,439\ (100\%)$	$397$	$24$	$99.51$

Table 9: Statistics for the characters.

Split	Total Characters	Max. Words	Min. Words	Avg. Words
Train Set	$14,168$	$3,551$	$4$	$120.51$
Test Set	$3,288$	$2,274$	$4$	$124.84$
Total	$14,682$	$3,551$	$4$	$121.34$

Experimental Setup

All neural networks were implemented with PyTorch 1.10.1 [30] and run on an NVIDIA Quadro RTX 6000 GPU with 46 GB of VRAM. The three-input deep neural network used the Huggingface models [31, 32]. Five experiments were conducted: the benchmark with the traditional vectorizations (Trad) and the three-input deep neural network with all inputs (Full), with only synopsis (GPT-2 + MLP), with only portraits (ResNet-50 + MLP), and with only descriptions (GPT-2 + MLP). For synopsis and descriptions (text inputs), the large MLP depicted in Table 6 was modified by removing the First layer. For portraits (image inputs), the small MLP depicted in Table 4 served as the base, with two additional linear layers with ReLU activation functions and output shapes of $384$ and $1$ respectively, added for regression purposes. Finally, for the traditional methods, the MLP depicted in Table 5 was used. All experiments used the hyperparameters depicted in Table 10, except for the number of epochs: the Trad experiment ran for $30$ epochs, while the others ran for $5$. Additionally, the scores were scaled to range between $0$ and $1$, with $0$ assigned to the minimum score and $1$ to the maximum.
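The score scaling described at the end of the setup is a min-max normalization. A minimal sketch is shown below; the helper names are hypothetical, and the example scores combine values that appear elsewhere in the text (the dataset minimum $1.86$ and maximum $9.06$, the default score $6.605$ and the score $7.25$ from Table 3).

```python
def minmax_scale(scores):
    """Map scores linearly so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scores must contain at least two distinct values")
    return [(score - lo) / (hi - lo) for score in scores]


def unscale(value, lo, hi):
    """Map a scaled prediction back to the original score range."""
    return lo + value * (hi - lo)


scores = [1.86, 6.605, 7.25, 9.06]
scaled = minmax_scale(scores)
print([round(value, 3) for value in scaled])
print(round(unscale(scaled[2], 1.86, 9.06), 2))
```

Predictions produced on the scaled range can be mapped back with `unscale` when they need to be read as MAL scores.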

Results

Results are depicted in Table 11. Learning curves for each experiment can be found in Figure 6. As anticipated in the Introduction, the best model was Full, with the best metrics and the best fit on the learning curve, showing neither overfitting nor underfitting after the first epoch. All non-traditional experiments outperformed the benchmark MSE, and only the image-only model failed to surpass the benchmark correlation coefficients.

Parameter	Value
Seed	$42$ .
Synopsis Pretrained Model	’GPT2’.
Char. Desc. Pretrained Model	’GPT2’.
Image Pretrained Model	’microsoft/resnet-50’.
Synopsis Tokenizer Max. Length	$128$ (GPT2tokenizer).
Char. Desc. Tokenizer Max. Length	$256$ (GPT2tokenizer).
Image Processors Parameters	Default (AutoImageProcessor).
Optimizer	Adam with weight decay (AdamW).
Opt. Learning Rate	$5e-2$ .
Opt. Epsilon param	$1e-8$ .
Loss Function	Mean Squared Error (MSEloss).
Batch Size	$16$ .
Epochs	$5$ ( $30$ for traditional vectorizations).

Table 10: All employed parameters in the experiments.

Popularity Prediction
Architecture	Input	MSE	Spearman	Pearson	Kendall’s Tau
Full	All	$0.011$	$\mathbf{0.431}$	$\mathbf{0.436}$	$0.297$
GPT-2+MLP	Syn.	$0.012$	$\mathbf{0.338}$	$\mathbf{0.328}$	$0.230$
GPT-2+MLP	Char. (Desc.)	$0.012$	$\mathbf{0.307}$	$\mathbf{0.341}$	$0.210$
ResNET-50+MLP	Char. (Img.)	$0.028$	$0.096$	$0.121$	$0.065$
Trad (benchmark)	All	$0.412$	$0.195$	$0.183$	$0.130$

Table 11: All experiments, sorted by mean squared error (MSE): the lower the value, the better the result. Moderate correlations are highlighted in bold.

Discussion

Recall that the MSE is calculated from the squared differences between the real and the predicted scores. Hence, a value closer to $0$ is better, as long as the learning curve shows neither overfitting nor underfitting.
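To give the reported values a more tangible reading, the sketch below computes the MSE and converts a scaled MSE into a root mean squared error expressed in MAL score points. It assumes that the reported MSE values were computed on the scores scaled to $[0, 1]$, which the text suggests but does not state explicitly; the function names are hypothetical.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between real and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Maximum minus minimum score reported for the dataset.
SCORE_RANGE = 9.06 - 1.86


def rmse_in_score_points(mse_scaled, score_range=SCORE_RANGE):
    """Root of a scaled MSE, mapped back to the original score units (assumption)."""
    return math.sqrt(mse_scaled) * score_range


full_rmse = rmse_in_score_points(0.011)  # Full model
trad_rmse = rmse_in_score_points(0.412)  # Trad benchmark
print(round(full_rmse, 2), round(trad_rmse, 2))
```

Under that assumption, the Full model would be off by roughly three quarters of a score point on average in the root-mean-square sense, while the benchmark would be off by several points.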

As evidenced in Figure 6, all models with partial inputs showed underfitting, with the image-only model (the smallest one) being the most critical case. Recall that underfitting happens when the model is not big enough to capture the complexity of the dataset. Hence, these learning curves support the hypothesis stated in the introduction about the need for larger, deeper and more complex models. This is also promising for the use of LLMs such as Llama2 [33] or GPT4 [34].

Correlations

According to [35], correlation values can be interpreted with the following rule of thumb: $0$ to $0.29$ indicates a small correlation, $0.30$ to $0.49$ indicates a moderate correlation and $0.50$ to $1$ indicates a strong correlation. At a finer grain, the authors of [36] used another rule of thumb for the Spearman correlation coefficient: $0$ to $0.19$: very weak; $0.2$ to $0.39$: weak; $0.4$ to $0.59$: moderate; $0.6$ to $0.79$: strong; $0.8$ to $1$: very strong. In this work, the following hybrid rule of thumb is considered:

• $0-0.19$ for a very weak correlation.
• $0.2-0.29$ for a weak correlation.
• $0.3-0.49$ for a moderate correlation.
• $0.5-1$ for a strong correlation.

All the inputs together showed a moderate correlation with the MAL score. Images showed the weakest correlation, further supporting the need for larger models. Not surprisingly, the text-only inputs showed a similar moderate correlation.
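The hybrid rule of thumb can be written as a small lookup. In this sketch the function name is hypothetical, and the exact thresholds ($0.2$, $0.3$ and $0.5$ as lower bounds) are an assumption used to close the gaps between the listed intervals.

```python
def correlation_strength(coefficient):
    """Label a correlation coefficient with the hybrid rule of thumb."""
    magnitude = abs(coefficient)
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    return "strong"


# Values taken from Table 11.
table_11 = {
    "Full, Spearman": 0.431,
    "Full, Kendall's Tau": 0.297,
    "ResNet-50 + MLP, Pearson": 0.121,
    "Trad, Spearman": 0.195,
}
for name, value in table_11.items():
    print(name, value, correlation_strength(value))
```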

A finer-grained analysis of each metric shows moderate Pearson and Spearman correlations for text-based inputs, which suggests that the latent space generated by the larger MLP was moderately successful in linearizing the problem and that more complex and richer models may linearize it completely. Adding images significantly strengthens this linear relationship (adding about $0.1$ points), showing that the visual information of main characters may be crucial for popularity. However, it is noteworthy that images by themselves had a very weak correlation, underscoring the essential role of text descriptions. Spearman, which behaves similarly, also adds ranking information.
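The difference between the linear (Pearson) and rank-based (Spearman, Kendall) coefficients can be seen with a standard-library sketch. The text does not say which implementation was used for the reported values, so the functions below are illustrative; the Kendall variant shown is tau-a, which ignores tie corrections, and the data is invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Linear correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Concordant minus discordant pairs, divided by the number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


# Invented monotone but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(pearson(x, y), spearman(x, y), kendall_tau_a(x, y))
```

For this monotone but non-linear toy relationship the rank-based coefficients are maximal while Pearson is not, which illustrates why Pearson is read above as a measure of how far the model linearized the problem.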

Finally, the Kendall’s Tau correlation helps to better understand the dataset as a whole. It is noteworthy that the minimum score in the dataset is $1.86$ while the maximum is $9.06$. The significantly lower results for this metric suggest that these outliers have a considerable impact on the training process. The statistical difference between these outliers is evident in Figure 2, as well as in Table 1, where the two samples with scores between $1$ and $2$ have $55$ to $68$ words, while the twelve samples with scores between $9$ and $10$ have $133$ to $189$ words. These findings encourage the use of class weights in the logits to address such score outliers and alleviate related problems. However, this measure might be artificial and can lead to unexpected biases, as it compensates for a weakness in the dataset, so it was not implemented in this work.

A Memory Constraint with Transformer-based Models

Although the correlation metrics showed that main character descriptions have a considerable correlation with the labels, the proposed GPT-2 model presents the second most pronounced underfitting, only behind the main character portraits. Table 10 shows that the maximum length for the tokenizer of main character descriptions is $256$ . However, some of these descriptions, when concatenated, can surpass $1,000$ words, as shown in the example in Table 3. Consequently, this underfitting can be explained by the loss of crucial information.
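The size of this loss can be estimated for the example in Table 3. The sketch below assumes that the descriptions are concatenated in the order of the table, that truncation removes text from the end, and that each word costs at least one token; since a subword tokenizer such as the one used by GPT-2 can split a word into several tokens, the figures are optimistic upper bounds. The function name is hypothetical.

```python
def words_kept_per_character(word_counts, budget=256):
    """Words of each description that fit in the budget, filling in order."""
    kept = []
    for count in word_counts:
        take = min(count, budget)
        kept.append(take)
        budget -= take
    return kept


# Wordcounts of the six character descriptions in Table 3.
table_3_words = [144, 412, 133, 167, 73, 205]

kept = words_kept_per_character(table_3_words)
retained = sum(kept) / sum(table_3_words)
print(kept, round(retained, 3))
```

Under these assumptions, less than a quarter of the text survives and the last four characters contribute nothing to the embedding.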

Due to the memory limitations of transformer-based models, it is not trivial to increase the tokenizer size. Therefore, further work should explore more memory-efficient methods capable of capturing all information with similar learning capabilities (e.g., Mamba [37]) or find other alternatives to include all main character information simultaneously (e.g., generating mean tensors and assessing how much relevant information is truly lost from doing that).
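The mean-tensor alternative can be sketched with plain lists standing in for the $768$-dimensional embeddings. The function name and the two-dimensional toy vectors are hypothetical; the point of the example is the kind of information that averaging discards.

```python
def mean_pool(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy embeddings for the characters of two hypothetical animes.
contrasting_cast = [[0.0, 0.0], [2.0, 2.0]]
uniform_cast = [[1.0, 1.0], [1.0, 1.0]]

print(mean_pool(contrasting_cast))
print(mean_pool(uniform_cast))
```

Both casts collapse to the same pooled vector, so a model fed with the mean cannot tell a cast of very different characters from a cast of similar ones. Measuring how much this matters for the score is the open question mentioned above.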

Conclusions

In this paper, one of the most robust free datasets for anime popularity prediction is introduced, relying solely on freely available internet data. To explore this dataset and propose a model for solving the task, a deep neural network leveraging GPT-2 and ResNET-50 was developed. The best MSE achieved was $0.011$ , which is significantly lower than the benchmark MSE of $0.412$ .

Moderate correlations were achieved with the Spearman and Pearson metrics, while a low correlation with Kendall’s Tau revealed important quirks in the dataset, such as the marginal representation of animes with extreme scores. These issues should be taken into account when designing future models to solve this task. They also align with the claims in [17] that recommender systems can be significantly enhanced by solving the sparsity problem in data, which is also present in non-free datasets.

The correlation coefficients also demonstrated that adding character portraits significantly enhances text-based inputs, even when embedded with a relatively small model such as ResNET-50.

Further work can be summarized as follows:

• The memory problem with main character descriptions: truncating main character descriptions to $256$ tokens significantly impacts the results due to the information lost, as reflected in the underfitting. This problem, closely related to the memory constraints of transformer-based models, can be approached in two ways:
  – Experimenting with language models with lower memory requirements, e.g. n-gram vectorizations or selective state spaces (e.g. Mamba).
  – Adjusting the input to embed each main character individually. These individual tensors can then be processed in a myriad of ways, such as averaging them. The information lost in such processes is an important topic to be explored.
• Larger models for processing the inputs: GPT-2 can be upgraded to GPT-4 or Llama2, and ResNet-50 can be replaced by its bigger brother ResNet-152, a very deep pretrained CNN, or by a vision transformer such as Image-GPT. The moderate Pearson and Spearman coefficients indicate that leveraging larger models will significantly benefit the results.

Acknowledgments

The work was done with partial support from the Mexican Government through the grant A1-S-47854 of CONACYT, Mexico, and grants 20241816, 20241819, and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

References

[1] Box office of Dragon Ball Z: Bio-Broly. https://en.wikipedia.org/wiki/Dragon_Ball_Z:_Bio-Broly.
[2] Budget of Dragon Ball Z: Bio-Broly. https://jump.fandom.com/wiki/Dragon_Ball_Z:_Bio-Broly.
[3] Oricon 2018 ranking. Available at: oricon.co.jp/confidence/special/52166/7/.
[4] Guinness record of Kimetsu no Yaiba. Available at: guinnessworldrecords.com/news/2023/5/fantasy-anime-demon-slayer-makes-history-with-incredible-box-office-record-750723/.
[5] HEATHENS’ Kickstarter (snapshot taken on May 9, 2024). Available at: web.archive.org/web/20240508180948/www.kickstarter.com/projects/heathens/heathens-indie-animated-pilot.
[6] Syd Field. Screenplay: The Foundations of Screenwriting. A Delta book. Delta Trade Paperbacks, 2005.
[7] Hideo Watanabe. Directing at anime/animation studios: Techniques and methods. Archiving Movements: Short Essays on Anime and Visual Media Materials V.2, page 4, 2020.
[8] Faye Valentine’s original sketches. Available at: https://www.youtube.com/watch?v=CaXqcLNvnLk.
[9] Jesus Armenta-Segura, César Jesús Núñez-Prado, Grigori Olegovich Sidorov, Alexander Gelbukh, and Rodrigo Francisco Román-Godínez. Ometeotl@ multimodal hate speech event detection 2023: Hate speech and text-image correlation detection in real life memes using pre-trained BERT models over text. In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, pages 53–59, 2023.
[10] Suraj Maharjan, Sudipta Kar, Manuel Montes, Fabio A. González, and Thamar Solorio. Letting emotions flow: Success prediction by modeling the flow of emotions in books. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 259–265, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[11] Joyeta Sharma and Abu Nowshed Chy. Exploiting web snippets for multi-label anime genre prediction. In 2021 IEEE 18th India Council International Conference (INDICON), pages 1–6, 2021.
[12] Satyendra Kumar Sharma, Swapnajit Chakraborti, and Tanaya Jha. Analysis of book sales prediction at amazon marketplace in india: a machine learning approach. Information Systems and e-Business Management, 17:261–284, 2019.
[13] Ruddy Théodose and Jean-Christophe Burie. KangaiSet: A dataset for visual emotion recognition on manga. In International Conference on Document Analysis and Recognition, pages 120–134. Springer, 2023.
[14] Xindi Wang, Burcu Yucesoy, Onur Varol, Tina Eliassi-Rad, and Albert-László Barabási. Success in books: predicting book sales before publication. EPJ Data Science, 8(1):1–20, 2019.
[15] Xing Wang, Shouhua Zhang, and Ivan Smetannikov. Fiction popularity prediction based on emotion analysis. In Proceedings of the 2020 1st International Conference on Control, Robotics and Intelligent System, pages 169–175, 2020.
[16] Francesco Ricci, Lior Rokach, and Bracha Shapira. Recommender Systems: Techniques, Applications, and Challenges, pages 1–35. Springer US, New York, NY, 2022.
[17] Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Jundong Li, and Zi Huang. Self-supervised learning for recommender systems: A survey. IEEE Transactions on Knowledge and Data Engineering, 36(1):335–355, 2024.
[18] Sonu Airen and Jitendra Agrawal. Movie recommender system using parameter tuning of user and movie neighbourhood via co-clustering. Procedia Computer Science, 218:1176–1183, 2023. International Conference on Machine Learning and Data Engineering.
[19] Webpage of Anilist. Available at: https://anilist.co/.
[20] Python Imaging Library (Pillow’s Fork). Available at: https://python-pillow.org/.
[21] Webpage of My Anime List. Available at: https://myanimelist.net/.
[22] You Jin Kim, Yun Gyung Cheong, and Jung Hoon Lee. Prediction of a movie’s success from plot summaries using deep learning models. In Francis Ferraro, Ting-Hao ‘Kenneth’ Huang, Stephanie M. Lukin, and Margaret Mitchell, editors, Proceedings of the Second Workshop on Storytelling, pages 127–135, Florence, Italy, August 2019. Association for Computational Linguistics.
[23] Jesús Armenta-Segura and Grigori Sidorov. Anime Success Prediction Based on Synopsis Using Traditional Classifiers. In Proceedings of Congreso Mexicano de Inteligencia Artificial, COMIA, 2023.
[24] Jung-Hoon Lee, You-Jin Kim, and Yun-Gyung Cheong. Predicting quality and popularity of a movie from plot summary and character description using contextualized word embeddings. In 2020 IEEE Conference on Games (CoG), pages 214–220, 2020.
[25] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[27] Leonard Richardson. Beautiful Soup 4, 2007. Documentation available at: https://crummy.com/software/BeautifulSoup/.
[28] MAL’s weighted average score defined. Available at: https://myanimelist.net/info.php?go=topanime.
[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[31] Huggingface implementation of GPT-2. Available at: https://huggingface.co/gpt2.
[32] Huggingface implementation of ResNet-50. Available at: https://huggingface.co/microsoft/resnet-50.
[33] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
[34] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[35] Jacob Cohen. Statistical power analysis. Current directions in psychological science, 1(3), 1992.
[36] Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif M. Mohammad. What makes sentences semantically related? a textual relatedness dataset and empirical study. ArXiv, abs/2110.04845, 2021.
[37] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Synopsis (with wordcount)
Robin Sena is a powerful craft user drafted into the STNJ —a group of specialized hunters that
fight deadly beings known as Witches. Though her fire power is great, she’s got a lot to learn
about her powers and working with her cool and aloof partner, Amon. But the truth about the
Witches and herself will leave Robin on an entirely new path that she never expected! (66 words)
Main Characters
Name (with MAL ID)	Description (with wordcount)
Robin Sena (299)	Robin Sena is a soft-spoken 15-year-old girl with $\dots$ (144 words)
Amon (300)	Amon is a Hunter and is also Robin’s partner. (412 words)
Michael Lee (301)	Michael is a hacker and the technical support $\dots$ (133 words)
Haruto Sakaki (302)	Haruto Sakaki is an 18-year-old Hunter working with the $\dots$ (167 words)
Miho Karasuma (303)	The second in command. Miho is a 19-year-old hunter $\dots$ (73 words)
Yurika Doujima (304)	Doujima is portrayed as carefree, lazy, vain, and $\dots$ (205 words)
Total wordcount (character descriptions): 1,134
Concatenated Portraits