TextAge: A Curated and Diverse Text Dataset for Age Classification

Shravan Cheekati
[email protected]
   Mridul Gupta
[email protected]
   Vibha Raghu
[email protected]
   Pranav Raj
[email protected]

1 Introduction

In recent years, large language models (LLMs) and language-classification models have garnered significant attention due to advancements in Generative Deep Learning and Natural Language Processing abilities. However, these state-of-the-art (SOTA) models often follow a one-size-fits-all approach, generating and classifying text irrespective of the age of the speaker. To close the gap between human-generated and model-generated text, it is crucial to understand the differences in text based on the age of the producer. Moreover, the benefits of classifying text into an age group or producing text based on an age group are widespread.

Personalized services, such as chatbots, educational tools, healthcare services, and targeted advertisements, will be able to provide information in a way that is best understood by the target age group. The speaking or writing levels of young kids can be evaluated based on how similar their outputs are to that of their peers, or in other words, how likely their produced output belongs to their age group. A similar process can be followed by employers who want to test the language abilities of their employees. Finally, an application we were especially motivated by is Content Moderation. As kids under the age of 13 are rightfully restricted from social media usage, such an ability would be able to detect underage users on social media platforms worldwide.

We hypothesize that the current lack of this ability to classify text into an age group is due to issues surrounding relevant datasets, rather than insufficient model architectures since SOTA models are able to perform much more complex tasks. For example, to develop an underage detection model, it is crucial to have sufficient data from a younger age group, something that current datasets lack. Hence, we build a comprehensive and diverse dataset, TextAge, that maps a sentence to the age of the producer, age group of the producer, and whether the producer is underage (under 13). As opposed to existing datasets consisting of solely written text, on top of this, our focus was on collecting spoken data from our data sources. In section 2, we elaborate on Related Works. In Section 3, we detail our data collection, cleaning, and labeling process. In Section 4, we apply our dataset to tackle the task of Underage Detection, along with Generation Classification to test the general ability of SOTA models in classifying text by age group.

2 Related Works

Several dataset collections have limitations on their age ranges, motivating us to explore a variety of data sources. We were inspired by the CommonVoice dataset [1] which consists of speech data from various age groups, but it is limited to individuals over 20 years old and focused on audio classification. The CommonVoice dataset provides sentences for the participants to recite and collects their audio file, hence there is no focus on the context of their words but rather on auditory features. Our dataset is constructed from sources such as CHILDES, Meta, and the TV show ”Survivor” to cover a wide range of ages with a focus on context-based speech transcripts and sentences to effectively classify age.

Additional studies on how age can be determined from text can be studied such as Pennebaker and Stone [7] who used text samples from writers of different ages to study language changes across the lifespan, finding that older individuals used more positive and future-oriented language. Schwartz et al. [9] analyzed social media data to investigate how language use varies with age, discovering distinct linguistic patterns for different age groups. Nguyen et al. [5] used a large corpus of online forum posts (ex: Twitter/X) to study age-related language variation, identifying key linguistic features that change with age. These studies highlight the growing interest in understanding language patterns across age groups and the need for a comprehensive dataset to support this research.

Pentel [8] conducted a similar study in detecting age based on short 100-word texts written by children and adults in Estonian. Logistic Regression, Support Vector Machines, C4.5, k-Nearest Neighbor, Naïve Bayes, and Adaboost algorithms were compared against each other in performance, however the study was limited to written text rather than spoken transcripts. Blandin et al. [2] conducts a similar study that classifies the age at which a text is understood by a person, focusing more on reading comprehension rather than any speaking or casual conversation data points. Our contributions involve incorporating diverse data sources into a corpus to cover a wider range of ages and language styles to provide a structured dataset for studying age-related language patterns. This is crucial for advancing research in language development, age-related language patterns, and language-based age prediction.

3 Methods

3.1 Data Collection

CHILDES databank
To gather conversation data from a younger age group, we selected the CHILDES dataset which focuses specifically on child language development. This dataset was particularly beneficial because of its comprehensive collection of speech data from children. This data is spontaneous and already transcribed. Furthermore, variations of this dataset is well-established in linguistic research studies ensuring reliability of speech data and resting any ethical concerns regarding the use of child data since the transcribed data is completely anonymous.

Initially, we accessed the CHILDES database to retrieve text files containing transcriptions of child speech, focusing on those that met specific criteria related to age and language. Using Python scripts, we then extracted lines specifically attributed to the child speaker, labeled as ”CHI” in the transcripts.
Meta Casual Conversations Dataset
A primary source of a large number of data points originate from the Casual Conversation dataset composing of human annotated transcripts and audio files from over 3000 participants above the age of 18 [3]. Age, gender, apparent skin types, and lighting condition labels were provided by participants. The dataset’s intended use is for audio or visual classification as experimented in the DeepFake detection challenge. However, we utilize the transcript data to collect sentences from a wide range of ages.
Poki Poems-by-kids
This dataset consists of written poems written by grade levels 1 to 12 [4]. The lemmatized version of the data was used in order to maintain better normalization of words. Age was determined by taking the lower age range of the input grade. This makes it safer for the impactful use case of online underage detection e.g. a 7th grader would be around 12 to 13 years old so all 7th graders would be assumed to be 12 years old in which case they should not be on social media platforms or other online age restricted platforms.
JUSThink Dialogue and Actions Corpus
The JUSThink Dialogue and Actions Corpus (JUSThink) consists of a case study of a robot-mediated human-human collaborative learning activity named JUSThink where youth aged 9 through 12 are recorded and transcripted in their attempt to solve two graph related problems [6]. The specific ages are unable for this dataset so the average floored of 9 to 12 is taken, so the entire dataset of points map to 10 year olds. Since the entire group is under the 13 year old underage cutoff, we determined taking the average would be a better representative instead of flooring the age range to 9.
Survivor
As opposed to scra** from a scripted TV show, we ensured that we added to our dataset from a Reality TV show to collect authentic and real sentences spoken by the contestants and the host. We specifically chose Survivor as we found organized transcripts for each episode for the first 40 seasons of the show, resulting in around 160000 entries. For the age label, we manually found the age of each of the  20 contestants per season.

3.2 Data Cleaning

The data cleaning process was important in preparing the respective data sources for analysis, focusing on enhancing data quality and consistency. Initially, transcripts were retrieved as raw text and required significant cleaning to remove transcription errors, non-verbal cues, and other irrelevant metadata. To address this we developed cleaning scripts using Python, utilizing the Pandas and regex library. We used regular expressions to strip out non-alphabetic characters, correct punctuation misplacements, remove placeholder text, remove time stamps, and other annotation symbols. Additionally, any sentences reduced to single words or non-meaningful fragments were discarded to maintain analytical relevance. The cleaned data was then structured into two primary columns, ’Sentence’ and ’Age’, and merged from multiple sources into a single CSV file, ensuring a uniform dataset for subsequent analysis.

3.3 Data Exploration

On top of the two primary columns, we added two additional columns to our dataset: underage (True or False) and age-group (kids, teens, twenties, thirties, fourties, fifties, sixties, seventies, eighties). The motivation behind this was to prepare our dataset for potential classification tasks. We then performed an extensive Data Exploration process, where we analyzed surface-level differences (average length of sentences, number of unique words, etc.) between age groups. We also created Word Clouds to help in a visual understanding of our massive dataset. Figure 1 and 2 below are key results from our Data Exploration.

Refer to caption
Figure 1: Distribution by Age Group of our Final Dataset. Highlights less data for teens and older age groups.
Refer to caption
Figure 2: Visualizes how each source contributes to our final dataset of 608082 rows. Count represents number of entries per source before cleaning.

4 Applications

Refer to caption
(a) XLNet train loss (binary)
Refer to caption
(b) RoBERTa train loss (binary)
Refer to caption
(c) XLNet vs RoBERTa val acc. (binary)
Refer to caption
(d) XLNet train loss (multiclass)
Refer to caption
(e) RoBERTa train loss (multiclass)
Refer to caption
(f) XLNet vs RoBERTa val acc. (multiclass)
Figure 3: XLNet and RoBERTa models performance metrics on binary and multiclass tasks.

We present two applications of our age-diverse language dataset: Underage Detection and Generational Classification. These tasks demonstrate the utility of our dataset in identifying age-related linguistic patterns and showcase its potential for various real-world applications.

Underage Detection is a binary classification task that aims to differentiate between language patterns characteristic of minors (individuals under the age of 13) and young-adults and over (individuals aged 13 and above). This task has important implications in online safety, content moderation, and age-appropriate communication. By accurately identifying the age group of a speaker or writer based on their language use, online platforms can better protect minors from inappropriate content and interactions.

Generational Classification is a multiclass classification task that seeks to categorize language patterns into different generational age groups, such as kids, teens, twenties, thirties, and so on. Understanding the linguistic differences between age groups can provide valuable insights into language development, social trends, and age-related preferences. This information can be applied in various domains, including targeted marketing, product design, and social research.

Model Binary Multiclass
Naive Bayes 0.8798 0.3505
Roberta 0.9640 0.5462
XLNet 0.9599 0.5422
Table 1: F1 Scores for both underage and generation detection

4.1 Underage/Of-Age Detection

We trained three models for the Underage Detection task: a Naive Bayes classifier as a baseline, and fine-tuned RoBERTa and XLNet models. The Naive Bayes classifier served as a simple, probabilistic baseline for comparison. RoBERTa and XLNet, both state-of-the-art transformer-based models, were fine-tuned on our dataset to capture more complex linguistic patterns associated with age groups.

4.1.1 Results & Analysis

The fine-tuned RoBERTa model achieved the highest f1 score of 0.9640 and test accuracy of 0.9640, closely followed by the fine-tuned XLNet model with an f1 score of 0.9599 and test accuracy of 0.9599. Both transformer-based models significantly outperformed the Naive Bayes baseline, which obtained an f1 score of 0.8798 and test accuracy of 0.8799.
Figure 3 displays the training loss curves for the RoBERTa and XLNet models over the fine-tuning epochs. Both models exhibit a steady decrease in training loss, indicating effective learning and adaptation to the age-related linguistic patterns present in our dataset.

The results demonstrate the power of the fine-tuned RoBERTa and XLNet models over the Naive Bayes baseline for the Underage Detection task. The transformer-based models’ ability to capture complex linguistic patterns and contextual information contributed to their higher performance. The fine-tuning process allowed these models to adapt their pre-trained language understanding to the specific age-related patterns in our dataset. The strong performance of RoBERTa and XLNet highlights the potential of using advanced language models for accurate age group classification based on linguistic cues. This has significant implications for enhancing online safety measures, content moderation, and age-appropriate recommendations in various digital platforms.

4.2 Generation Classification

For the Generational Classification task, we employed the same three models as in the Underage Detection task. The methodology remained similar, with the models being trained to classify language patterns into different age groups.

4.2.1 Results & Analysis

The classification reports for the Naive Bayes, RoBERTa, and XLNet models are presented in the appendix [3]. The RoBERTa model achieved the highest overall f1 score of 0.5462, closely followed by the XLNet model with an f1 score of 0.5422. The Naive Bayes baseline performed significantly worse, with an f1 score of 0.3505.

Looking at the individual age groups, all three models performed exceptionally well in classifying the ”kids” group, with f1 scores above 0.9 for RoBERTa and XLNet. The performance on the ”teens” and ”twenties” groups was moderate, with f1 scores ranging from 0.19 to 0.58. For age groups above 30, the models’ performance varied, with the Naive Bayes classifier struggling across all older age groups, while RoBERTa and XLNet showed better performance for the ”thirties” and ”fourties” groups. However, the models struggled to classify the ”fifties,” ”sixties,” and ”seventies” age groups effectively. RoBERTa and XLNet achieved low f1 scores for these groups, and the Naive Bayes classifier performed poorly as well. It is worth noting that the support for these older age groups was relatively lower compared to the younger age groups and so there may have been a significant shortage of training data for these groups.

The Generational Classification task proved to be more challenging than the Underage Detection task, with lower f1 scores and accuracies achieved by the models, suggesting that capturing fine-grained linguistic differences between age groups is more complex. While the models performed strongly on the ”kids” group, indicating distinctive language patterns, performance decreased for older age groups, particularly ”fifties,” ”sixties,” and ”seventies,” likely due to limited data samples and less pronounced linguistic differences. To improve performance, future work could focus on collecting more diverse data for older age groups and exploring advanced modeling techniques. Despite the challenges, the fine-tuned RoBERTa and XLNet models demonstrated reasonable performance, outperforming the Naive Bayes baseline and highlighting the potential of advanced language models for granular age group classification.

5 Conclusion

As touched upon in the applications section, the use-cases of the dataset we curated are massive. In addition to classification tasks, our dataset can be used in Seq2Seq tasks where a given sentence is converted to text belonging to a certain age group. This would be specifically useful for targeted advertisements. While our current dataset is ready to use for a variety applications (reach out through email), we plan to expand our dataset to create an even more diverse and comprehensive corpus.

References

  • [1] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
  • [2] Alexis Blandin, Gwénolé Lecorvé, Delphine Battistelli, and Aline Etienne. Age recommendation for texts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1431–1439, 2020.
  • [3] Caner Hazirbas, Joanna Bitton, Brian Dolhansky, Jacqueline Pan, Albert Gordo, and Cristian Canton Ferrer. Towards measuring fairness in ai: the casual conversations dataset. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4(3):324–332, 2021.
  • [4] W. E Hipson and S M. Mohammad. Poki: A large dataset of poems by children. arXiv preprint arXiv:2004.06188, 2020.
  • [5] Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, and Theo Meder. ” how old do you think i am?” a study of language and age in twitter. In Proceedings of the international AAAI conference on Web and social media, volume 7, pages 439–448, 2013.
  • [6] Utku Norman, Tanvi Dinkar, Barbara Bruno, and Chloé Clavel. Studying alignment in a collaborative learning activity via automatic methods: The link between what we say and do. arXiv preprint arXiv:2104.04429, 2021.
  • [7] James W Pennebaker and Lori D Stone. Words of wisdom: language use over the life span. Journal of personality and social psychology, 85(2):291, 2003.
  • [8] Avar Pentel. Automatic age detection using text readability features. Workshops Proceedings of EDM, 2015.
  • [9] H Andrew Schwartz, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin EP Seligman, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 8(9):e73791, 2013.

6 Appendix

Model Class Precision Recall F1-Score
Naive Bayes Underage 0.87 0.90 0.88
Of Age 0.90 0.86 0.88
Roberta Underage 0.97 0.96 0.96
Of Age 0.96 0.97 0.96
XLNet Underage 0.97 0.95 0.96
Of Age 0.95 0.97 0.96
Table 2: Underage detection f1 score breakdown
Model Class Precision Recall F1-Score
Naive Bayes Kids 0.05 0.01 0.02
Teens 0.28 0.06 0.09
Twenties 0.36 0.29 0.32
Thirties 0.66 0.78 0.71
Fourties 0.15 0.01 0.01
Fifties 0.27 0.04 0.07
Sixties 0.35 0.17 0.23
Seventies 0.29 0.29 0.29
Eighties 0.33 0.59 0.43
Roberta Kids 0.89 0.98 0.93
Teens 0.66 0.11 0.19
Twenties 0.51 0.65 0.57
Thirties 0.49 0.39 0.44
Fourties 0.56 0.61 0.58
Fifties 0.50 0.28 0.36
Sixties 0.00 0.00 0.00
Seventies 0.00 0.00 0.00
XLNet Kids 0.88 0.97 0.92
Teens 0.64 0.12 0.19
Twenties 0.50 0.69 0.58
Thirties 0.49 0.35 0.41
Fourties 0.56 0.61 0.59
Fifties 0.54 0.27 0.36
Sixties 0.00 0.00 0.00
Seventies 0.00 0.00 0.00
Table 3: Generation detection F1 Score breakdown