Search | arXiv e-print repository

TextAge: A Curated and Diverse Text Dataset for Age Classification

Authors: Shravan Cheekati, Mridul Gupta, Vibha Raghu, Pranav Raj

Abstract: Age-related language patterns play a crucial role in understanding linguistic differences and developing age-appropriate communication strategies. However, the lack of comprehensive and diverse datasets has hindered the progress of research in this area. To address this issue, we present TextAge, a curated text dataset that maps sentences to the age and age group of the producer, as well as an underage (under 13) label. TextAge covers a wide range of ages and includes both spoken and written data from various sources such as CHILDES, Meta, Poki Poems-by-kids, JUSThink, and the TV show "Survivor." The dataset undergoes extensive cleaning and preprocessing to ensure data quality and consistency. We demonstrate the utility of TextAge through two applications: Underage Detection and Generational Classification. For Underage Detection, we train a Naive Bayes classifier, fine-tuned RoBERTa, and XLNet models to differentiate between language patterns of minors and young-adults and over. For Generational Classification, the models classify language patterns into different age groups (kids, teens, twenties, etc.). The models excel at classifying the "kids" group but struggle with older age groups, particularly "fifties," "sixties," and "seventies," likely due to limited data samples and less pronounced linguistic differences. TextAge offers a valuable resource for studying age-related language patterns and developing age-sensitive language models. The dataset's diverse composition and the promising results of the classification tasks highlight its potential for various applications, such as content moderation, targeted advertising, and age-appropriate communication. Future work aims to expand the dataset further and explore advanced modeling techniques to improve performance on older age groups.

Submitted 2 May, 2024; originally announced June 2024.

arXiv:2306.08955 [pdf, other]

A Comparison of Self-Supervised Pretraining Approaches for Predicting Disease Risk from Chest Radiograph Images

Authors: Yanru Chen, Michael T Lu, Vineet K Raghu

Abstract: Deep learning is the state-of-the-art for medical imaging tasks, but requires large, labeled datasets. For risk prediction, large datasets are rare since they require both imaging and follow-up (e.g., diagnosis codes). However, the release of publicly available imaging data with diagnostic labels presents an opportunity for self- and semi-supervised approaches to improve label efficiency for risk prediction. Though several studies have compared self-supervised approaches in natural image classification, object detection, and medical image interpretation, there is limited data on which approaches learn robust representations for risk prediction. We present a comparison of semi- and self-supervised learning to predict mortality risk using chest x-ray images. We find that a semi-supervised autoencoder outperforms contrastive and transfer learning in internal and external validation.

Submitted 15 June, 2023; originally announced June 2023.

Comments: 33 pages, 22 figures, Accepted for publication at MIDL 2023

arXiv:2106.10869 [pdf, other]

Coexistence of Multiple Magnetic Interactions in Oxygen Deficient V2O5 Nanoparticles

Authors: Tathagata Sarkar, Soumya Biswas, Sonali Kakkar, Appu Vengattoor Raghu, Chandan Bera, Vinayak B Kamble

Abstract: In this paper, we report the spin glass-like coexistence of Ferromagnetic (FM), paramagnetic (PM) and antiferromagnetic (AFM) orders in oxygen deficient V2O5 nanoparticles (NP). It has a chemical stoichiometry of nearly V2O4.65 and a bandgap of nearly 2.2 eV with gap states due to significant defect density. The temperature dependent electrical conductivity and thermopower measurements clearly demonstrate a polaronic conduction mechanism of small polaron hopping with hopping energy of about 0.112 eV. The V2O5 sample shows a strong field as well as temperature dependent magnetic behavior when measured using a sensitive SQUID magnetometer. It shows a positive magnetic susceptibility over the entire temperature range (2-350 K). The FC-ZFC data shows clear hysteresis indicating glassy behavior with dominant superparamagnetism (SPM). The Curie-Weiss fitting confirms the AFM to PM transition at nearly 280 K and the Curie constant yields 1.56 muB which is close to single electron moment. The small polaron formation, which arises due to oxygen vacancy defects compensated by charge defects of V4+, results in Magneto-Electronic Phase Separation (MEPS) and hence various magnetic exchanges, as predicted by first principle calculations. This is further revealed as strong hybridization of V bonded to oxygen vacancy VO and neighboring V5+ ions, resulting in net magnetic moment per vacancy (1.77 muB which is in good agreement with experiment). Besides, the rise in V4+ defects is found to show AFM component as observed from the calculations. Thus, the diversity in magnetism of undoped V2O5 has its origin in defect number density as well as their random distribution led to MEPS. This involves localized spins in polarons and their FM clusters on the PM background while, V4+ dimers within and across the ladder, inducing AFM interactions.

Submitted 26 August, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

arXiv:2005.02521 [pdf, other]

A Pipeline for Integrated Theory and Data-Driven Modeling of Genomic and Clinical Data

Authors: Vineet K Raghu, Xiaoyu Ge, Arun Balajee, Daniel J. Shirer, Isha Das, Panayiotis V. Benos, Panos K. Chrysanthis

Abstract: High throughput genome sequencing technologies such as RNA-Seq and Microarray have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand causes of disease and the effects of medical interventions, this data must be integrated with phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods that can infer relationships between these data types are required. In this work, we propose a pipeline for knowledge discovery from integrated genomic and clinical data. The pipeline begins with a novel variable selection method, and uses a probabilistic graphical model to understand the relationships between features in the data. We demonstrate how this pipeline can improve breast cancer outcome prediction models, and can provide a biologically interpretable view of sequencing data.

Submitted 5 May, 2020; originally announced May 2020.

Comments: 16 pages, 8 figures, Presented at the ACM SIGKDD Workshop on Data Mining in Bioinformatics (BioKDD, 2019)

ACM Class: J.3

arXiv:1803.03716 [pdf, other]

doi 10.1007/978-3-319-98923-5_8

TRAJEDI: Trajectory Dissimilarity

Authors: Pedram Gharani, Kenrick Fernande, Vineet Raghu

Abstract: The vast increase in our ability to obtain and store trajectory data necessitates trajectory analytics techniques to extract useful information from this data. Pair-wise distance functions are a foundation building block for common operations on trajectory datasets including constrained SELECT queries, k-nearest neighbors, and similarity and diversity algorithms. The accuracy and performance of these operations depend heavily on the speed and accuracy of the underlying trajectory distance function, which is in turn affected by trajectory calibration. Current methods either require calibrated data, or perform calibration of the entire relevant dataset first, which is expensive and time consuming for large datasets. We present TRAJEDI, a calibration-aware pair-wise distance calculation scheme that outperforms naive approaches while preserving accuracy. We also provide analyses of parameter tuning to trade-off between speed and accuracy. Our scheme is usable with any diversity, similarity or k-nearest neighbor algorithm.

Submitted 9 March, 2018; originally announced March 2018.

Showing 1–5 of 5 results for author: Raghu, V