-
Demographics in Social Media Data for Public Health Research: Does it matter?
Authors:
Nina Cesare,
Christan Grant,
Jared B. Hawkins,
John S. Brownstein,
Elaine O. Nsoesie
Abstract:
Social media data provides propitious opportunities for public health research. However, studies suggest that disparities may exist in the representation of certain populations (e.g., people of lower socioeconomic status). To quantify and address these disparities in population representation, we need demographic information, which is usually missing from most social media platforms. Here, we propose an ensemble approach for inferring demographics from social media data.
Several methods have been proposed for inferring demographic attributes such as age, gender, and race/ethnicity. However, most of these methods require large volumes of data, which makes their application to large-scale studies challenging. We develop a scalable approach that relies only on user names to predict gender. We develop three separate classifiers trained on data containing the gender labels of 7,953 Twitter users from Kaggle.com. Next, we combine predictions from the individual classifiers using a stacked generalization technique and apply the ensemble classifier to a dataset of 36,085 geotagged foodborne-illness-related tweets from the United States.
Our ensemble approach achieves an accuracy, precision, recall, and F1 score of 0.828, 0.851, 0.852 and 0.837, respectively, higher than the individual machine learning approaches. The ensemble classifier also covers any user with an alphanumeric name, while the data matching approach, which achieves an accuracy of 0.917, only covers 67% of users. Application of our method to reports of foodborne illness in the United States highlights disparities in tweeting by gender and shows that female Twitter users are heavily overrepresented in counties with a high volume of foodborne-illness-related tweets.
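The stacked generalization step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which three classifiers were used, so the base learners below (last letter, last two letters, exact name lookup), the logistic meta-learner, the function names, and the toy name list are all assumptions chosen only to show how out-of-fold base predictions feed a second-level model.

```python
import math
from collections import defaultdict

# Hypothetical toy data (1 = female, 0 = male); the paper uses 7,953 labeled Twitter users.
TRAIN = [("anna", 1), ("maria", 1), ("julia", 1), ("sophia", 1), ("emma", 1), ("olivia", 1),
         ("john", 0), ("mark", 0), ("peter", 0), ("david", 0), ("kevin", 0), ("brian", 0)]

# Three assumed base learners, each defined by the part of the name it looks at.
KEYS = [lambda n: n[-1], lambda n: n[-2:], lambda n: n]

def fit_key_model(rows, key):
    """Base learner: Laplace-smoothed P(female | key(name)); unseen keys score 0.5."""
    counts = defaultdict(lambda: [1.0, 1.0])  # [male, female] pseudo-counts
    for name, label in rows:
        counts[key(name)][label] += 1.0
    def score(name):
        male, female = counts.get(key(name), (1.0, 1.0))
        return female / (male + female)
    return score

def out_of_fold_features(rows, k=3):
    """Score each row with base learners that were trained without that row."""
    feats = [None] * len(rows)
    for fold in range(k):
        train = [rows[i] for i in range(len(rows)) if i % k != fold]
        models = [fit_key_model(train, key) for key in KEYS]
        for i in range(fold, len(rows), k):  # held-out rows of this fold
            feats[i] = [m(rows[i][0]) for m in models]
    return feats

def linear_score(w, x):
    """Bias plus weighted sum of base-learner scores, clamped to keep exp() safe."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return max(-30.0, min(30.0, z))

def fit_meta(feats, labels, epochs=500, lr=0.5):
    """Meta-learner: logistic regression fitted by plain stochastic gradient descent."""
    w = [0.0] * (len(feats[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            err = 1.0 / (1.0 + math.exp(-linear_score(w, x))) - y
            w[0] -= lr * err
            for j, xj in enumerate(x):
                w[j + 1] -= lr * err * xj
    return w

def fit_stack(rows):
    """Train the meta-learner on out-of-fold scores, then refit base learners on all rows."""
    w = fit_meta(out_of_fold_features(rows), [y for _, y in rows])
    models = [fit_key_model(rows, key) for key in KEYS]
    def predict_female(name):
        x = [m(name.lower()) for m in models]
        return 1.0 / (1.0 + math.exp(-linear_score(w, x)))
    return predict_female

predict_female = fit_stack(TRAIN)
for name in ("laura", "tom"):
    print(name, round(predict_female(name), 3))
```

Because the lookup learner returns 0.5 for any name it has not seen, the ensemble still produces a score for every non-empty name, which mirrors the coverage advantage the abstract reports over data matching alone.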
Submitted 6 November, 2017; v1 submitted 30 October, 2017;
originally announced October 2017.
-
Redrawing the 'Color Line': Examining Racial Segregation in Associative Networks on Twitter
Authors:
Nina Cesare,
Hedwig Lee,
Tyler McCormick,
Emma S. Spiro
Abstract:
Online social spaces are increasingly salient contexts for associative tie formation. However, the racial composition of associative networks within most of these spaces has yet to be examined. In this paper, we use data from the social media platform Twitter to examine racial segregation patterns in online associative networks. Acknowledging past work on the role that social structure and agency play in influencing the racial composition of individuals' networks, we argue that Twitter blurs the influence of these forces and may invite users to generate networks that are either more or less segregated than what has been observed offline, depending on use. While we expect to find some level of racial segregation within this space, this paper unpacks the extent to which we observe same-race connectedness for black and white users, assesses whether these patterns are likely generated by opportunity or by choice, and contextualizes results by comparing them with patterns of same-race connectedness observed offline.
Submitted 11 May, 2017;
originally announced May 2017.
-
How well can machine learning predict demographics of social media users?
Authors:
Nina Cesare,
Christan Grant,
Quynh Nguyen,
Hedwig Lee,
Elaine O. Nsoesie
Abstract:
The wide use of social media sites and other digital technologies has resulted in an unprecedented availability of digital data that are being used to study human behavior across research domains. Although unsolicited opinions and sentiments are available on these platforms, demographic details are usually missing. Demographic information is pertinent in fields such as demography and public health, where significant differences can exist across sex, racial and socioeconomic groups. In an attempt to address this shortcoming, a number of academic studies have proposed methods for inferring the demographics of social media users using details such as names, usernames, and network characteristics. Gender is the easiest trait to accurately infer, with measures of accuracy higher than 90 percent in some studies. Race, ethnicity and age tend to be more challenging to predict for a variety of reasons, including the novelty of social media to certain age groups and a lack of significant deviations in user details across racial and ethnic groups. Although the endeavor to predict user demographics is plagued with ethical questions regarding privacy and data ownership, knowing the demographics in a data sample can aid in addressing issues of bias and population representation, so that existing societal inequalities are not exacerbated.
Submitted 30 May, 2018; v1 submitted 6 February, 2017;
originally announced February 2017.