-
Computational Reproducibility in Computational Social Science
Authors:
David Schoch,
Chung-hong Chan,
Claudia Wagner,
Arnim Bleier
Abstract:
Replication crises have shaken the scientific landscape during the last decade. As potential solutions, open science practices were heavily discussed and have been implemented with varying success in different disciplines. We argue that computational-x disciplines, such as computational social science, are also susceptible to the symptoms of the crises, but in terms of reproducibility. We expand the binary definition of reproducibility into a tier system which allows increasing levels of reproducibility based on external verifiability to counteract the practice of open-washing. We provide solutions for barriers in Computational Social Science that hinder researchers from obtaining the highest level of reproducibility, including the use of alternate data sources and considering reproducibility proactively.
Submitted 4 October, 2023; v1 submitted 4 July, 2023;
originally announced July 2023.
-
PADME-SoSci: A Platform for Analytics and Distributed Machine Learning for the Social Sciences
Authors:
Zeyd Boukhers,
Arnim Bleier,
Yeliz Ucer Yediel,
Mio Hienstorfer-Heitmann,
Mehrshad Jaberansary,
Adamantios Koumpis,
Oya Beyan
Abstract:
Data privacy and ownership are significant in social data science, raising legal and ethical concerns. Sharing and analyzing data is difficult when different parties own different parts of it. An approach to this challenge is to apply de-identification or anonymization techniques to the data before collecting it for analysis. However, this can reduce data utility and increase the risk of re-identification. To address these limitations, we present PADME, a distributed analytics tool that federates model implementation and training. PADME uses a federated approach where the model is implemented and deployed by all parties and visits each data location incrementally for training. This enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location. Training the model on data in its original location preserves data ownership. Furthermore, the results are not provided until the analysis is completed on all data locations to ensure privacy and avoid bias in the results.
Submitted 3 April, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Characterizing the Global Crowd Workforce: A Cross-Country Comparison of Crowdworker Demographics
Authors:
Lisa Posch,
Arnim Bleier,
Fabian Flöck,
Clemens M. Lechner,
Katharina Kinder-Kurlanda,
Denis Helic,
Markus Strohmaier
Abstract:
Since its emergence roughly a decade ago, microtask crowdsourcing has been attracting a heterogeneous set of workers from all over the globe. This paper sets out to explore the characteristics of the international crowd workforce and offers a cross-national comparison of crowdworker populations from ten countries. We provide an analysis and comparison of demographic characteristics and shed light on the significance of microtask income for workers situated in different national contexts. With over 11,000 individual responses, this study is the first large-scale country-level analysis of the characteristics of workers on the platform Appen (formerly CrowdFlower and Figure Eight), one of the two platforms dominating the microtask market. We find large differences between the characteristics of the crowd workforces of different countries, both regarding demography and regarding the importance of microtask income for workers. Furthermore, we find that the composition of the workforce in the ten countries was largely stable across samples taken at different points in time.
Submitted 3 November, 2022; v1 submitted 14 December, 2018;
originally announced December 2018.
-
iLCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data
Authors:
Andreas Niekler,
Arnim Bleier,
Christian Kahmann,
Lisa Posch,
Gregor Wiedemann,
Kenan Erdogan,
Gerhard Heyer,
Markus Strohmaier
Abstract:
The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements for the reproducibility of data-driven research designs in the social sciences. For this, the iLCM research environment comprises two central components. The first is the Leipzig Corpus Miner (LCM), a decentralized SaaS application for the analysis of large amounts of news texts developed in a previous Digital Humanities project. Second, the text mining tools implemented in the LCM are extended by an "Open Research Computing" (ORC) environment for executable script documents, so-called "notebooks". This novel integration makes it possible to combine generic, high-performance methods for processing large amounts of unstructured text data with individual program scripts that address specific research requirements in computational social science and digital humanities.
Submitted 11 May, 2018;
originally announced May 2018.
-
Systematically Monitoring Social Media: The case of the German federal election 2017
Authors:
Sebastian Stier,
Arnim Bleier,
Malte Bonart,
Fabian Mörsheim,
Mahdi Bohlouli,
Margarita Nizhegorodov,
Lisa Posch,
Jürgen Maier,
Tobias Rothmund,
Steffen Staab
Abstract:
It is a considerable task to collect digital trace data at a large scale and at the same time adhere to established academic standards. In the context of political communication, important challenges are (1) defining the social media accounts and posts relevant to the campaign (content validity), (2) operationalizing the venues where relevant social media activity takes place (construct validity), (3) capturing all of the relevant social media activity (reliability), and (4) sharing as much data as possible for reuse and replication (objectivity). This project by GESIS - Leibniz Institute for the Social Sciences and the E-Democracy Program of the University of Koblenz-Landau conducted such an effort. We concentrated on the two social media networks of most political relevance, Facebook and Twitter.
Submitted 9 April, 2018;
originally announced April 2018.
-
Election campaigning on social media: Politicians, audiences and the mediation of political communication on Facebook and Twitter
Authors:
Sebastian Stier,
Arnim Bleier,
Haiko Lietz,
Markus Strohmaier
Abstract:
Although considerable research has concentrated on online campaigning, it is still unclear how politicians use different social media platforms in political communication. Focusing on the German federal election campaign 2013, this article investigates whether election candidates address the topics most important to the mass audience and to what extent their communication is shaped by the characteristics of Facebook and Twitter. Based on open-ended responses from a representative survey conducted during the election campaign, we train a human-interpretable Bayesian language model to identify political topics. Applying the model to social media messages of candidates and their direct audiences, we find that both prioritize different topics than the mass audience. The analysis also shows that politicians use Facebook and Twitter for different purposes. We relate the various findings to the mediation of political communication on social media induced by the particular characteristics of audiences and sociotechnical environments.
Submitted 26 January, 2018;
originally announced January 2018.
-
A Cross-Country Comparison of Crowdworker Motivations
Authors:
Lisa Posch,
Arnim Bleier,
Fabian Flöck,
Markus Strohmaier
Abstract:
Crowd employment is a new form of short-term employment that has been rapidly becoming a source of income for a vast number of people around the globe. It differs considerably from more traditional forms of work, yet similar ethical and optimization issues arise. One key to tackling such challenges is to understand what motivates the international crowd workforce. In this work, we study the motivation of workers involved in one particularly prevalent type of crowd employment: micro-tasks. We report on the results of applying the Multidimensional Crowdworker Motivation Scale (MCMS) in ten countries, which unveil significant international differences.
Submitted 8 November, 2017;
originally announced November 2017.
-
Measuring Motivations of Crowdworkers: The Multidimensional Crowdworker Motivation Scale
Authors:
Lisa Posch,
Arnim Bleier,
Clemens Lechner,
Daniel Danner,
Fabian Flöck,
Markus Strohmaier
Abstract:
Crowd employment is a new form of short-term and flexible employment which has emerged during the past decade. In order to understand this new form of employment, it is crucial to illuminate the underlying motivations of the workforce involved in it. This paper introduces the Multidimensional Crowdworker Motivation Scale (MCMS), a scale for measuring the motivation of crowdworkers on micro-task platforms. The MCMS is theoretically grounded in self-determination theory and tailored specifically to the context of paid crowdsourced micro-labor. The scale measures the motivation of crowdworkers along six motivational dimensions, ranging from amotivation to intrinsic motivation. We validated the MCMS on data collected in ten countries and three income groups. Factor analyses demonstrated that the MCMS's six dimensions showed good model fit, validity, and reliability. Furthermore, our measurement invariance tests showed that motivations measured with the MCMS are comparable across countries and income groups, and we present a first cross-country comparison of crowdworker motivations. This work constitutes an important first step towards understanding the motivations of the international crowd workforce.
Submitted 15 March, 2019; v1 submitted 6 February, 2017;
originally announced February 2017.
-
Truncation-free Hybrid Inference for DPMM
Authors:
Arnim Bleier
Abstract:
Dirichlet process mixture models (DPMM) are a cornerstone of Bayesian non-parametrics. While these models free the practitioner from choosing the number of components a priori, computationally attractive variational inference often reintroduces the need to do so, via a truncation on the variational distribution. In this paper we present a truncation-free hybrid inference for DPMM, combining the advantages of sampling-based MCMC and variational methods. The proposed hybridization enables more efficient variational updates, while increasing model complexity only if needed. We evaluate the properties of the hybrid updates and their empirical performance in single- as well as mixed-membership models. Our method is easy to implement and performs favorably compared to existing schemas.
Submitted 13 January, 2017;
originally announced January 2017.
-
A System for Probabilistic Linking of Thesauri and Classification Systems
Authors:
Lisa Posch,
Philipp Schaer,
Arnim Bleier,
Markus Strohmaier
Abstract:
This paper presents a system which creates and visualizes probabilistic semantic links between concepts in a thesaurus and classes in a classification system. For creating the links, we build on the Polylingual Labeled Topic Model (PLL-TM). PLL-TM identifies probable thesaurus descriptors for each class in the classification system by using information from the natural language text of documents, their assigned thesaurus descriptors and their designated classes. The links are then presented to users of the system in an interactive visualization, providing them with an automatically generated overview of the relations between the thesaurus and the classification system.
Submitted 21 March, 2016;
originally announced March 2016.
-
The Polylingual Labeled Topic Model
Authors:
Lisa Posch,
Arnim Bleier,
Philipp Schaer,
Markus Strohmaier
Abstract:
In this paper, we present the Polylingual Labeled Topic Model, a model which combines the characteristics of the existing Polylingual Topic Model and Labeled LDA. The model accounts for multiple languages with separate topic distributions for each language while restricting the permitted topics of a document to a set of predefined labels. We explore the properties of the model in a two-language setting on a dataset from the social science domain. Our experiments show that our model outperforms LDA and Labeled LDA in terms of held-out perplexity and that it produces semantically coherent topics which are readily interpretable by human subjects.
Submitted 24 July, 2015;
originally announced July 2015.
-
When Politicians Talk: Assessing Online Conversational Practices of Political Parties on Twitter
Authors:
Haiko Lietz,
Claudia Wagner,
Arnim Bleier,
Markus Strohmaier
Abstract:
Assessing political conversations in social media requires a deeper understanding of the underlying practices and styles that drive these conversations. In this paper, we present a computational approach for assessing online conversational practices of political parties. Following a deductive approach, we devise a number of quantitative measures from a discussion of theoretical constructs in sociological theory. The resulting measures make different - mostly qualitative - aspects of online conversational practices amenable to computation. We evaluate our computational approach by applying it in a case study. In particular, we study online conversational practices of German politicians on Twitter during the German federal election 2013. We find that political parties share some interesting patterns of behavior, but also exhibit some unique and interesting idiosyncrasies. Our work sheds light on (i) how complex cultural phenomena such as online conversational practices are amenable to quantification and (ii) the way social media such as Twitter are utilized by political parties.
Submitted 27 May, 2014;
originally announced May 2014.
-
Social Media Monitoring of the Campaigns for the 2013 German Bundestag Elections on Facebook and Twitter
Authors:
Lars Kaczmirek,
Philipp Mayr,
Ravi Vatrapu,
Arnim Bleier,
Manuela Blumenberg,
Tobias Gummer,
Abid Hussain,
Katharina Kinder-Kurlanda,
Kaveh Manshaei,
Mark Thamm,
Katrin Weller,
Alexander Wenz,
Christof Wolf
Abstract:
As more and more people use social media to communicate their view and perception of elections, researchers have increasingly been collecting and analyzing data from social media platforms. Our research focuses on social media communication related to the 2013 election of the German parliament [translation: Bundestagswahl 2013]. We constructed several social media datasets using data from Facebook and Twitter. First, we identified the most relevant candidates (n=2,346) and checked whether they maintained social media accounts. The Facebook data was collected in November 2013 for the period of January 2009 to October 2013. On Facebook we identified 1,408 Facebook walls containing approximately 469,000 posts. Twitter data was collected between June and December 2013 finishing with the constitution of the government. On Twitter we identified 1,009 candidates and 76 other agents, for example, journalists. We estimated the number of relevant tweets to exceed eight million for the period from July 27 to September 27 alone. In this document we summarize past research in the literature, discuss possibilities for research with our data set, explain the data collection procedures, and provide a description of the data and a discussion of issues for archiving and dissemination of social media data.
Submitted 1 April, 2014; v1 submitted 16 December, 2013;
originally announced December 2013.
-
Practical Collapsed Stochastic Variational Inference for the HDP
Authors:
Arnim Bleier
Abstract:
Recent advances have made it feasible to apply the stochastic variational paradigm to a collapsed representation of latent Dirichlet allocation (LDA). While the stochastic variational paradigm has successfully been applied to an uncollapsed representation of the hierarchical Dirichlet process (HDP), no attempts to apply this type of inference in a collapsed setting of non-parametric topic modeling have been put forward so far. In this paper we explore such a collapsed stochastic variational Bayes inference for the HDP. The proposed online algorithm is easy to implement and accounts for the inference of hyper-parameters. First experiments show a promising improvement in predictive performance.
Submitted 2 December, 2013;
originally announced December 2013.
-
Author Name Co-Mention Analysis: Testing a Poor Man's Author Co-Citation Analysis Method
Authors:
Andreas Strotmann,
Arnim Bleier
Abstract:
As a social science information service for the German language countries, we document research projects, publications, and data in relevant fields. At the same time, we aim to provide well-founded bibliometric studies of these fields. Performing a citation analysis on an area of the German social sciences is, however, a serious challenge given the low and likely significantly biased coverage of these fields in the standard citation databases. Citations, and especially author citations, play a highly significant role in that literature, however. In this work in progress, we report preliminary methods and results for an author name co-mention analysis of a large fragment of a particularly interesting corpus of German sociology: a quarter century's worth of the full-text proceedings of the Deutsche Gesellschaft fuer Soziologie (DGS), which celebrated its 100th anniversary meeting in 2012. Results are encouraging for this poor cousin of author co-citation analysis, but considerable refinements, especially of the underlying computational infrastructure for full-text analysis, appear advisable for full-scale deployment of this method.
Submitted 20 September, 2013;
originally announced September 2013.
-
When Politicians Tweet: A Study on the Members of the German Federal Diet
Authors:
Mark Thamm,
Arnim Bleier
Abstract:
In this preliminary study we compare the characteristics of retweets and replies on more than 350,000 messages collected by following members of the German Federal Diet on Twitter. We find significant differences in the characteristics, pointing to distinct types of usage for retweets and replies. Using time series and regression analysis we observe that the likelihood of a politician using replies increases during typical leisure times, while retweets occur at a constant rate over time. Including formal references increases the probability of a message being retweeted but lowers its chance of receiving a reply. This hints at a more professional use for retweets, while replies tend to have a personal connotation.
Submitted 8 May, 2013;
originally announced May 2013.
-
Towards an Author-Topic-Term-Model Visualization of 100 Years of German Sociological Society Proceedings
Authors:
Arnim Bleier,
Andreas Strotmann
Abstract:
Author co-citation studies employ factor analysis to reduce high-dimensional co-citation matrices to low-dimensional and possibly interpretable factors, but these studies do not use any information from the text bodies of publications. We hypothesise that term frequencies may yield useful information for scientometric analysis. In our work we ask if word features in combination with Bayesian analysis allow well-founded science mapping studies. This work goes back to the roots of Mosteller and Wallace's (1964) statistical text analysis using word frequency features and a Bayesian inference approach, though with different goals. To answer our research question we (i) introduce a new data set on which the experiments are carried out, (ii) describe the Bayesian model employed for inference and (iii) present first results of the analysis.
Submitted 6 May, 2013;
originally announced May 2013.
-
A simple non-parametric Topic Mixture for Authors and Documents
Authors:
Arnim Bleier
Abstract:
This article reviews the Author-Topic Model and presents a new non-parametric extension based on the Hierarchical Dirichlet Process. The extension is especially suitable when no prior information about the number of components necessary is available. A blocked Gibbs sampler is described, with a focus on staying as close as possible to the original model with only the minimum necessary theoretical and implementation overhead.
Submitted 4 December, 2012; v1 submitted 27 November, 2012;
originally announced November 2012.