Search | arXiv e-print repository

arXiv:2307.01918 [pdf, other]

Computational Reproducibility in Computational Social Science

Authors: David Schoch, Chung-hong Chan, Claudia Wagner, Arnim Bleier

Abstract: Replication crises have shaken the scientific landscape during the last decade. As potential solutions, open science practices were heavily discussed and have been implemented with varying success in different disciplines. We argue that computational-x disciplines, such as computational social science, are also susceptible to the symptoms of the crises, but in terms of reproducibility. We expand the binary definition of reproducibility into a tier system which allows increasing levels of reproducibility based on external verifiability to counteract the practice of open-washing. We provide solutions for barriers in Computational Social Science that hinder researchers from obtaining the highest level of reproducibility, including the use of alternate data sources and considering reproducibility proactively.

Submitted 4 October, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: v1: Working Paper; v2: fixed missing citation in text; v3: fixed some minor errors and formatting; v4: shortened paper

arXiv:2303.18200 [pdf, other]

PADME-SoSci: A Platform for Analytics and Distributed Machine Learning for the Social Sciences

Authors: Zeyd Boukhers, Arnim Bleier, Yeliz Ucer Yediel, Mio Hienstorfer-Heitmann, Mehrshad Jaberansary, Adamantios Koumpis, Oya Beyan

Abstract: Data privacy and ownership are significant in social data science, raising legal and ethical concerns. Sharing and analyzing data is difficult when different parties own different parts of it. An approach to this challenge is to apply de-identification or anonymization techniques to the data before collecting it for analysis. However, this can reduce data utility and increase the risk of re-identification. To address these limitations, we present PADME, a distributed analytics tool that federates model implementation and training. PADME uses a federated approach where the model is implemented and deployed by all parties and visits each data location incrementally for training. This enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location. Training the model on data in its original location preserves data ownership. Furthermore, the results are not provided until the analysis is completed on all data locations to ensure privacy and avoid bias in the results.

Submitted 3 April, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

Comments: accepted to be published @ ACM/IEEE JCDL 2023 - Joint Conference on Digital Libraries

arXiv:1812.05948 [pdf, other]

doi 10.15346/hc.v9i1.106

Characterizing the Global Crowd Workforce: A Cross-Country Comparison of Crowdworker Demographics

Authors: Lisa Posch, Arnim Bleier, Fabian Flöck, Clemens M. Lechner, Katharina Kinder-Kurlanda, Denis Helic, Markus Strohmaier

Abstract: Since its emergence roughly a decade ago, microtask crowdsourcing has been attracting a heterogeneous set of workers from all over the globe. This paper sets out to explore the characteristics of the international crowd workforce and offers a cross-national comparison of crowdworker populations from ten countries. We provide an analysis and comparison of demographic characteristics and shed light on the significance of microtask income for workers situated in different national contexts. With over 11,000 individual responses, this study is the first large-scale country-level analysis of the characteristics of workers on the platform Appen (formerly CrowdFlower and Figure Eight), one of the two platforms dominating the microtask market. We find large differences between the characteristics of the crowd workforces of different countries, both regarding demography and regarding the importance of microtask income for workers. Furthermore, we find that the composition of the workforce in the ten countries was largely stable across samples taken at different points in time.

Submitted 3 November, 2022; v1 submitted 14 December, 2018; originally announced December 2018.

Comments: 36 pages, 20 figures, final version as published in Human Computation

ACM Class: K.4

Journal ref: Human Computation, 9(1), 22-57 (2022)

arXiv:1805.11404 [pdf, other]

iLCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data

Authors: Andreas Niekler, Arnim Bleier, Christian Kahmann, Lisa Posch, Gregor Wiedemann, Kenan Erdogan, Gerhard Heyer, Markus Strohmaier

Abstract: The iLCM project pursues the development of an integrated research environment for the analysis of structured and unstructured data in a "Software as a Service" architecture (SaaS). The research environment addresses requirements for the quantitative evaluation of large amounts of qualitative data with text mining methods as well as requirements for the reproducibility of data-driven research designs in the social sciences. For this, the iLCM research environment comprises two central components. First, the Leipzig Corpus Miner (LCM), a decentralized SaaS application for the analysis of large amounts of news texts developed in a previous Digital Humanities project. Second, the text mining tools implemented in the LCM are extended by an "Open Research Computing" (ORC) environment for executable script documents, so-called "notebooks". This novel integration makes it possible to combine generic, high-performance methods for processing large amounts of unstructured text data with individual program scripts that address specific research requirements in computational social science and digital humanities.

Submitted 11 May, 2018; originally announced May 2018.

Comments: 11th edition of the Language Resources and Evaluation Conference (LREC)

arXiv:1804.02888 [pdf, other]

doi 10.17605/OSF.IO/5ZPM9

Systematically Monitoring Social Media: The case of the German federal election 2017

Authors: Sebastian Stier, Arnim Bleier, Malte Bonart, Fabian Mörsheim, Mahdi Bohlouli, Margarita Nizhegorodov, Lisa Posch, Jürgen Maier, Tobias Rothmund, Steffen Staab

Abstract: It is a considerable task to collect digital trace data at a large scale and at the same time adhere to established academic standards. In the context of political communication, important challenges are (1) defining the social media accounts and posts relevant to the campaign (content validity), (2) operationalizing the venues where relevant social media activity takes place (construct validity), (3) capturing all of the relevant social media activity (reliability), and (4) sharing as much data as possible for reuse and replication (objectivity). This project by GESIS - Leibniz Institute for the Social Sciences and the E-Democracy Program of the University of Koblenz-Landau conducted such an effort. We concentrated on the two social media networks of most political relevance, Facebook and Twitter.

Submitted 9 April, 2018; originally announced April 2018.

Comments: PID: http://nbn-resolving.de/urn:nbn:de:0168-ssoar-56149-4, GESIS Papers 2018|4

arXiv:1801.08825 [pdf, other]

Election campaigning on social media: Politicians, audiences and the mediation of political communication on Facebook and Twitter

Authors: Sebastian Stier, Arnim Bleier, Haiko Lietz, Markus Strohmaier

Abstract: Although considerable research has concentrated on online campaigning, it is still unclear how politicians use different social media platforms in political communication. Focusing on the German federal election campaign 2013, this article investigates whether election candidates address the topics most important to the mass audience and to what extent their communication is shaped by the characteristics of Facebook and Twitter. Based on open-ended responses from a representative survey conducted during the election campaign, we train a human-interpretable Bayesian language model to identify political topics. Applying the model to social media messages of candidates and their direct audiences, we find that both prioritize different topics than the mass audience. The analysis also shows that politicians use Facebook and Twitter for different purposes. We relate the various findings to the mediation of political communication on social media induced by the particular characteristics of audiences and sociotechnical environments.

Submitted 26 January, 2018; originally announced January 2018.

arXiv:1711.03115 [pdf, other]

A Cross-Country Comparison of Crowdworker Motivations

Authors: Lisa Posch, Arnim Bleier, Fabian Flöck, Markus Strohmaier

Abstract: Crowd employment is a new form of short-term employment that has been rapidly becoming a source of income for a vast number of people around the globe. It differs considerably from more traditional forms of work, yet similar ethical and optimization issues arise. One key to tackling such challenges is to understand what motivates the international crowd workforce. In this work, we study the motivation of workers involved in one particularly prevalent type of crowd employment: micro-tasks. We report on the results of applying the Multidimensional Crowdworker Motivation Scale (MCMS) in ten countries, which unveil significant international differences.

Submitted 8 November, 2017; originally announced November 2017.

Comments: 3rd Annual International Conference on Computational Social Science (IC2S2), 2017

arXiv:1702.01661 [pdf, other]

Measuring Motivations of Crowdworkers: The Multidimensional Crowdworker Motivation Scale

Authors: Lisa Posch, Arnim Bleier, Clemens Lechner, Daniel Danner, Fabian Flöck, Markus Strohmaier

Abstract: Crowd employment is a new form of short-term and flexible employment which has emerged during the past decade. In order to understand this new form of employment, it is crucial to illuminate the underlying motivations of the workforce involved in it. This paper introduces the Multidimensional Crowdworker Motivation Scale (MCMS), a scale for measuring the motivation of crowdworkers on micro-task platforms. The MCMS is theoretically grounded in self-determination theory and tailored specifically to the context of paid crowdsourced micro-labor. The scale measures the motivation of crowdworkers along six motivational dimensions, ranging from amotivation to intrinsic motivation. We validated the MCMS on data collected in ten countries and three income groups. Factor analyses demonstrated that the MCMS's six dimensions showed good model fit, validity, and reliability. Furthermore, our measurement invariance tests showed that motivations measured with the MCMS are comparable across countries and income groups, and we present a first cross-country comparison of crowdworker motivations. This work constitutes an important first step towards understanding the motivations of the international crowd workforce.

Submitted 15 March, 2019; v1 submitted 6 February, 2017; originally announced February 2017.

Comments: 33 pages; added section; additional validation; corrected typos

arXiv:1701.03743 [pdf, other]

Truncation-free Hybrid Inference for DPMM

Authors: Arnim Bleier

Abstract: Dirichlet process mixture models (DPMM) are a cornerstone of Bayesian non-parametrics. While these models free the user from choosing the number of components a priori, computationally attractive variational inference often reintroduces the need to do so, via a truncation on the variational distribution. In this paper we present a truncation-free hybrid inference for DPMM, combining the advantages of sampling-based MCMC and variational methods. The proposed hybridization enables more efficient variational updates, while increasing model complexity only if needed. We evaluate the properties of the hybrid updates and their empirical performance in single- as well as mixed-membership models. Our method is easy to implement and performs favorably compared to existing schemas.

Submitted 13 January, 2017; originally announced January 2017.

Comments: NIPS 2016 Workshop: Advances in Approximate Bayesian Inference

arXiv:1603.06485 [pdf, other]

doi 10.1007/s13218-015-0413-9

A System for Probabilistic Linking of Thesauri and Classification Systems

Authors: Lisa Posch, Philipp Schaer, Arnim Bleier, Markus Strohmaier

Abstract: This paper presents a system which creates and visualizes probabilistic semantic links between concepts in a thesaurus and classes in a classification system. For creating the links, we build on the Polylingual Labeled Topic Model (PLL-TM). PLL-TM identifies probable thesaurus descriptors for each class in the classification system by using information from the natural language text of documents, their assigned thesaurus descriptors and their designated classes. The links are then presented to users of the system in an interactive visualization, providing them with an automatically generated overview of the relations between the thesaurus and the classification system.

Submitted 21 March, 2016; originally announced March 2016.

Journal ref: KI - Künstliche Intelligenz, 2015

arXiv:1507.06829 [pdf, other]

doi 10.1007/978-3-319-24489-1_26

The Polylingual Labeled Topic Model

Authors: Lisa Posch, Arnim Bleier, Philipp Schaer, Markus Strohmaier

Abstract: In this paper, we present the Polylingual Labeled Topic Model, a model which combines the characteristics of the existing Polylingual Topic Model and Labeled LDA. The model accounts for multiple languages with separate topic distributions for each language while restricting the permitted topics of a document to a set of predefined labels. We explore the properties of the model in a two-language setting on a dataset from the social science domain. Our experiments show that our model outperforms LDA and Labeled LDA in terms of their held-out perplexity and that it produces semantically coherent topics which are well interpretable by human subjects.

Submitted 24 July, 2015; originally announced July 2015.

Comments: Accepted for publication at KI 2015 (38th edition of the German Conference on Artificial Intelligence)

ACM Class: G.3; I.2.7

arXiv:1405.6824 [pdf, other]

When Politicians Talk: Assessing Online Conversational Practices of Political Parties on Twitter

Authors: Haiko Lietz, Claudia Wagner, Arnim Bleier, Markus Strohmaier

Abstract: Assessing political conversations in social media requires a deeper understanding of the underlying practices and styles that drive these conversations. In this paper, we present a computational approach for assessing online conversational practices of political parties. Following a deductive approach, we devise a number of quantitative measures from a discussion of theoretical constructs in sociological theory. The resulting measures make different - mostly qualitative - aspects of online conversational practices amenable to computation. We evaluate our computational approach by applying it in a case study. In particular, we study online conversational practices of German politicians on Twitter during the German federal election 2013. We find that political parties share some interesting patterns of behavior, but also exhibit some unique and interesting idiosyncrasies. Our work sheds light on (i) how complex cultural phenomena such as online conversational practices are amenable to quantification and (ii) the way social media such as Twitter are utilized by political parties.

Submitted 27 May, 2014; originally announced May 2014.

Comments: 10 pages, 2 figures, 3 tables, Proc. 8th International AAAI Conference on Weblogs and Social Media (ICWSM 2014)

arXiv:1312.4476 [pdf]

Social Media Monitoring of the Campaigns for the 2013 German Bundestag Elections on Facebook and Twitter

Authors: Lars Kaczmirek, Philipp Mayr, Ravi Vatrapu, Arnim Bleier, Manuela Blumenberg, Tobias Gummer, Abid Hussain, Katharina Kinder-Kurlanda, Kaveh Manshaei, Mark Thamm, Katrin Weller, Alexander Wenz, Christof Wolf

Abstract: As more and more people use social media to communicate their view and perception of elections, researchers have increasingly been collecting and analyzing data from social media platforms. Our research focuses on social media communication related to the 2013 election of the German parliament [translation: Bundestagswahl 2013]. We constructed several social media datasets using data from Facebook and Twitter. First, we identified the most relevant candidates (n=2,346) and checked whether they maintained social media accounts. The Facebook data was collected in November 2013 for the period of January 2009 to October 2013. On Facebook we identified 1,408 Facebook walls containing approximately 469,000 posts. Twitter data was collected between June and December 2013 finishing with the constitution of the government. On Twitter we identified 1,009 candidates and 76 other agents, for example, journalists. We estimated the number of relevant tweets to exceed eight million for the period from July 27 to September 27 alone. In this document we summarize past research in the literature, discuss possibilities for research with our data set, explain the data collection procedures, and provide a description of the data and a discussion of issues for archiving and dissemination of social media data.

Submitted 1 April, 2014; v1 submitted 16 December, 2013; originally announced December 2013.

Comments: 29 pages, 2 figures, GESIS-Working Papers No. 31

arXiv:1312.0412 [pdf, other]

Practical Collapsed Stochastic Variational Inference for the HDP

Authors: Arnim Bleier

Abstract: Recent advances have made it feasible to apply the stochastic variational paradigm to a collapsed representation of latent Dirichlet allocation (LDA). While the stochastic variational paradigm has successfully been applied to an uncollapsed representation of the hierarchical Dirichlet process (HDP), no attempts to apply this type of inference in a collapsed setting of non-parametric topic modeling have been put forward so far. In this paper we explore such a collapsed stochastic variational Bayes inference for the HDP. The proposed online algorithm is easy to implement and accounts for the inference of hyper-parameters. First experiments show a promising improvement in predictive performance.

Submitted 2 December, 2013; originally announced December 2013.

Comments: NIPS Workshop; Topic Models: Computation, Application, and Evaluation

arXiv:1309.5256 [pdf]

Author Name Co-Mention Analysis: Testing a Poor Man's Author Co-Citation Analysis Method

Authors: Andreas Strotmann, Arnim Bleier

Abstract: As a social science information service for the German language countries, we document research projects, publications, and data in relevant fields. At the same time, we aim to provide well-founded bibliometric studies of these fields. Performing a citation analysis on an area of the German social sciences is, however, a serious challenge given the low and likely significantly biased coverage of these fields in the standard citation databases. Citations, and especially author citations, play a highly significant role in that literature, however. In this work in progress, we report preliminary methods and results for an author name co-mention analysis of a large fragment of a particularly interesting corpus of German sociology: a quarter century's worth of the full-text proceedings of the Deutsche Gesellschaft fuer Soziologie (DGS), which celebrated its 100th anniversary meeting in 2012. Results are encouraging for this poor cousin of author co-citation analysis, but considerable refinements, especially of the underlying computational infrastructure for full-text analysis, appear advisable for full-scale deployment of this method.

Submitted 20 September, 2013; originally announced September 2013.

Comments: 14th International Society of Scientometrics and Informetrics Conference

arXiv:1305.1734 [pdf]

When Politicians Tweet: A Study on the Members of the German Federal Diet

Authors: Mark Thamm, Arnim Bleier

Abstract: In this preliminary study we compare the characteristics of retweets and replies on more than 350,000 messages collected by following members of the German Federal Diet on Twitter. We find significant differences in the characteristics pointing to distinct types of usages for retweets and replies. Using time series and regression analysis we observe that the likelihood of a politician using replies increases with typical leisure times while retweets occur at a constant rate over time. Including formal references increases the probability of a message being retweeted but lowers its chance of being replied to. This hints at a more professional use for retweets while replies tend to have a personal connotation.

Submitted 8 May, 2013; originally announced May 2013.

Comments: 6 pages, ACM Web Science 2013

ACM Class: H.1.2

arXiv:1305.1343 [pdf, other]

Towards an Author-Topic-Term-Model Visualization of 100 Years of German Sociological Society Proceedings

Authors: Arnim Bleier, Andreas Strotmann

Abstract: Author co-citation studies employ factor analysis to reduce high-dimensional co-citation matrices to low-dimensional and possibly interpretable factors, but these studies do not use any information from the text bodies of publications. We hypothesise that term frequencies may yield useful information for scientometric analysis. In our work we ask if word features in combination with Bayesian analysis allow well-founded science mapping studies. This work goes back to the roots of Mosteller and Wallace's (1964) statistical text analysis using word frequency features and a Bayesian inference approach, though with different goals. To answer our research question we (i) introduce a new data set on which the experiments are carried out, (ii) describe the Bayesian model employed for inference and (iii) present first results of the analysis.

Submitted 6 May, 2013; originally announced May 2013.

Comments: Accepted: 14th International Society of Scientometrics and Informetrics Conference, Vienna Austria 15-19th July 2013

arXiv:1211.6248 [pdf, ps, other]

A simple non-parametric Topic Mixture for Authors and Documents

Authors: Arnim Bleier

Abstract: This article reviews the Author-Topic Model and presents a new non-parametric extension based on the Hierarchical Dirichlet Process. The extension is especially suitable when no prior information about the number of components necessary is available. A blocked Gibbs sampler is described, and the focus is put on staying as close as possible to the original model with only the minimum of theoretical and implementation overhead necessary.

Submitted 4 December, 2012; v1 submitted 27 November, 2012; originally announced November 2012.

Showing 1–18 of 18 results for author: Bleier, A