-
Improving the quality of individual-level online information tracking: challenges of existing approaches and introduction of a new content- and long-tail sensitive academic solution
Authors:
Silke Adam,
Mykola Makhortykh,
Michaela Maier,
Viktor Aigenseer,
Aleksandra Urman,
Teresa Gil Lopez,
Clara Christner,
Ernesto de León,
Roberto Ulloa
Abstract:
This article evaluates the quality of data collection in individual-level desktop information tracking used in the social sciences and shows that the existing approaches face sampling issues, validity issues due to the lack of content-level data and their disregard of the variety of devices and long-tail consumption patterns, as well as transparency and privacy issues. To overcome some of these problems, the article introduces a new academic tracking solution, WebTrack, an open source tracking tool maintained by a major European research institution. The article discusses the design logic, the interfaces and the backend requirements for WebTrack, followed by a detailed examination of the tool's strengths and weaknesses. Finally, using data from 1185 participants, the article empirically illustrates how improved data collection through WebTrack leads to innovative shifts in the processing of tracking data. Because WebTrack allows collecting the content people are exposed to beyond classical news platforms, applying automated content analysis can strongly improve the detection of politics-related information consumption in tracking data compared to traditional approaches that rely on the list-based identification of news.
Submitted 5 March, 2024;
originally announced March 2024.
-
Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data
Authors:
Mykola Makhortykh,
Ernesto de León,
Aleksandra Urman,
Clara Christner,
Maryna Sydorova,
Silke Adam,
Michaela Maier,
Teresa Gil-Lopez
Abstract:
The growing availability of data about online information behaviour enables new possibilities for political communication research. However, the volume and variety of these data make them difficult to analyse and prompt the need for developing automated content approaches relying on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In this paper, we discuss how these techniques can be used to detect political content across different platforms. Using three validation datasets, which include a variety of political and non-political textual documents from online platforms, we systematically compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. We also examine the impact of different modes of data preprocessing (e.g. stemming and stopword removal) on the low-cost implementations of these techniques using a large set (n = 66) of detection models. Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models, in contrast to the more robust performance of dictionary-based models on noisy data.
Submitted 1 July, 2022;
originally announced July 2022.
-
Differentially Private Synthetic Data: Applied Evaluations and Enhancements
Authors:
Lucas Rosenblatt,
Xiaoyan Liu,
Samira Pouyanfar,
Eduardo de Leon,
Anuj Desai,
Joshua Allen
Abstract:
Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine learning scenarios with private internal data to researchers and practitioners alike. In addition, we propose QUAIL, an ensemble-based modeling approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint.
Submitted 10 November, 2020;
originally announced November 2020.
-
Flexible and Scalable Deep Learning with MMLSpark
Authors:
Mark Hamilton,
Sudarshan Raghunathan,
Akshaya Annavajhala,
Danil Kirsanov,
Eduardo de Leon,
Eli Barzilay,
Ilya Matiach,
Joe Davison,
Maureen Busch,
Miruna Oprescu,
Ratan Sur,
Roope Astala,
Tong Wen,
ChangYoung Park
Abstract:
In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit with the distributed computing framework Apache Spark. To achieve this, we have contributed Java language bindings to the Cognitive Toolkit, and added several new components to the Spark ecosystem. In addition, we integrate the popular image processing library OpenCV with Spark, present a tool for the automated generation of PySpark wrappers from any SparkML estimator, and use this tool to expose all work to the PySpark ecosystem. Finally, we provide a large library of tools for working and developing within the Spark ecosystem. We apply this work to the automated classification of snow leopards from camera trap images, and provide an end-to-end solution for the non-profit conservation organization, the Snow Leopard Trust.
Submitted 11 April, 2018;
originally announced April 2018.