Search | arXiv e-print repository

arXiv:2403.02931 [pdf]

Improving the quality of individual-level online information tracking: challenges of existing approaches and introduction of a new content- and long-tail sensitive academic solution

Authors: Silke Adam, Mykola Makhortykh, Michaela Maier, Viktor Aigenseer, Aleksandra Urman, Teresa Gil Lopez, Clara Christner, Ernesto de León, Roberto Ulloa

Abstract: This article evaluates the quality of data collection in individual-level desktop information tracking used in the social sciences and shows that the existing approaches face sampling issues, validity issues due to the lack of content-level data and their disregard of the variety of devices and long-tail consumption patterns as well as transparency and privacy issues. To overcome some of these pro… ▽ More This article evaluates the quality of data collection in individual-level desktop information tracking used in the social sciences and shows that the existing approaches face sampling issues, validity issues due to the lack of content-level data and their disregard of the variety of devices and long-tail consumption patterns as well as transparency and privacy issues. To overcome some of these problems, the article introduces a new academic tracking solution, WebTrack, an open source tracking tool maintained by a major European research institution. The design logic, the interfaces and the backend requirements for WebTrack, followed by a detailed examination of strengths and weaknesses of the tool, are discussed. Finally, using data from 1185 participants, the article empirically illustrates how an improvement in the data collection through WebTrack leads to new innovative shifts in the processing of tracking data. As WebTrack allows collecting the content people are exposed to on more than classical news platforms, we can strongly improve the detection of politics-related information consumption in tracking data with the application of automated content analysis compared to traditional approaches that rely on the list-based identification of news. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: 73 pages

arXiv:2207.00489 [pdf]

Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data

Authors: Mykola Makhortykh, Ernesto de León, Aleksandra Urman, Clara Christner, Maryna Sydorova, Silke Adam, Michaela Maier, Teresa Gil-Lopez

Abstract: The growing availability of data about online information behaviour enables new possibilities for political communication research. However, the volume and variety of these data makes them difficult to analyse and prompts the need for develo** automated content approaches relying on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In… ▽ More The growing availability of data about online information behaviour enables new possibilities for political communication research. However, the volume and variety of these data makes them difficult to analyse and prompts the need for develo** automated content approaches relying on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In this paper, we discuss how these techniques can be used to detect political content across different platforms. Using three validation datasets, which include a variety of political and non-political textual documents from online platforms, we systematically compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. We also examine the impact of different modes of data preprocessing (e.g. stemming and stopword removal) on the low-cost implementations of these techniques using a large set (n = 66) of detection models. Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models, in contrast to the more robust performance of dictionary-based models on noisy data. △ Less

Submitted 1 July, 2022; originally announced July 2022.

arXiv:2011.05537 [pdf, other]

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Authors: Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, Joshua Allen

Abstract: Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the… ▽ More Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine learning scenarios with private internal data to researchers and practioners alike. In addition, we propose QUAIL, an ensemble-based modeling approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: Under Review

arXiv:1804.04031 [pdf, other]

Flexible and Scalable Deep Learning with MMLSpark

Authors: Mark Hamilton, Sudarshan Raghunathan, Akshaya Annavajhala, Danil Kirsanov, Eduardo de Leon, Eli Barzilay, Ilya Matiach, Joe Davison, Maureen Busch, Miruna Oprescu, Ratan Sur, Roope Astala, Tong Wen, ChangYoung Park

Abstract: In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java Language bindings to the Cognitive Toolkit, and added several new components to the Spark ecosystem. In addition, we also integrate the popular image processing libra… ▽ More In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java Language bindings to the Cognitive Toolkit, and added several new components to the Spark ecosystem. In addition, we also integrate the popular image processing library OpenCV with Spark, and present a tool for the automated generation of PySpark wrappers from any SparkML estimator and use this tool to expose all work to the PySpark ecosystem. Finally, we provide a large library of tools for working and develo** within the Spark ecosystem. We apply this work to the automated classification of Snow Leopards from camera trap images, and provide an end to end solution for the non-profit conservation organization, the Snow Leopard Trust. △ Less

Submitted 11 April, 2018; originally announced April 2018.

Journal ref: Proceedings of Machine Learning Research 82 (2017) 11-22, 4th International Conference on Predictive Applications and APIs

Showing 1–4 of 4 results for author: de León, E