Search | arXiv e-print repository

arXiv:2406.19569 [pdf, other]

On the Centralization and Regionalization of the Web

Authors: Gautam Akiwate, Kimberly Ruth, Rumaisa Habib, Zakir Durumeric

Abstract: Over the past decade, Internet centralization and its implications for both people and the resilience of the Internet has become a topic of active debate. While the networking community informally agrees on the definition of centralization, we lack a formal metric for quantifying centralization, which limits research beyond descriptive analysis. In this work, we introduce a statistical measure for… ▽ More Over the past decade, Internet centralization and its implications for both people and the resilience of the Internet has become a topic of active debate. While the networking community informally agrees on the definition of centralization, we lack a formal metric for quantifying centralization, which limits research beyond descriptive analysis. In this work, we introduce a statistical measure for Internet centralization, which we use to better understand how the web is centralized across four layers of web infrastructure (hosting providers, DNS infrastructure, TLDs, and certificate authorities) in 150~countries. Our work uncovers significant geographical variation, as well as a complex interplay between centralization and sociopolitically driven regionalization. We hope that our work can serve as the foundation for more nuanced analysis to inform this important debate. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.15585 [pdf, other]

Ten Years of ZMap

Authors: Zakir Durumeric, David Adrian, Phillip Stephens, Eric Wustrow, J. Alex Halderman

Abstract: Since ZMap's debut in 2013, networking and security researchers have used the open-source scanner to write hundreds of research papers that study Internet behavior. In addition, ZMap powers much of the attack-surface management and security ratings industries, and more than a dozen security companies have built products on top of ZMap. Behind the scenes, much of ZMap's behavior - ranging from its… ▽ More Since ZMap's debut in 2013, networking and security researchers have used the open-source scanner to write hundreds of research papers that study Internet behavior. In addition, ZMap powers much of the attack-surface management and security ratings industries, and more than a dozen security companies have built products on top of ZMap. Behind the scenes, much of ZMap's behavior - ranging from its pseudorandom IP generation to its packet construction - has quietly evolved as we have learned more about how to scan the Internet. In this work, we quantify ZMap's adoption over the ten years since its release, describe its modern behavior (and the measurements that motivated those changes), and offer lessons from releasing and maintaining ZMap. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: 12 pages, 7 figures, in submission at Internet Measurement Conference 2024

arXiv:2404.11763 [pdf, other]

The Code the World Depends On: A First Look at Technology Makers' Open Source Software Dependencies

Authors: Cadence Patrick, Kimberly Ruth, Zakir Durumeric

Abstract: Open-source software (OSS) supply chain security has become a topic of concern for organizations. Patching an OSS vulnerability can require updating other dependent software products in addition to the original package. However, the landscape of OSS dependencies is not well explored: we do not know what packages are most critical to patch, hindering efforts to improve OSS security where it is most… ▽ More Open-source software (OSS) supply chain security has become a topic of concern for organizations. Patching an OSS vulnerability can require updating other dependent software products in addition to the original package. However, the landscape of OSS dependencies is not well explored: we do not know what packages are most critical to patch, hindering efforts to improve OSS security where it is most needed. There is thus a need to understand OSS usage in major software and device makers' products. Our work takes a first step toward closing this knowledge gap. We investigate published OSS dependency information for 108 major software and device makers, cataloging how available and how detailed this information is and identifying the OSS packages that appear the most frequently in our data. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2402.06099 [pdf, other]

CATO: End-to-End Optimization of ML-Based Traffic Analysis Pipelines

Authors: Gerry Wan, Shinan Liu, Francesco Bronzino, Nick Feamster, Zakir Durumeric

Abstract: Machine learning has shown tremendous potential for improving the capabilities of network traffic analysis applications, often outperforming simpler rule-based heuristics. However, ML-based solutions remain difficult to deploy in practice. Many existing approaches only optimize the predictive performance of their models, overlooking the practical challenges of running them against network traffic… ▽ More Machine learning has shown tremendous potential for improving the capabilities of network traffic analysis applications, often outperforming simpler rule-based heuristics. However, ML-based solutions remain difficult to deploy in practice. Many existing approaches only optimize the predictive performance of their models, overlooking the practical challenges of running them against network traffic in real time. This is especially problematic in the domain of traffic analysis, where the efficiency of the serving pipeline is a critical factor in determining the usability of a model. In this work, we introduce CATO, a framework that addresses this problem by jointly optimizing the predictive performance and the associated systems costs of the serving pipeline. CATO leverages recent advances in multi-objective Bayesian optimization to efficiently identify Pareto-optimal configurations, and automatically compiles end-to-end optimized serving pipelines that can be deployed in real networks. Our evaluations show that compared to popular feature optimization techniques, CATO can provide up to 3600x lower inference latency and 3.7x higher zero-loss throughput while simultaneously achieving better model performance. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2401.11032 [pdf, other]

PressProtect: Hel** Journalists Navigate Social Media in the Face of Online Harassment

Authors: Catherine Han, Anne Li, Deepak Kumar, Zakir Durumeric

Abstract: Social media has become a critical tool for journalists to disseminate their work, engage with their audience, and connect with sources. Unfortunately, journalists also regularly endure significant online harassment on social media platforms, ranging from personal attacks to doxxing to threats of physical harm. In this paper, we seek to understand how we can make social media usable for journalist… ▽ More Social media has become a critical tool for journalists to disseminate their work, engage with their audience, and connect with sources. Unfortunately, journalists also regularly endure significant online harassment on social media platforms, ranging from personal attacks to doxxing to threats of physical harm. In this paper, we seek to understand how we can make social media usable for journalists who face constant digital harassment. To begin, we conduct a set of need-finding interviews to understand where existing platform tools and newsroom resources fall short in adequately protecting journalists. We map journalists' unmet needs to concrete design goals, which we use to build PressProtect, an interface that provides journalists greater agency over engaging with readers on Twitter/X. Through user testing with eight journalists, we evaluate PressProtect and find that participants felt it effectively protected them against harassment and could also generalize to serve other visible and vulnerable groups. We conclude with a discussion of our findings and recommendations for social platforms ho** to build defensive defaults for journalists facing online harassment. △ Less

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2310.14450 [pdf, other]

TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings

Authors: Hans W. A. Hanley, Zakir Durumeric

Abstract: Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variet… ▽ More Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 $F_1$-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata. △ Less

Submitted 8 February, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023; Updated citations

arXiv:2309.14517 [pdf, other]

Watch Your Language: Investigating Content Moderation with Large Language Models

Authors: Deepak Kumar, Yousef AbuHashem, Zakir Durumeric

Abstract: Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm, however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation ta… ▽ More Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm, however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we instantiate 95 subcommunity specific LLMs by prompting GPT-3.5 with rules from 95 Reddit subcommunities. We find that GPT-3.5 is effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we evaluate a suite of commodity LLMs (GPT-3, GPT-3.5, GPT-4, Gemini Pro, LLAMA 2) and show that LLMs significantly outperform currently widespread toxicity classifiers. However, recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation. △ Less

Submitted 17 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

arXiv:2309.13496 [pdf, other]

doi 10.1145/3471621.3473500

Stratosphere: Finding Vulnerable Cloud Storage Buckets

Authors: Jack Cable, Drew Gregory, Liz Izhikevich, Zakir Durumeric

Abstract: Misconfigured cloud storage buckets have leaked hundreds of millions of medical, voter, and customer records. These breaches are due to a combination of easily-guessable bucket names and error-prone security configurations, which, together, allow attackers to easily guess and access sensitive data. In this work, we investigate the security of buckets, finding that prior studies have largely undere… ▽ More Misconfigured cloud storage buckets have leaked hundreds of millions of medical, voter, and customer records. These breaches are due to a combination of easily-guessable bucket names and error-prone security configurations, which, together, allow attackers to easily guess and access sensitive data. In this work, we investigate the security of buckets, finding that prior studies have largely underestimated cloud insecurity by focusing on simple, easy-to-guess names. By leveraging prior work in the password analysis space, we introduce Stratosphere, a system that learns how buckets are named in practice in order to efficiently guess the names of vulnerable buckets. Using Stratosphere, we find wide-spread exploitation of buckets and vulnerable configurations continuing to increase over the years. We conclude with recommendations for operators, researchers, and cloud providers. △ Less

Submitted 23 September, 2023; originally announced September 2023.

Comments: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses. 2021

arXiv:2309.13495 [pdf, other]

doi 10.1145/3517745.3561434

ZDNS: A Fast DNS Toolkit for Internet Measurement

Authors: Liz Izhikevich, Gautam Akiwate, Briana Berger, Spencer Drakontaidis, Anna Ascheman, Paul Pearce, David Adrian, Zakir Durumeric

Abstract: Active DNS measurement is fundamental to understanding and improving the DNS ecosystem. However, the absence of an extensible, high-performance, and easy-to-use DNS toolkit has limited both the reproducibility and coverage of DNS research. In this paper, we introduce ZDNS, a modular and open-source active DNS measurement framework optimized for large-scale research studies of DNS on the public Int… ▽ More Active DNS measurement is fundamental to understanding and improving the DNS ecosystem. However, the absence of an extensible, high-performance, and easy-to-use DNS toolkit has limited both the reproducibility and coverage of DNS research. In this paper, we introduce ZDNS, a modular and open-source active DNS measurement framework optimized for large-scale research studies of DNS on the public Internet. We describe ZDNS' architecture, evaluate its performance, and present two case studies that highlight how the tool can be used to shed light on the operational complexities of DNS. We hope that ZDNS will enable researchers to better -- and in a more reproducible manner -- understand Internet behavior. △ Less

Submitted 23 September, 2023; originally announced September 2023.

Comments: Proceedings of the 22nd ACM Internet Measurement Conference. 2022

arXiv:2309.13471 [pdf, other]

doi 10.1145/3618257.3624818

Cloud Watching: Understanding Attacks Against Cloud-Hosted Services

Authors: Liz Izhikevich, Manda Tran, Michalis Kallitsis, Aurore Fass, Zakir Durumeric

Abstract: Cloud computing has dramatically changed service deployment patterns. In this work, we analyze how attackers identify and target cloud services in contrast to traditional enterprise networks and network telescopes. Using a diverse set of cloud honeypots in 5~providers and 23~countries as well as 2~educational networks and 1~network telescope, we analyze how IP address assignment, geography, networ… ▽ More Cloud computing has dramatically changed service deployment patterns. In this work, we analyze how attackers identify and target cloud services in contrast to traditional enterprise networks and network telescopes. Using a diverse set of cloud honeypots in 5~providers and 23~countries as well as 2~educational networks and 1~network telescope, we analyze how IP address assignment, geography, network, and service-port selection, influence what services are targeted in the cloud. We find that scanners that target cloud compute are selective: they avoid scanning networks without legitimate services and they discriminate between geographic regions. Further, attackers mine Internet-service search engines to find exploitable services and, in some cases, they avoid targeting IANA-assigned protocols, causing researchers to misclassify at least 15\% of traffic on select ports. Based on our results, we derive recommendations for researchers and operators. △ Less

Submitted 28 September, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: Proceedings of the 2023 ACM Internet Measurement Conference (IMC '23), October 24--26, 2023, Montreal, QC, Canada

arXiv:2308.02068 [pdf, other]

Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale

Authors: Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

Abstract: Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences on public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, utilizing daily scrapes of… ▽ More Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences on public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, utilizing daily scrapes of 1,334 unreliable news websites, the large-language model MPNet, and DP-Means clustering, we introduce a system to automatically identify and track the narratives spread within online ecosystems. Identifying 52,036 narratives on these 1,334 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and amplify narratives. Finally, we show how our system can be utilized to detect new narratives originating from unreliable news websites and to aid fact-checkers in more quickly addressing misinformation. We release code and data at https://github.com/hanshanley/specious-sites. △ Less

Submitted 2 February, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: Accepted to IEEE S&P 2024. Updated Emails

arXiv:2307.10349 [pdf, other]

Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter

Authors: Hans W. A. Hanley, Zakir Durumeric

Abstract: Social media platforms are often blamed for exacerbating political polarization and worsening public dialogue. Many claim hyperpartisan users post pernicious content, slanted to their political views, inciting contentious and toxic conversations. However, what factors, actually contribute to increased online toxicity and negative interactions? In this work, we explore the role that political ideol… ▽ More Social media platforms are often blamed for exacerbating political polarization and worsening public dialogue. Many claim hyperpartisan users post pernicious content, slanted to their political views, inciting contentious and toxic conversations. However, what factors, actually contribute to increased online toxicity and negative interactions? In this work, we explore the role that political ideology plays in contributing to toxicity both on an individual user level and a topic level on Twitter. To do this, we train and open-source a DeBERTa-based toxicity detector with a contrastive objective that outperforms the Google Jigsaw Persective Toxicity detector on the Civil Comments test dataset. Then, after collecting 187 million tweets from 55,415 Twitter users, we determine how several account-level characteristics, including political ideology and account age, predict how often each user posts toxic content. Running a linear regression, we find that the diversity of views and the toxicity of the other accounts with which that user engages has a more marked effect on their own toxicity. Namely, toxic comments are correlated with users who engage with a wider array of political views. Performing topic analysis on the toxic content posted by these accounts using the large language model MPNet and a version of the DP-Means clustering algorithm, we find similar behavior across 6,592 individual topics, with conversations on each topic becoming more toxic as a wider diversity of users become involved. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2306.07469 [pdf, other]

Democratizing LEO Satellite Network Measurement

Authors: Liz Izhikevich, Manda Tran, Katherine Izhikevich, Gautam Akiwate, Zakir Durumeric

Abstract: Low Earth Orbit (LEO) satellite networks are quickly gaining traction with promises of impressively low latency, high bandwidth, and global reach. However, the research community knows relatively little about their operation and performance in practice. The obscurity is largely due to the high barrier of entry for measuring LEO networks, which requires deploying specialized hardware or recruiting… ▽ More Low Earth Orbit (LEO) satellite networks are quickly gaining traction with promises of impressively low latency, high bandwidth, and global reach. However, the research community knows relatively little about their operation and performance in practice. The obscurity is largely due to the high barrier of entry for measuring LEO networks, which requires deploying specialized hardware or recruiting large numbers of satellite Internet customers. In this paper, we introduce HitchHiking, a methodology that democratizes global visibility into LEO satellite networks. HitchHiking builds on the observation that Internet-exposed services that use LEO Internet can reveal satellite network architecture and performance, bypassing the need for specialized hardware. We evaluate HitchHiking against ground truth measurements and prior methods, showing that it provides more coverage and accuracy. With HitchHiking, we complete the largest study to date of Starlink network latency, measuring over 2,400 users across 13 countries. We uncover unexpected patterns in latency that surface how LEO routing is more complex than previously understood. Finally, we conclude with recommendations for future research on LEO networks. △ Less

Submitted 12 October, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: Pre-Print

Journal ref: ACM SIGMETRICS/IFIP Performance 2024

arXiv:2305.09820 [pdf, other]

Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites

Authors: Hans W. A. Hanley, Zakir Durumeric

Abstract: As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun utilizing them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize LLMs to mass produce misinformation. To begin to understand this phenomenon, we present one of the firs… ▽ More As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun utilizing them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize LLMs to mass produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.46 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 57.3% on mainstream websites while increasing by 474% on misinformation sites. We find that this increase is largely driven by smaller less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted-time-series, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites. △ Less

Submitted 19 March, 2024; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: Accepted to ICWSM 2024

arXiv:2303.00895 [pdf, other]

doi 10.1145/3544216.3544249

Predicting IPv4 Services Across All Ports

Authors: Liz Izhikevich, Renata Teixeira, Zakir Durumeric

Abstract: Internet-wide scanning is commonly used to understand the topology and security of the Internet. However, IPv4 Internet scans have been limited to scanning only a subset of services -- exhaustively scanning all IPv4 services is too costly and no existing bandwidth-saving frameworks are designed to scan IPv4 addresses across all ports. In this work we introduce GPS, a system that efficiently discov… ▽ More Internet-wide scanning is commonly used to understand the topology and security of the Internet. However, IPv4 Internet scans have been limited to scanning only a subset of services -- exhaustively scanning all IPv4 services is too costly and no existing bandwidth-saving frameworks are designed to scan IPv4 addresses across all ports. In this work we introduce GPS, a system that efficiently discovers Internet services across all ports. GPS runs a predictive framework that learns from extremely small sample sizes and is highly parallelizable, allowing it to quickly find patterns between services across all 65K ports and a myriad of features. GPS computes service predictions in 13 minutes (four orders of magnitude faster than prior work) and finds 92.5% of services across all ports with 131x less bandwidth, and 204x more precision, compared to exhaustive scanning. GPS is the first work to show that, given at least two responsive IP addresses on a port to train from, predicting the majority of services across all ports is possible and practical. △ Less

Submitted 1 March, 2023; originally announced March 2023.

Journal ref: ACM SIGCOMM 2022 Conference (SIGCOMM '22), August 22--26, 2022, Amsterdam, Netherlands

arXiv:2301.11486 [pdf, other]

Sub-Standards and Mal-Practices: Misinformation's Role in Insular, Polarized, and Toxic Interactions

Authors: Hans W. A. Hanley, Zakir Durumeric

Abstract: How do users and communities respond to news from unreliable sources? How does news from these sources change online conversations? In this work, we examine the role of misinformation in sparking political incivility and toxicity on the social media platform Reddit. Utilizing the Google Jigsaw Perspective API to identify toxicity, hate speech, and other forms of incivility, we find that Reddit com… ▽ More How do users and communities respond to news from unreliable sources? How does news from these sources change online conversations? In this work, we examine the role of misinformation in sparking political incivility and toxicity on the social media platform Reddit. Utilizing the Google Jigsaw Perspective API to identify toxicity, hate speech, and other forms of incivility, we find that Reddit comments posted in response to articles on websites known to spread misinformation are 71.4% more likely to be toxic than comments responding to authentic news articles. Identifying specific instances of incivility and utilizing an exponential random graph model, we then show that when reacting to a misinformation story, Reddit users are more likely to be toxic to users of different political beliefs. Finally, utilizing a zero-inflated negative binomial regression, we identify that as the toxicity of subreddits increases, users are more likely to comment on misinformation-related submissions. △ Less

Submitted 19 July, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

arXiv:2301.10880 [pdf, other]

A Golden Age: Conspiracy Theories' Relationship with Misinformation Outlets, News Media, and the Wider Internet

Authors: Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

Abstract: Do we live in a "Golden Age of Conspiracy Theories?" In the last few decades, conspiracy theories have proliferated on the Internet with some having dangerous real-world consequences. A large contingent of those who participated in the January 6th attack on the US Capitol fervently believed in the QAnon conspiracy theory. In this work, we study the relationships amongst five prominent conspiracy t… ▽ More Do we live in a "Golden Age of Conspiracy Theories?" In the last few decades, conspiracy theories have proliferated on the Internet with some having dangerous real-world consequences. A large contingent of those who participated in the January 6th attack on the US Capitol fervently believed in the QAnon conspiracy theory. In this work, we study the relationships amongst five prominent conspiracy theories (QAnon, COVID, UFO/Aliens, 9/11, and Flat-Earth) and each of their respective relationships to the news media, both authentic news and misinformation. Identifying and publishing a set of 755 different conspiracy theory websites dedicated to our five conspiracy theories, we find that each set often hyperlinks to the same external domains, with COVID and QAnon conspiracy theory websites having the largest amount of shared connections. Examining the role of news media, we further find that not only do outlets known for spreading misinformation hyperlink to our set of conspiracy theory websites more often than authentic news websites but also that this hyperlinking increased dramatically between 2018 and 2021, with the advent of QAnon and the start of COVID-19 pandemic. Using partial Granger-causality, we uncover several positive correlative relationships between the hyperlinks from misinformation websites and the popularity of conspiracy theory websites, suggesting the prominent role that misinformation news outlets play in popularizing many conspiracy theories. △ Less

Submitted 13 November, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted to CSCW 2023; CSCW version

arXiv:2301.10856 [pdf, other]

Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram

Authors: Hans W. A. Hanley, Zakir Durumeric

Abstract: In response to disinformation and propaganda from Russian online media following the invasion of Ukraine, Russian media outlets such as Russia Today and Sputnik News were banned throughout Europe. To maintain viewership, many of these Russian outlets began to heavily promote their content on messaging services like Telegram. In this work, we study how 16 Russian media outlets interacted with and u… ▽ More In response to disinformation and propaganda from Russian online media following the invasion of Ukraine, Russian media outlets such as Russia Today and Sputnik News were banned throughout Europe. To maintain viewership, many of these Russian outlets began to heavily promote their content on messaging services like Telegram. In this work, we study how 16 Russian media outlets interacted with and utilized 732 Telegram channels throughout 2022. Leveraging the foundational model MPNet, DP-means clustering, and Hawkes processes, we trace how narratives spread between news sites and Telegram channels. We show that news outlets not only propagate existing narratives through Telegram but that they source material from the messaging platform. For example, across the websites in our study, between 2.3% (ura.news) and 26.7% (ukraina.ru) of articles discussed content that originated/resulted from activity on Telegram. Finally, tracking the spread of individual topics, we measure the rate at which news outlets and Telegram channels disseminate content within the Russian media ecosystem, finding that websites like ura.news and Telegram channels such as @genshab are the most effective at disseminating their content. △ Less

Submitted 27 May, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted to ICWSM 2024 (ICWSM version)

arXiv:2301.04841 [pdf, other]

LZR: Identifying Unexpected Internet Services

Authors: Liz Izhikevich, Renata Teixeira, Zakir Durumeric

Abstract: Internet-wide scanning is a commonly used research technique that has helped uncover real-world attacks, find cryptographic weaknesses, and understand both operator and miscreant behavior. Studies that employ scanning have largely assumed that services are hosted on their IANA-assigned ports, overlooking the study of services on unusual ports. In this work, we investigate where Internet services a… ▽ More Internet-wide scanning is a commonly used research technique that has helped uncover real-world attacks, find cryptographic weaknesses, and understand both operator and miscreant behavior. Studies that employ scanning have largely assumed that services are hosted on their IANA-assigned ports, overlooking the study of services on unusual ports. In this work, we investigate where Internet services are deployed in practice and evaluate the security posture of services on unexpected ports. We show protocol deployment is more diffuse than previously believed and that protocols run on many additional ports beyond their primary IANA-assigned port. For example, only 3% of HTTP and 6% of TLS services run on ports 80 and 443, respectively. Services on non-standard ports are more likely to be insecure, which results in studies dramatically underestimating the security posture of Internet hosts. Building on our observations, we introduce LZR ("Laser"), a system that identifies 99% of identifiable unexpected services in five handshakes and dramatically reduces the time needed to perform application-layer scans on ports with few responsive expected services (e.g., 5500% speedup on 27017/MongoDB). We conclude with recommendations for future studies. △ Less

Submitted 12 January, 2023; originally announced January 2023.

Comments: In 30th USENIX Security Symposium, 2021

arXiv:2301.03946 [pdf, other]

Hate Raids on Twitch: Echoes of the Past, New Modalities, and Implications for Platform Governance

Authors: Catherine Han, Joseph Seering, Deepak Kumar, Jeffrey T. Hancock, Zakir Durumeric

Abstract: In the summer of 2021, users on the livestreaming platform Twitch were targeted by a wave of "hate raids," a form of attack that overwhelms a streamer's chatroom with hateful messages, often through the use of bots and automation. Using a mixed-methods approach, we combine a quantitative measurement of attacks across the platform with interviews of streamers and third-party bot developers. We pres… ▽ More In the summer of 2021, users on the livestreaming platform Twitch were targeted by a wave of "hate raids," a form of attack that overwhelms a streamer's chatroom with hateful messages, often through the use of bots and automation. Using a mixed-methods approach, we combine a quantitative measurement of attacks across the platform with interviews of streamers and third-party bot developers. We present evidence that confirms that some hate raids were highly-targeted, hate-driven attacks, but we also observe another mode of hate raid similar to networked harassment and specific forms of subcultural trolling. We show that the streamers who self-identify as LGBTQ+ and/or Black were disproportionately targeted and that hate raid messages were most commonly rooted in anti-Black racism and antisemitism. We also document how these attacks elicited rapid community responses in both bolstering reactive moderation and develo** proactive mitigations for future attacks. We conclude by discussing how platforms can better prepare for attacks and protect at-risk communities while considering the division of labor between community moderators, tool-builders, and platforms. △ Less

Submitted 12 January, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

arXiv:2210.03016 [pdf, other]

"A Special Operation": A Quantitative Approach to Dissecting and Comparing Different Media Ecosystems' Coverage of the Russo-Ukrainian War

Authors: Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

Abstract: The coverage of the Russian invasion of Ukraine has varied widely between Western, Russian, and Chinese media ecosystems with propaganda, disinformation, and narrative spins present in all three. By utilizing the normalized pointwise mutual information metric, differential sentiment analysis, word2vec models, and partially labeled Dirichlet allocation, we present a quantitative analysis of the dif… ▽ More The coverage of the Russian invasion of Ukraine has varied widely between Western, Russian, and Chinese media ecosystems with propaganda, disinformation, and narrative spins present in all three. By utilizing the normalized pointwise mutual information metric, differential sentiment analysis, word2vec models, and partially labeled Dirichlet allocation, we present a quantitative analysis of the differences in coverage amongst these three news ecosystems. We find that while the Western press outlets have focused on the military and humanitarian aspects of the war, Russian media have focused on the purported justifications for the "special military operation" such as the presence in Ukraine of "bio-weapons" and "neo-nazis", and Chinese news media have concentrated on the conflict's diplomatic and economic consequences. Detecting the presence of several Russian disinformation narratives in the articles of several Chinese outlets, we finally measure the degree to which Russian media has influenced Chinese coverage across Chinese outlets' news articles, Weibo accounts, and Twitter accounts. Our analysis indicates that since the Russian invasion of Ukraine, Chinese state media outlets have increasingly cited Russian outlets as news sources and spread Russian disinformation narratives. △ Less

Submitted 31 May, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: Accepted to ICWSM 2023

arXiv:2209.02533 [pdf, other]

Understanding Longitudinal Behaviors of Toxic Accounts on Reddit

Authors: Deepak Kumar, Jeff Hancock, Kurt Thomas, Zakir Durumeric

Abstract: Toxic comments are the top form of hate and harassment experienced online. While many studies have investigated the types of toxic comments posted online, the effects that such content has on people, and the impact of potential defenses, no study has captured the long-term behaviors of the accounts that post toxic comments or how toxic comments are operationalized. In this paper, we present a long… ▽ More Toxic comments are the top form of hate and harassment experienced online. While many studies have investigated the types of toxic comments posted online, the effects that such content has on people, and the impact of potential defenses, no study has captured the long-term behaviors of the accounts that post toxic comments or how toxic comments are operationalized. In this paper, we present a longitudinal measurement study of 929K accounts that post toxic comments on Reddit over an 18~month period. Combined, these accounts posted over 14 million toxic comments that encompass insults, identity attacks, threats of violence, and sexual harassment. We explore the impact that these accounts have on Reddit, the targeting strategies that abusive accounts adopt, and the distinct patterns that distinguish classes of abusive accounts. Our analysis forms the foundation for new time-based and graph-based features that can improve automated detection of toxic behavior online and informs the nuanced interventions needed to address each class of abusive account. △ Less

Submitted 6 September, 2022; originally announced September 2022.

arXiv:2205.14484 [pdf, other]

Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit

Authors: Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

Abstract: In the buildup to and in the weeks following the Russian Federation's invasion of Ukraine, Russian state media outlets output torrents of misleading and outright false information. In this work, we study this coordinated information campaign in order to understand the most prominent state media narratives touted by the Russian government to English-speaking audiences. To do this, we first perform… ▽ More In the buildup to and in the weeks following the Russian Federation's invasion of Ukraine, Russian state media outlets output torrents of misleading and outright false information. In this work, we study this coordinated information campaign in order to understand the most prominent state media narratives touted by the Russian government to English-speaking audiences. To do this, we first perform sentence-level topic analysis using the large-language model MPNet on articles published by ten different pro-Russian propaganda websites including the new Russian "fact-checking" website waronfakes.com. Within this ecosystem, we show that smaller websites like katehon.com were highly effective at publishing topics that were later echoed by other Russian sites. After analyzing this set of Russian information narratives, we then analyze their correspondence with narratives and topics of discussion on the r/Russia and 10 other political subreddits. Using MPNet and a semantic search algorithm, we map these subreddits' comments to the set of topics extracted from our set of Russian websites, finding that 39.6% of r/Russia comments corresponded to narratives from pro-Russian propaganda websites compared to 8.86% on r/politics. △ Less

Submitted 30 May, 2023; v1 submitted 28 May, 2022; originally announced May 2022.

Comments: Accepted to ICWSM 2023

arXiv:2111.00703 [pdf, other]

An Empirical Analysis of HTTPS Configuration Security

Authors: Camelia Simoiu, Wilson Nguyen, Zakir Durumeric

Abstract: It is notoriously difficult to securely configure HTTPS, and poor server configurations have contributed to several attacks including the FREAK, Logjam, and POODLE attacks. In this work, we empirically evaluate the TLS security posture of popular websites and endeavor to understand the configuration decisions that operators make. We correlate several sources of influence on sites' security posture… ▽ More It is notoriously difficult to securely configure HTTPS, and poor server configurations have contributed to several attacks including the FREAK, Logjam, and POODLE attacks. In this work, we empirically evaluate the TLS security posture of popular websites and endeavor to understand the configuration decisions that operators make. We correlate several sources of influence on sites' security postures, including software defaults, cloud providers, and online recommendations. We find a fragmented web ecosystem: while most websites have secure configurations, this is largely due to major cloud providers that offer secure defaults. Individually configured servers are more often insecure than not. This may be in part because common resources available to individual operators -- server software defaults and online configuration guides -- are frequently insecure. Our findings highlight the importance of considering SaaS services separately from individually-configured sites in measurement studies, and the need for server software to ship with secure defaults. △ Less

Submitted 1 November, 2021; originally announced November 2021.

arXiv:2106.15715 [pdf, other]

No Calm in The Storm: Investigating QAnon Website Relationships

Authors: Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

Abstract: QAnon is a far-right conspiracy theory whose followers largely organize online. In this work, we use web crawls seeded from two of the largest QAnon hotbeds on the Internet, Voat and 8kun, to build a QAnon-centered domain-based hyperlink graph. We use this graph to identify, understand, and learn about the set of websites that spread QAnon content online. Specifically, we curate the largest list o… ▽ More QAnon is a far-right conspiracy theory whose followers largely organize online. In this work, we use web crawls seeded from two of the largest QAnon hotbeds on the Internet, Voat and 8kun, to build a QAnon-centered domain-based hyperlink graph. We use this graph to identify, understand, and learn about the set of websites that spread QAnon content online. Specifically, we curate the largest list of QAnon centered websites to date, from which we document the types of QAnon sites, their hosting providers, as well as their popularity. We further analyze QAnon websites' connection to mainstream news and misinformation online, highlighting the outsized role misinformation websites play in spreading the conspiracy. Finally, we leverage the observed relationship between QAnon and misinformation sites to build a highly accurate random forest classifier that distinguishes between misinformation and authentic news sites. Our results demonstrate new and effective ways to study the growing presence of conspiracy theories and misinformation on the Internet. △ Less

Submitted 31 March, 2022; v1 submitted 29 June, 2021; originally announced June 2021.

arXiv:2106.04511 [pdf, other]

Designing Toxic Content Classification for a Diversity of Perspectives

Authors: Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, Michael Bailey

Abstract: In this work, we demonstrate how existing classifiers for identifying toxic comments online fail to generalize to the diverse concerns of Internet users. We survey 17,280 participants to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. We find that groups historically at-risk of harassment - such as people who identi… ▽ More In this work, we demonstrate how existing classifiers for identifying toxic comments online fail to generalize to the diverse concerns of Internet users. We survey 17,280 participants to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. We find that groups historically at-risk of harassment - such as people who identify as LGBTQ+ or young adults - are more likely to to flag a random comment drawn from Reddit, Twitter, or 4chan as toxic, as are people who have personally experienced harassment in the past. Based on our findings, we show how current one-size-fits-all toxicity classification algorithms, like the Perspective API from Jigsaw, can improve in accuracy by 86% on average through personalized model tuning. Ultimately, we highlight current pitfalls and new design directions that can improve the equity and efficacy of toxic content classifiers for all users. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Showing 1–26 of 26 results for author: Durumeric, Z