Good Intentions, Risky Inventions: A Method for Assessing the Risks and Benefits of AI in Mobile and Wearable Uses

Authors: Marios Constantinides, Edyta Bogucka, Sanja Scepanovic, Daniele Quercia

Abstract: Integrating Artificial Intelligence (AI) into mobile and wearables offers numerous benefits at individual, societal, and environmental levels. Yet, it also spotlights concerns over emerging risks. Traditional assessments of risks and benefits have been sporadic, and often require costly expert analysis. We developed a semi-automatic method that leverages Large Language Models (LLMs) to identify AI uses in mobile and wearables, classify their risks based on the EU AI Act, and determine their benefits that align with globally recognized long-term sustainable development goals. A manual validation of our method by two experts in mobile and wearable technologies, a legal and compliance expert, and a cohort of nine individuals with legal backgrounds who were recruited from Prolific confirmed its accuracy to be over 85%. We uncovered that specific applications of mobile computing hold significant potential in improving well-being, safety, and social equality. However, these promising uses are linked to risks involving sensitive data, vulnerable groups, and automated decision-making. To avoid rejecting these risky yet impactful mobile and wearable uses, we propose a risk assessment checklist for the Mobile HCI community.

Submitted 12 July, 2024; originally announced July 2024.

Comments: 28 pages, 4 figures, 2 tables

arXiv:2401.02191 [pdf, other]

Characterizing Fake News Targeting Corporations

Authors: Ke Zhou, Sanja Scepanovic, Daniele Quercia

Abstract: Misinformation proliferates in the online sphere, with evident impacts on the political and social realms, influencing democratic discourse and posing risks to public health and safety. The corporate world is also a prime target for fake news dissemination. While recent studies have attempted to characterize corporate misinformation and its effects on companies, their findings often suffer from limitations due to qualitative or narrative approaches and a narrow focus on specific industries. To address this gap, we conducted an analysis utilizing social media quantitative methods and crowd-sourcing studies to investigate corporate misinformation across a diverse array of industries within the S&P 500 companies. Our study reveals that corporate misinformation encompasses topics such as products, politics, and societal issues. We discovered companies affected by fake news also get reputable news coverage but less social media attention, leading to heightened negativity in social media comments, diminished stock growth, and increased stress mentions among employee reviews. Additionally, we observe that a company is not targeted by fake news all the time, but there are particular times when a critical mass of fake news emerges. These findings hold significant implications for regulators, business leaders, and investors, emphasizing the necessity to vigilantly monitor the escalating phenomenon of corporate misinformation.

Submitted 4 January, 2024; originally announced January 2024.

Comments: Accepted in ICWSM 2024

arXiv:2307.04167 [pdf, other]

Dream Content Discovery from Reddit with an Unsupervised Mixed-Method Approach

Authors: Anubhab Das, Sanja Šćepanović, Luca Maria Aiello, Remington Mallett, Deirdre Barrett, Daniele Quercia

Abstract: Dreaming is a fundamental but not fully understood part of human experience that can shed light on our thought patterns. Traditional dream analysis practices, while popular and aided by over 130 unique scales and rating systems, have limitations. Mostly based on retrospective surveys or lab studies, they struggle to be applied on a large scale or to show the importance and connections between different dream themes. To overcome these issues, we developed a new, data-driven mixed-method approach for identifying topics in free-form dream reports through natural language processing. We tested this method on 44,213 dream reports from Reddit's r/Dreams subreddit, where we found 217 topics, grouped into 22 larger themes: the most extensive collection of dream topics to date. We validated our topics by comparing them to the widely-used Hall and van de Castle scale. Going beyond traditional scales, our method can find unique patterns in different dream types (like nightmares or recurring dreams), understand topic importance and connections, and observe changes in collective dream experiences over time and around major events, like the COVID-19 pandemic and the recent Russo-Ukrainian war. We envision that the applications of our method will provide valuable insights into the intricate nature of dreaming.

Submitted 9 July, 2023; originally announced July 2023.

Comments: 20 pages, 6 figures, 4 tables, 4 pages of supplementary information

ACM Class: H.4.0; K.4.0

arXiv:2304.11020 [pdf, other]

Heart Rate Extraction from Abdominal Audio Signals

Authors: Jake Stuchbury-Wass, Erika Bondareva, Kayla-Jade Butkow, Sanja Scepanovic, Zoran Radivojevic, Cecilia Mascolo

Abstract: Abdominal sounds (ABS) have been traditionally used for assessing gastrointestinal (GI) disorders. However, the assessment requires a trained medical professional to perform multiple abdominal auscultation sessions, which is resource-intense and may fail to provide an accurate picture of patients' continuous GI wellbeing. This has generated a technological interest in developing wearables for continuous capture of ABS, which enables a fuller picture of a patient's GI status to be obtained at reduced cost. This paper seeks to evaluate the feasibility of extracting heart rate (HR) from such ABS monitoring devices. The collection of HR directly from these devices would enable gathering vital signs alongside GI data without the need for additional wearable devices, providing further cost benefits and improving general usability. We utilised a dataset containing 104 hours of ABS audio, collected from the abdomen using an e-stethoscope, and electrocardiogram as ground truth. Our evaluation shows for the first time that we can successfully extract HR from audio collected from a wearable on the abdomen. As heart sounds collected from the abdomen suffer from significant noise from GI and respiratory tracts, we leverage wavelet denoising for improved heart beat detection. The mean absolute error of the algorithm for average HR is 3.4 BPM with mean directional error of -1.2 BPM over the whole dataset. A comparison to photoplethysmography-based wearable HR sensors shows that our approach exhibits comparable accuracy to consumer wrist-worn wearables for average and instantaneous heart rate.

Submitted 21 April, 2023; originally announced April 2023.

Comments: ICASSP 2023

arXiv:2205.10161 [pdf, other]

The role of the Big Geographic Sort in the circulation of misinformation among U.S. Reddit users

Authors: Lia Bozarth, Daniele Quercia, Licia Capra, Sanja Scepanovic

Abstract: Past research has attributed the online circulation of misinformation to two main factors - individual characteristics (e.g., a person's information literacy) and social media effects (e.g., algorithm-mediated information diffusion) - and has overlooked a third one: the critical mass created by the offline self-segregation of Americans into like-minded geographical regions such as states (a phenomenon called "The Big Sort"). We hypothesized that this latter factor matters for the online spreading of misinformation not least because online interactions, despite having the potential of being global, end up being localized: interaction probability is known to rapidly decay with distance. Upon analysis of more than 8M Reddit comments containing news links spanning four years, from January 2016 to December 2019, we found that Reddit did not work as a "hype machine" for misinformation (as opposed to what previous work reported for other platforms, circulation was not mainly caused by platform-facilitated network effects) but worked as a supply-and-demand system: misinformation news items scaled linearly with the number of users in each state (with a scaling exponent beta=1, and a goodness of fit R2 = 0.95). Furthermore, deviations from such a universal pattern were best explained by state-level personality and cultural factors (R2 = {0.12, 0.39}), rather than socioeconomic conditions (R2 = {0.15, 0.29}) or, as one would expect, political characteristics (R2 = {0.06, 0.21}). Higher-than-expected circulation of any type of news (including reputable news) was found in states characterised by residents who tend to be less diligent in terms of their personality (low in conscientiousness) and by loose cultures understating the importance of adherence to norms (low in cultural tightness).

Submitted 20 May, 2022; originally announced May 2022.

arXiv:2205.01217 [pdf, other]

Insider Stories: Analyzing Internal Sustainability Efforts of Major US Companies from Online Reviews

Authors: Indira Sen, Daniele Quercia, Licia Capra, Matteo Montecchi, Sanja Šćepanović

Abstract: It is hard to establish whether a company supports internal sustainability efforts (ISEs) like gender equality, diversity, and general staff welfare, not least because of a lack of methodologies operationalizing these internal sustainability practices, and of data honestly documenting such efforts. We developed and validated a six-dimension framework reflecting Internal Sustainability Efforts (ISEs), gathered more than 350K employee reviews of 104 major companies across the whole US for the years 2008-2020, and developed a deep-learning framework scoring these reviews in terms of the six ISEs. Commitment to ISEs manifested itself at micro-level -- companies scoring high in ISEs enjoyed high stock growth. This new conceptualization of ISEs offers both theoretical implications for the literature in corporate sustainability, and practical implications for companies and policymakers. To further explore these implications, researchers need to add potentially missing ISEs, to do so for more companies, and establish the causal relationship between company success and ISEs.

Submitted 13 April, 2023; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: 9 pages + 15 pages of appendix, to appear in Humanities & Social Sciences Communications

arXiv:2202.01176 [pdf]

doi 10.1098/rsos.211080

Epidemic Dreams: Dreaming about health during the COVID-19 pandemic

Authors: Sanja Šćepanović, Luca Maria Aiello, Deirdre Barrett, Daniele Quercia

Abstract: The continuity hypothesis of dreams suggests that the content of dreams is continuous with the dreamer's waking experiences. Given the unprecedented nature of the experiences during COVID-19, we studied the continuity hypothesis in the context of the pandemic. We implemented a deep-learning algorithm that can extract mentions of medical conditions from text and applied it to two datasets collected during the pandemic: 2,888 dream reports (dreaming life experiences), and 57M tweets mentioning the pandemic (waking life experiences). The health expressions common to both sets were typical COVID-19 symptoms (e.g., cough, fever, and anxiety), suggesting that dreams reflected people's real-world experiences. The health expressions that distinguished the two sets reflected differences in thought processes: expressions in waking life reflected a linear and logical thought process and, as such, described realistic symptoms or related disorders (e.g., nasal pain, SARS, H1N1); those in dreaming life reflected a thought process closer to the visual and emotional spheres and, as such, described either conditions unrelated to the virus (e.g., maggots, deformities, snakebites), or conditions of surreal nature (e.g., teeth falling out, body crumbling into sand). Our results confirm that dream reports represent an understudied yet valuable source of people's health experiences in the real world.

Submitted 2 February, 2022; originally announced February 2022.

arXiv:2109.05930 [pdf, other]

doi 10.1145/3432234

ComFeel: Productivity is a Matter of the Senses Too

Authors: Marios Constantinides, Sanja Šćepanović, Daniele Quercia, Hongwei Li, Ugo Sassi, Michael Eggleston

Abstract: Indoor environmental quality has been found to impact employees' productivity in the long run, yet its meeting-level impact in the short term is unclear. We studied the relationship between sensorial pleasantness of a meeting's room and the meeting's productivity. By administering a 28-item questionnaire to 363 online participants, we indeed found that three factors captured 62% of people's experience of meetings: (a) productivity; (b) psychological safety; and (c) room pleasantness. To measure room pleasantness, we developed and deployed ComFeel, an indoor environmental sensing infrastructure, which captures light, temperature, and gas resistance readings through miniaturized and unobtrusive devices we built and named 'Geckos'. Across 29 real-world meetings, using ComFeel, we collected 1373 minutes of readings. For each of these meetings, we also collected whether each participant felt the meeting to have been productive, the setting to be psychologically safe, and the meeting room to be pleasant. As one expects, we found that, on average, the probability of a meeting being productive increased by 35% for each standard deviation increase in the psychological safety participants experienced. Importantly, that probability increased by as much as 25% for each increase in room pleasantness, confirming the significant short-term impact of the indoor environment on meetings' productivity.

Submitted 13 September, 2021; originally announced September 2021.

Comments: 21 pages, 7 figures, 5 tables

Journal ref: IMWUT: 2020, 4(4), 123

arXiv:2107.12362 [pdf, other]

Pressure Test: Quantifying the impact of positive stress on companies from online employee reviews

Authors: Sanja Šćepanović, Marios Constantinides, Daniele Quercia, Seunghyun Kim

Abstract: Workplace stress is often considered to be negative, yet lab studies on individuals suggest that not all stress is bad. There are two types of stress: distress refers to harmful stimuli, while eustress refers to healthy, euphoric stimuli that create a sense of fulfillment and achievement. Telling the two types of stress apart is challenging, let alone quantifying their impact across corporations. By leveraging a dataset of 440K reviews about S&P 500 companies published during twelve successive years, we developed a deep learning framework to extract stress mentions from these reviews. We proposed a new methodology that places each company on a stress-by-rating quadrant (based on its overall stress score and overall rating on the site), and accordingly scores the company to be, on average, either a low stress, passive, negative stress, or positive stress company. We found that (former) employees of positive stress companies tended to describe high-growth and collaborative workplaces in their reviews, and that such companies' stock evaluations grew, on average, 5.1 times in 10 years (2009-2019) as opposed to the companies of the other three stress types that grew, on average, 3.7 times in the same time period. We also found that the four stress scores aggregated every year -- from 2008 to 2020 -- closely followed the unemployment rate in the U.S.: a year of positive stress (2008) was rapidly followed by several years of negative stress (2009-2015), which peaked during the Great Recession (2009-2011). These results suggest that automated analyses of the language used by employees on corporate social-networking tools offer yet another way of tracking workplace stress, allowing quantification of its impact on corporations.

Submitted 21 December, 2022; v1 submitted 26 July, 2021; originally announced July 2021.

Comments: 22 pages, 15 figures, 6 tables

ACM Class: H.4

arXiv:2103.01169 [pdf, other]

The Healthy States of America: Creating a Health Taxonomy with Social Media

Authors: Sanja Scepanovic, Luca Maria Aiello, Ke Zhou, Sagar Joglekar, Daniele Quercia

Abstract: Since the uptake of social media, researchers have mined online discussions to track the outbreak and evolution of specific diseases or chronic conditions such as influenza or depression. To broaden the set of diseases under study, we developed a Deep Learning tool for Natural Language Processing that extracts mentions of virtually any medical condition or disease from unstructured social media text. With that tool at hand, we processed Reddit and Twitter posts, analyzed the clusters of the two resulting co-occurrence networks of conditions, and discovered that they correspond to well-defined categories of medical conditions. This resulted in the creation of the first comprehensive taxonomy of medical conditions automatically derived from online discussions. We validated the structure of our taxonomy against the official International Statistical Classification of Diseases and Related Health Problems (ICD-11), finding matches of our clusters with 20 official categories, out of 22. Based on the mentions of our taxonomy's sub-categories on Reddit posts geo-referenced in the U.S., we were then able to compute disease-specific health scores. As opposed to counts of disease mentions or counts with no knowledge of our taxonomy's structure, we found that our disease-specific health scores are causally linked with the officially reported prevalence of 18 conditions.

Submitted 1 March, 2021; originally announced March 2021.

Comments: In proceedings of the International Conference on Web and Social Media (ICWSM'21)

arXiv:2102.00848 [pdf, other]

Jane Jacobs in the Sky: Predicting Urban Vitality with Open Satellite Data

Authors: Sanja Šćepanović, Sagar Joglekar, Stephen Law, Daniele Quercia

Abstract: The presence of people in an urban area throughout the day -- often called 'urban vitality' -- is one of the qualities world-class cities aspire to the most, yet it is one of the hardest to achieve. Back in the 1970s, Jane Jacobs theorized urban vitality and found that there are four conditions required for the promotion of life in cities: diversity of land use, small block sizes, the mix of economic activities, and concentration of people. To build proxies for those four conditions and ultimately test Jane Jacobs's theory at scale, researchers have had to collect both private and public data from a variety of sources, and that took decades. Here we propose the use of one single source of data, which happens to be publicly available: Sentinel-2 satellite imagery. In particular, since the first two conditions (diversity of land use and small block sizes) are visible to the naked eye from satellite imagery, we tested whether we could automatically extract them with a state-of-the-art deep-learning framework and whether, in the end, the extracted features could predict vitality. In six Italian cities for which we had call data records, we found that our framework is able to explain on average 55% of the variance in urban vitality extracted from those records.

Submitted 28 January, 2021; originally announced February 2021.

arXiv:2010.06296 [pdf, other]

Humane Visual AI: Telling the Stories Behind a Medical Condition

Authors: Wonyoung So, Edyta P. Bogucka, Sanja Šćepanović, Sagar Joglekar, Ke Zhou, Daniele Quercia

Abstract: A biological understanding is key for managing medical conditions, yet psychological and social aspects matter too. The main problem is that these two aspects are hard to quantify and inherently difficult to communicate. To quantify psychological aspects, this work mined around half a million Reddit posts in the sub-communities specialised in 14 medical conditions, and it did so with a new deep-learning framework. In so doing, it was able to associate mentions of medical conditions with those of emotions. To then quantify social aspects, this work designed a probabilistic approach that mines open prescription data from the National Health Service in England to compute the prevalence of drug prescriptions, and to relate such a prevalence to census data. To finally visually communicate each medical condition's biological, psychological, and social aspects through storytelling, we designed a narrative-style layered Martini Glass visualization. In a user study involving 52 participants, after interacting with our visualization, a considerable number of them changed their mind on previously held opinions: 10% gave more importance to the psychological aspects of medical conditions, and 27% were more favourable to the use of social media data in healthcare, suggesting the importance of persuasive elements in interactive visualizations.

Submitted 13 October, 2020; originally announced October 2020.

arXiv:2007.13169 [pdf, other]

How Epidemic Psychology Works on Twitter: Evolution of responses to the COVID-19 pandemic in the U.S.

Authors: Luca Maria Aiello, Daniele Quercia, Ke Zhou, Marios Constantinides, Sanja Šćepanović, Sagar Joglekar

Abstract: Disruptions resulting from an epidemic might often appear to amount to chaos but, in reality, can be understood in a systematic way through the lens of "epidemic psychology". According to Philip Strong, the founder of the sociological study of epidemic infectious diseases, not only is an epidemic biological; there is also the potential for three psycho-social epidemics: of fear, moralization, and action. This work empirically tests Strong's model at scale by studying the use of language of 122M tweets related to the COVID-19 pandemic posted in the U.S. during the whole year of 2020. On Twitter, we identified three distinct phases. Each of them is characterized by different regimes of the three psycho-social epidemics. In the refusal phase, users refused to accept reality despite the increasing number of deaths in other countries. In the anger phase (started after the announcement of the first death in the country), users' fear translated into anger about the looming feeling that things were about to change. Finally, in the acceptance phase, which began after the authorities imposed physical-distancing measures, users settled into a "new normal" for their daily activities. Overall, refusal to accept reality gradually died off as the year went on, while acceptance increasingly took hold. During 2020, as cases surged in waves, so did anger, re-emerging cyclically at each wave. Our real-time operationalization of Strong's model is designed in a way that makes it possible to embed epidemic psychology into real-time models (e.g., epidemiological and mobility models).

Submitted 20 July, 2021; v1 submitted 26 July, 2020; originally announced July 2020.

Comments: Humanities and Social Sciences Communications. 24 pages, 7 figures, 4 tables

ACM Class: H.4

arXiv:1912.05067 [pdf, other]

Wide-Area Land Cover Mapping with Sentinel-1 Imagery using Deep Learning Semantic Segmentation Models

Authors: Sanja Šćepanović, Oleg Antropov, Pekka Laurila, Yrjö Rauste, Vladimir Ignatenko, Jaan Praks

Abstract: Land cover mapping is essential to monitoring the environment and understanding the effects of human activities on it. The automatic approaches to land cover mapping (i.e., image segmentation) mostly used traditional machine learning that requires heuristic feature design. On natural images, deep learning has outperformed traditional machine learning approaches for image segmentation. On remote sensing images, recent studies demonstrate successful applications of specific deep learning models to small-scale land cover mapping tasks (e.g., to classify wetland complexes). However, it is not readily clear which of the existing models are the best candidates for which remote sensing task. In this study, we answer that question for mapping the fundamental land cover classes using satellite radar data. We took Sentinel-1 C-band SAR images available at no cost to users as representative data. The CORINE land cover map was used as a reference, and the models were trained to distinguish between the 5 major CORINE classes. We selected seven among the state-of-the-art semantic segmentation models so that they cover a diverse set of approaches: U-Net, DeepLabV3+, PSPNet, BiSeNet, SegNet, FC-DenseNet, and FRRN-B. The models were pre-trained on the ImageNet dataset and further fine-tuned in this study. All the models demonstrated solid performance with overall accuracy between 87.9% and 93.1%, and with good to very good agreement (kappa statistic between 0.75 and 0.86). The two best models were FC-DenseNet and SegNet, with the latter having a much smaller inference time. Overall, our results indicate that the semantic segmentation models are suitable for efficient wide-area mapping using satellite SAR imagery and also provide baseline accuracy against which the newly proposed models should be evaluated.

Submitted 23 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

arXiv:1707.06071 [pdf, other]

Malware distributions and graph structure of the Web

Authors: Sanja Šćepanović, Igor Mishkovski, Jukka Ruohonen, Frederick Ayala-Gómez, Tuomas Aura, Sami Hyrynsalmi

Abstract: Knowledge about the graph structure of the Web is important for understanding this complex socio-technical system and for devising proper policies supporting its future development. Knowledge about the differences between clean and malicious parts of the Web is important for understanding potential threats to its users and for devising protection mechanisms. In this study, we apply data science methods to a large crawl of surface and deep Web pages with the aim of increasing such knowledge. To accomplish this, we answer the following questions. Which theoretical distributions explain important local characteristics and network properties of websites? How are these characteristics and properties different between clean and malicious (malware-affected) websites? What is the prediction power of local characteristics and network properties to classify malware websites? To the best of our knowledge, this is the first large-scale study describing the differences in global properties between malicious and clean parts of the Web. In other words, our work is building on and bridging the gap between Web science that tackles large-scale graph representations and Web cyber security that is concerned with malicious activities on the Web. The results presented herein can also help antivirus vendors in devising approaches to improve their detection algorithms.

Submitted 19 July, 2017; originally announced July 2017.

arXiv:1606.08207 [pdf, other]

Semantic homophily in online communication: evidence from Twitter

Authors: Sanja Šćepanović, Igor Mishkovski, Bruno Gonçalves, Nguyen Trung Hieu, Pan Hui

Abstract: People are observed to assortatively connect on a set of traits. This phenomenon, termed assortative mixing or sometimes homophily, can be quantified through the assortativity coefficient in social networks. Uncovering the exact causes of strong assortative mixing found in social networks has been a research challenge. Among the main suggested causes from sociology are the tendency of similar individuals to connect (often itself referred to as homophily) and the social influence among already connected individuals. An important question to researchers and in practice can be tackled, as we present here: understanding the exact mechanisms of interplay between these tendencies and the underlying social network structure. Namely, in addition to the mentioned assortativity coefficient, there are several other static and temporal network properties and substructures that can be linked to the tendencies of homophily and social influence in the social network and we herein investigate those. Concretely, we tackle a computer-mediated communication network (based on Twitter mentions) and a particular type of assortative mixing that can be inferred from the semantic features of communication content that we term semantic homophily. Our work, to the best of our knowledge, is the first to offer an in-depth analysis of semantic homophily in a communication network and the interplay between them. We quantify diverse levels of semantic homophily, identify the semantic aspects that are the drivers of observed homophily, offer insights into its temporal evolution and finally, we present its intricate interplay with the communication network on Twitter. By analyzing these mechanisms we increase understanding of which semantic aspects shape human computer-mediated communication and how they do so.

Submitted 20 March, 2017; v1 submitted 27 June, 2016; originally announced June 2016.

Comments: 19 pages, 11 figures, 7 tables

Showing 1–16 of 16 results for author: Scepanovic, S