Search | arXiv e-print repository

Metacognitive AI: Framework and the Case for a Neurosymbolic Approach

Authors: Hua Wei, Paulo Shakarian, Christian Lebiere, Bruce Draper, Nikhil Krishnaswamy, Sergei Nirenburg

Abstract: Metacognition is the concept of reasoning about an agent's own internal processes and was originally introduced in the field of developmental psychology. In this position paper, we examine the concept of applying metacognition to artificial intelligence. We introduce a framework for understanding metacognitive artificial intelligence (AI) that we call TRAP: transparency, reasoning, adaptation, and perception. We discuss each of these aspects in turn and explore how neurosymbolic AI (NSAI) can be leveraged to address challenges of metacognition.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2405.10345 [pdf, other]

Machine Learning Driven Biomarker Selection for Medical Diagnosis

Authors: Divyagna Bavikadi, Ayushi Agarwal, Shashank Ganta, Yunro Chung, Lusheng Song, Ji Qiu, Paulo Shakarian

Abstract: Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, liver, and gastric cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potentially spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved to be the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.

Submitted 15 May, 2024; originally announced May 2024.
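The abstract above reports sensitivity with specificity fixed at 0.9. As a minimal sketch of how such a figure is commonly computed (the paper's own procedure is not given in this listing, so the function name, the thresholding rule, and the toy data are assumptions for illustration only), one can pick the score threshold that classifies at least 90% of negative cases correctly and then measure the fraction of positive cases above it:

```python
import math


def sensitivity_at_specificity(scores, labels, target_specificity=0.9):
    """Sensitivity at the lowest threshold whose specificity meets the target.

    Hypothetical helper: a case is predicted positive iff score > threshold.
    `labels` uses 1 for diseased and 0 for healthy cases.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative case")

    # Number of negatives that must fall at or below the threshold.
    # The small epsilon guards against floating-point round-up.
    needed = math.ceil(target_specificity * len(negatives) - 1e-9)
    threshold = negatives[needed - 1] if needed > 0 else float("-inf")

    # Fraction of true positives scored strictly above the threshold.
    return sum(s > threshold for s in positives) / len(positives)


# Toy data (made up): ten healthy cases and four diseased cases.
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
              0.85, 0.95, 0.5, 0.99]
toy_labels = [0] * 10 + [1] * 4
toy_result = sensitivity_at_specificity(toy_scores, toy_labels)
```

Fixing specificity first makes classifiers comparable at the same false-positive rate, which is why a single sensitivity number per method can be reported.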

arXiv:2310.06835 [pdf, other]

Scalable Semantic Non-Markovian Simulation Proxy for Reinforcement Learning

Authors: Kaustuv Mukherji, Devendra Parkar, Lahari Pokala, Dyuman Aditya, Paulo Shakarian, Clark Dorman

Abstract: Recent advances in reinforcement learning (RL) have shown much promise across a variety of applications. However, issues such as scalability, explainability, and Markovian assumptions limit its applicability in certain domains. We observe that many of these shortcomings emanate from the simulator as opposed to the RL training algorithms themselves. As such, we propose a semantic proxy for simulation based on a temporal extension to annotated logic. In comparison with two high-fidelity simulators, we show up to three orders of magnitude speed-up while preserving the quality of the learned policy. In addition, we show the ability to model and leverage non-Markovian dynamics and instantaneous actions while providing an explainable trace describing the outcomes of the agent's actions.

Submitted 14 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: Submitted to 2024 IEEE International Conference on Semantic Computing

arXiv:2308.14250 [pdf, other]

Rule-Based Error Detection and Correction to Operationalize Movement Trajectory Classification

Authors: Bowen Xi, Kevin Scaria, Paulo Shakarian

Abstract: Classification of movement trajectories has many applications in transportation. Supervised neural models represent the current state-of-the-art. Recent security applications require this task to be rapidly employed in environments that may differ from the data used to train such models and for which there is little training data. We provide a neuro-symbolic rule-based framework to conduct error detection and correction of these models to support eventual deployment in security applications. We provide a suite of experiments on several recent and state-of-the-art models and show an accuracy improvement of 1.7% over the SOTA model in the case where all classes are present in training. When 40% of classes are omitted from training, we obtain a 5.2% (zero-shot) and 23.9% (few-shot) improvement over the SOTA model without resorting to retraining of the base model.

Submitted 30 April, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

arXiv:2308.11189 [pdf, other]

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

Authors: Noel Ngu, Nathaniel Lee, Paulo Shakarian

Abstract: Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.

Submitted 22 August, 2023; originally announced August 2023.

Report number: Accepted to IEEE ICSC '24
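The entry above names entropy and Gini impurity as diversity measures over the responses to a single prompt. The following is a minimal sketch using the textbook definitions of both quantities on the empirical distribution of distinct answers; the paper's exact formulations are not given in this listing, so the function names and the exact-match grouping of answers are assumptions. The centroid-distance measure is omitted because it would require an embedding model.

```python
import math
from collections import Counter


def answer_distribution(responses):
    """Empirical probability of each distinct answer (exact string match)."""
    counts = Counter(responses)
    total = len(responses)
    return [count / total for count in counts.values()]


def entropy(responses):
    """Shannon entropy in bits; 0 when every sampled response agrees."""
    return -sum(p * math.log2(p) for p in answer_distribution(responses))


def gini_impurity(responses):
    """Gini impurity; 0 when every sampled response agrees."""
    return 1.0 - sum(p * p for p in answer_distribution(responses))


# Hypothetical samples for one prompt, e.g. drawn at non-zero temperature.
agreeing = ["12", "12", "12", "12"]
split = ["12", "15", "12", "15"]
```

Under this sketch, higher values indicate more disagreement among samples, which is the signal the abstract reports as correlating with failure.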

arXiv:2302.13814 [pdf, other]

An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)

Authors: Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu

Abstract: We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors about MWPs relating to the number of unknowns and number of operations lead to a higher probability of failure when compared with the prior, specifically noting (across all experiments) that the probability of failure increases linearly with the number of addition and subtraction operations. We have also released the dataset of ChatGPT's responses to the MWPs to support further work on the characterization of LLM performance and present baseline machine learning models to predict if ChatGPT can correctly answer an MWP.

Submitted 27 February, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Journal ref: AAAI Spring Symposium 2023 (MAKE)

arXiv:2302.13482 [pdf, other]

PyReason: Software for Open World Temporal Logic

Authors: Dyuman Aditya, Kaustuv Mukherji, Srikar Balasubramanian, Abhiraj Chaudhary, Paulo Shakarian

Abstract: The growing popularity of neuro symbolic reasoning has led to the adoption of various forms of differentiable (i.e., fuzzy) first order logic. We introduce PyReason, a software framework based on generalized annotated logic that both captures the current cohort of differentiable logics and temporal extensions to support inference over finite periods of time with capabilities for open world reasoning. Further, PyReason is implemented to directly support reasoning over graphical structures (e.g., knowledge graphs, social networks, biological networks, etc.), produces fully explainable traces of inference, and includes various practical features such as type checking and a memory-efficient implementation. This paper reviews various extensions of generalized annotated logic integrated into our implementation, our modern, efficient Python-based implementation that conducts exact yet scalable deductive inference, and a suite of experiments. PyReason is available at: github.com/lab-v2/pyreason.

Submitted 4 March, 2023; v1 submitted 26 February, 2023; originally announced February 2023.

Comments: Equal contributions from first two authors. Accepted at 2023 AAAI Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI: MAKE)

arXiv:2302.12195 [pdf, other]

Extensions to Generalized Annotated Logic and an Equivalent Neural Architecture

Authors: Paulo Shakarian, Gerardo I. Simari

Abstract: While deep neural networks have led to major advances in image recognition, language translation, data mining, and game playing, there are well-known limits to the paradigm such as lack of explainability, difficulty of incorporating prior knowledge, and modularity. Neuro symbolic hybrid systems have recently emerged as a straightforward way to extend deep neural networks by incorporating ideas from symbolic reasoning such as computational logic. In this paper, we propose a list of desirable criteria for neuro symbolic systems and examine how some of the existing approaches address these criteria. We then propose an extension to generalized annotated logic that allows for the creation of an equivalent neural architecture comprising an alternate neuro symbolic hybrid. However, unlike previous approaches that rely on continuous optimization for the training process, our framework is designed as a binarized neural network that uses discrete optimization. We provide proofs of correctness and discuss several of the challenges that must be overcome to realize this framework in an implemented system.

Submitted 23 February, 2023; originally announced February 2023.

Comments: Accepted to IEEE TransAI, 2022

arXiv:2209.15067 [pdf, ps, other]

Reasoning about Complex Networks: A Logic Programming Approach

Authors: Paulo Shakarian, Gerardo I. Simari, Devon Callahan

Abstract: Reasoning about complex networks has in recent years become an important topic of study due to its many applications: the adoption of commercial products, spread of disease, the diffusion of an idea, etc. In this paper, we present the MANCaLog language, a formalism based on logic programming that satisfies a set of desiderata proposed in previous work as recommendations for the development of approaches to reasoning in complex networks. To the best of our knowledge, this is the first formalism that satisfies all such criteria. We first focus on algorithms for finding minimal models (on which multi-attribute analysis can be done), and then on how this formalism can be applied in certain real world scenarios. Towards this end, we study the problem of deciding group membership in social networks: given a social network and a set of groups where group membership of only some of the individuals in the network is known, we wish to determine a degree of membership for the remaining group-individual pairs. We develop a prototype implementation that we use to obtain experimental results on two real world datasets, including a current social network of criminal gangs in a major U.S. city. We then show how the assignment of degree of membership to nodes in this case allows for a better understanding of the criminal gang problem when combined with other social network mining techniques, including detection of sub-groups and identification of core group members, which would not be possible without further identification of additional group members.

Submitted 29 September, 2022; originally announced September 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:1301.0302

arXiv:2001.04624 [pdf, other]

A Feature-Driven Approach for Identifying Pathogenic Social Media Accounts

Authors: Hamidreza Alvari, Ghazaleh Beigi, Soumajyoti Sarkar, Scott W. Ruston, Steven R. Corman, Hasan Davulcu, Paulo Shakarian

Abstract: Over the past few years, we have observed different media outlets' attempts to shift public opinion by framing information to support a narrative that facilitates their goals. Malicious users referred to as "pathogenic social media" (PSM) accounts are more likely to amplify this phenomenon by spreading misinformation to viral proportions. Understanding the spread of misinformation from an account-level perspective is thus a pressing problem. In this work, we aim to present a feature-driven approach to detect PSM accounts in social media. Inspired by the literature, we set out to assess PSMs from three broad perspectives: (1) user-related information (e.g., user activity, profile characteristics), (2) source-related information (i.e., information linked via URLs shared by users) and (3) content-related information (e.g., tweet characteristics). For the user-related information, we investigate malicious signals using causality analysis (i.e., whether a user is frequently a cause of viral cascades) and profile characteristics (e.g., number of followers, etc.). For the source-related information, we explore various malicious properties linked to URLs (e.g., URL address, content of the associated website, etc.). Finally, for the content-related information, we examine attributes (e.g., number of hashtags, suspicious hashtags, etc.) from tweets posted by users. Experiments on real-world Twitter data from different countries demonstrate the effectiveness of the proposed approach in identifying PSM users.

Submitted 13 January, 2020; originally announced January 2020.

arXiv:1909.11592 [pdf, other]

Mining user interaction patterns in the darkweb to predict enterprise cyber incidents

Authors: Soumajyoti Sarkar, Mohammad Almukaynizi, Jana Shakarian, Paulo Shakarian

Abstract: With the rise in security breaches over the past few years, there has been an increasing need to mine insights from social media platforms to raise alerts of possible attacks in an attempt to defend conflict during competition. In this study, we attempt to build a framework that utilizes unconventional signals from the darkweb forums by leveraging the reply network structure of user interactions with the goal of predicting enterprise related external cyber attacks. We use both unsupervised and supervised learning models that address the challenges that come with the lack of enterprise attack metadata for ground truth validation as well as insufficient data for training the models. We validate our models on a binary classification problem that attempts to predict cyber attacks on a daily basis for an organization. Using several controlled studies on features leveraging the network structure, we measure the extent to which the indicators from the darkweb forums can be successfully used to predict attacks. We use information from 53 forums in the darkweb over a span of 17 months for the task. Our framework, applied to predicting real world organization cyber attacks across 3 different security events, suggests that focusing on the reply path structure between groups of users based on random walk transitions and community structures performs better than relying solely on forum or user posting statistics prior to attacks.

Submitted 20 June, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

Comments: arXiv admin note: text overlap with arXiv:1811.06537

arXiv:1909.02872 [pdf, other]

Can social influence be exploited to compromise security: An online experimental evaluation

Authors: Soumajyoti Sarkar, Paulo Shakarian, Mika Armenta, Danielle Sanchez, Kiran Lakkaraju

Abstract: Social media has enabled users and organizations to obtain information about technology usage like software usage and even security feature usage. However, on the dark side it has also allowed an adversary to potentially exploit the users in a manner to either obtain information from them or influence them towards decisions that might have malicious settings or intents. While there have been substantial efforts toward understanding how social influence affects one's likelihood to adopt a security technology, especially its correlation with the number of friends adopting the same technology, in this study we investigate whether peer influence can dictate what users decide over and above their own knowledge. To this end, we manipulate social signal exposure in an online controlled experiment with human participants to investigate whether social influence can be harnessed in a negative way to steer users towards harmful security choices. We analyze this through a controlled game where each participant selects one option when presented with six security technologies with differing utilities, with one choice having the most utility. Over multiple rounds of the game, we observe that social influence as a tool can be quite powerful in manipulating a user's decision towards adoption of security technologies that are less efficient. However, what stands out more in the process is that the manner in which a user receives social signals from their peers decides the extent to which social influence can be successful in changing a user's behavior.

Submitted 4 September, 2019; originally announced September 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1909.01409

arXiv:1909.01409 [pdf, other]

doi 10.1371/journal.pone.0234875

Use of a controlled experiment and computational models to measure the impact of sequential peer exposures on decision making

Authors: Soumajyoti Sarkar, Ashkan Aleali, Paulo Shakarian, Mika Armenta, Danielle Sanchez, Kiran Lakkaraju

Abstract: It is widely believed that one's peers influence product adoption behaviors. This relationship has been linked to the number of signals a decision-maker receives in a social network. But it is unclear if these same principles hold when the pattern by which the decision-maker receives these signals varies and when peer influence is directed towards choices which are not optimal. To investigate that, we manipulate social signal exposure in an online controlled experiment using a game with human participants. Each participant in the game makes a decision among choices with differing utilities. We observe the following: (1) even in the presence of monetary risks and previously acquired knowledge of the choices, decision-makers tend to deviate from the obvious optimal decision when their peers make a similar decision, which we call the influence decision, (2) when the quantity of social signals varies over time, the forwarding probability of the influence decision, and therefore responsiveness to social influence, does not necessarily correlate proportionally to the absolute quantity of signals. To better understand how these rules of peer influence could be used in modeling applications of real world diffusion and in networked environments, we use our behavioral findings to simulate spreading dynamics in real world case studies. We specifically try to see how cumulative influence plays out in the presence of user uncertainty and measure its outcome on rumor diffusion, which we model as an example of sub-optimal choice diffusion. Together, our simulation results indicate that sequential peer effects from the influence decision overcome individual uncertainty to guide faster rumor diffusion over time. However, when the rate of diffusion is slow in the beginning, user uncertainty can have a substantial role compared to peer influence in deciding the adoption trajectory of a piece of questionable information.

Submitted 5 June, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

arXiv:1905.01556 [pdf, other]

Detecting Pathogenic Social Media Accounts without Content or Network Structure

Authors: Elham Shaabani, Ruocheng Guo, Paulo Shakarian

Abstract: The spread of harmful misinformation in social media is a pressing problem. We refer to accounts that have the capability of spreading such information to viral proportions as "Pathogenic Social Media" accounts. These accounts include terrorist supporter accounts, water armies, and fake news writers. We introduce an unsupervised causality-based framework that also leverages label propagation. This approach identifies these users without using network structure, cascade path information, content, or user information. We show our approach obtains higher precision (0.75) in identifying Pathogenic Social Media accounts in comparison with random (precision of 0.11) and existing bot detection (precision of 0.16) methods.

Submitted 4 May, 2019; originally announced May 2019.

Comments: 8 pages, 5 figures, International Conference on Data Intelligence and Security. arXiv admin note: text overlap with arXiv:1905.01553

arXiv:1905.01553 [pdf, other]

An End-to-End Framework to Identify Pathogenic Social Media Accounts on Twitter

Authors: Elham Shaabani, Ashkan Sadeghi-Mobarakeh, Hamidreza Alvari, Paulo Shakarian

Abstract: Pathogenic Social Media (PSM) accounts such as terrorist supporter accounts and fake news writers have the capability of spreading disinformation to viral proportions. Early detection of PSM accounts is crucial as they are likely to be key users to make malicious information "viral". In this paper, we adopt the causal inference framework along with graph-based metrics in order to distinguish PSMs from normal users within a short time of their activities. We propose both supervised and semi-supervised approaches without taking the network information and content into account. Results on a real-world dataset from Twitter accentuate the advantage of our proposed frameworks. We show our approach achieves a 0.28 improvement in F1 score over existing approaches, with a precision of 0.90 and an F1 score of 0.63.

Submitted 4 May, 2019; originally announced May 2019.

Comments: 9 pages, 8 figures, International Conference on Data Intelligence and Security. arXiv admin note: text overlap with arXiv:1905.01556

arXiv:1904.05161 [pdf, other]

Understanding Information Flow in Cascades Using Network Motifs

Authors: Soumajyoti Sarkar, Hamidreza Alvari, Paulo Shakarian

Abstract: A growing set of applications considers the process of network formation by using subgraphs as a tool for generating the network topology. One of the pressing research challenges is thus to be able to use these subgraphs to understand the network topology of information cascades, which ultimately paves the way to theorize about how information spreads over time. In this paper, we make the first attempt at using network motifs to understand whether or not they can be used as generative elements for the diffusion network organization during different phases of the cascade lifecycle. In doing so, we propose a motif percolation-based algorithm that uses network motifs to measure the extent to which they can represent the temporal cascade network organization. We compare two phases of the cascade lifecycle from the perspective of diffusion: the phase of steep growth and the phase of inhibition prior to its saturation. Our experiments on a set of cascades from the Weibo platform and with 5-node motifs demonstrate that there are only a few specific motif patterns with triads that are able to characterize the spreading process and hence the network organization during the inhibition region better than during the phase of high growth. In contrast, we do not find compelling results for the phase of steep growth.

Submitted 8 April, 2019; originally announced April 2019.

Comments: arXiv admin note: text overlap with arXiv:1903.00862

arXiv:1903.01693 [pdf, other]

Less is More: Semi-Supervised Causal Inference for Detecting Pathogenic Users in Social Media

Authors: Hamidreza Alvari, Elham Shaabani, Soumajyoti Sarkar, Ghazaleh Beigi, Paulo Shakarian

Abstract: Recent years have witnessed a surge of manipulation of public opinion and political events by malicious social media actors. These users are referred to as "Pathogenic Social Media (PSM)" accounts. PSMs are key users in spreading misinformation in social media to viral proportions. These accounts can be either controlled by real users or automated bots. Identification of PSMs is thus of utmost importance for social media authorities. The burden usually falls to automatic approaches that can identify these accounts and protect social media reputation. However, lack of sufficient labeled examples for devising and training sophisticated approaches to combat these accounts is still one of the foremost challenges facing social media firms. In contrast, unlabeled data is abundant and cheap to obtain thanks to massive user-generated data. In this paper, we propose a semi-supervised causal inference PSM detection framework, SemiPsm, to compensate for the lack of labeled data. In particular, the proposed method leverages unlabeled data in the form of manifold regularization and only relies on cascade information. This is in contrast to the existing approaches that use exhaustive feature engineering (e.g., profile information, network structure, etc.). Evidence from empirical experiments on a real-world ISIS-related dataset from Twitter suggests promising results of utilizing unlabeled instances for detecting PSMs.

Submitted 5 March, 2019; originally announced March 2019.

Comments: Companion Proceedings of the 2019 World Wide Web Conference

arXiv:1903.00862 [pdf, other]

Using network motifs to characterize temporal network evolution leading to diffusion inhibition

Authors: Soumajyoti Sarkar, Ruocheng Guo, Paulo Shakarian

Abstract: Network motifs are patterns of over-represented node interactions in a network which have been previously used as building blocks to understand various aspects of social networks. In this paper, we use motif patterns to characterize the information diffusion process in social networks. We study the lifecycle of information cascades to understand what leads to saturation of growth in terms of cascade reshares, thereby resulting in expiration, an event we call "diffusion inhibition". In an attempt to understand what causes inhibition, we use motifs to dissect the network obtained from information cascades coupled with traces of historical diffusion or social network links. Our main results follow from experiments on a dataset of cascades from the Weibo platform and the Flixster movie ratings. We observe the temporal counts of 5-node undirected motifs from the cascade temporal networks leading to the inhibition stage. Empirical evidence from the analysis leads us to conclude the following about stages preceding inhibition: (1) individuals tend to adopt information more from users they have known in the past through social networks or previous interactions, thereby creating patterns containing triads more frequently than acyclic patterns with linear chains, and (2) users need multiple exposures or rounds of social reinforcement to adopt a piece of information, and as a result information starts spreading slowly, thereby leading to the death of the cascade.
Following these observations, we use motif-based features to predict the edge cardinality of the network exhibited at the time of inhibition. We test features of motif patterns by using regression models for both individual patterns and their combination, and we find that motifs as features are better predictors of the future network organization than individual node centralities.

Submitted 3 March, 2019; originally announced March 2019.

arXiv:1902.10366 [pdf, other]

Leveraging Motifs to Model the Temporal Dynamics of Diffusion Networks

Authors: Soumajyoti Sarkar, Hamidreza Alvari, Paulo Shakarian

Abstract: Information diffusion mechanisms based on social influence models are mainly studied using likelihood of adoption when active neighbors expose a user to a message. The problem arises primarily from the fact that, for the most part, the explicit information of who-exposed-whom among a group of active neighbors in a social network, before a susceptible node is infected, is not available. In this paper, we attempt to understand the diffusion process through information cascades by studying the temporal network structure of the cascades. In doing so, we accommodate the effect of exposures from active neighbors of a node through a network pruning technique that leverages network motifs to identify potential infectors responsible for exposures from among those active neighbors. We attempt to evaluate the effectiveness of the components used in modeling cascade dynamics and especially whether the additional effect of the exposure information is useful. Following this model, we develop an inference algorithm, namely InferCut, that uses parameters learned from the model and the exposure information to predict the actual parent node of each potentially susceptible user in a given cascade. Empirical evaluation on a real-world dataset from the Weibo social network demonstrates the significance of incorporating exposure information in recovering the exact parents of the exposed users at the early stages of the diffusion process.

Submitted 22 March, 2020; v1 submitted 27 February, 2019; originally announced February 2019.

arXiv:1902.01970 [pdf, other]

Hawkes Process for Understanding the Influence of Pathogenic Social Media Accounts

Authors: Hamidreza Alvari, Paulo Shakarian

Abstract: Over the past years, political events and public opinion on the Web have been allegedly manipulated by accounts dedicated to spreading disinformation and performing malicious activities on social media. These accounts, hereafter referred to as "Pathogenic Social Media (PSM)" accounts, are often controlled by terrorist supporters, water armies or fake news writers and hence can pose threats to social media and the general public. Understanding and analyzing PSMs could help social media firms devise sophisticated and automated techniques that could be deployed to stop them from reaching their audience and consequently reduce their threat. In this paper, we leverage the well-known statistical technique "Hawkes Process" to quantify the influence of PSM accounts on the dissemination of malicious information on social media platforms. Our findings on a real-world ISIS-related dataset from Twitter indicate that PSMs are significantly different from regular users in making a message viral. Specifically, we observed that PSMs do not usually post URLs from mainstream news sources. Instead, their tweets usually have a large impact on the audience if they contain URLs from Facebook and alternative news outlets. In contrast, tweets posted by regular users receive nearly equal impressions regardless of the posted URLs and their sources. Our findings can further shed light on understanding and detecting PSM accounts.

Submitted 5 February, 2019; originally announced February 2019.

Comments: IEEE Conference on Data Intelligence and Security (ICDIS) 2019

arXiv:1902.01577 [pdf, other]

Detection of Violent Extremists in Social Media

Authors: Hamidreza Alvari, Soumajyoti Sarkar, Paulo Shakarian

Abstract: The ease of use of the Internet has enabled violent extremists such as the Islamic State of Iraq and Syria (ISIS) to easily reach a large audience, build personal relationships and increase recruitment. Social media platforms rely primarily on the reports they receive from their own users to mitigate the problem. Despite the efforts of social media platforms in suspending many accounts, this solution is not guaranteed to be effective, because not all extremists are caught this way, or they can simply return with another account or migrate to other social networks. In this paper, we design an automatic detection scheme that, using as few as three groups of information related to usernames, profile, and textual content of users, determines whether or not a given username belongs to an extremist user. We first demonstrate that extremists are inclined to adopt usernames that are similar to the ones that like-minded users have adopted in the past. We then propose a detection framework that deploys features which are highly indicative of potential online extremism. Results on a real-world ISIS-related dataset from Twitter demonstrate the effectiveness of the methodology in identifying extremist users.

Submitted 5 February, 2019; originally announced February 2019.

Comments: IEEE Conference on Data Intelligence and Security (ICDIS) 2019

arXiv:1811.06537 [pdf, other]

Predicting enterprise cyber incidents using social network analysis on the darkweb hacker forums

Authors: Soumajyoti Sarkar, Mohammad Almukaynizi, Jana Shakarian, Paulo Shakarian

Abstract: With the rise in security breaches over the past few years, there has been an increasing need to mine insights from social media platforms to raise alerts of possible attacks in an attempt to defend conflict during competition. We use information from the darkweb forums by leveraging the reply network structure of user interactions with the goal of predicting enterprise cyber attacks. We use a suite of social network features on top of supervised learning models and validate them on a binary classification problem that attempts to predict whether there would be an attack on any given day for an organization. Using information from 53 forums in the darkweb over a span of 12 months to predict real-world organization cyber attacks of 2 different security events, we conclude from our experiments that analyzing the path structure between groups of users is better than just studying network centralities like PageRank or relying on the user posting statistics in the forums.

Submitted 15 November, 2018; originally announced November 2018.

Comments: 7 pages

arXiv:1810.12906 [pdf, other]

Finding Cryptocurrency Attack Indicators Using Temporal Logic and Darkweb Data

Authors: Mohammed Almukaynizi, Vivin Paliath, Malay Shah, Malav Shah, Paulo Shakarian

Abstract: With the recent prevalence of darkweb/deepweb (D2web) sites specializing in the trade of exploit kits and malware, malicious actors have easy access to a wide range of tools that can empower their offensive capability. In this study, we apply concepts from causal reasoning, itemset mining, and logic programming on historical cryptocurrency-related cyber incidents with intelligence collected from over 400 D2web hacker forums. Our goal was to find indicators of cyber threats targeting cryptocurrency traders and exchange platforms from hacker activity. Our approach found interesting activities such that, when they are observed together in the D2web, subsequent cryptocurrency-related incidents are at least twice as likely to occur as they would be if no activity was observed. We also present an algorithmic extension to a previously introduced algorithm called APT-Extract that allows modeling new semantic structures that are specific to our application.

Submitted 29 October, 2018; originally announced October 2018.

arXiv:1810.12492 [pdf, other]

DARKMENTION: A Deployed System to Predict Enterprise-Targeted External Cyberattacks

Authors: Mohammed Almukaynizi, Ericsson Marin, Eric Nunes, Paulo Shakarian, Gerardo I. Simari, Dipsy Kapoor, Timothy Siedlecki

Abstract: Recent incidents of data breaches call for organizations to proactively identify cyber attacks on their systems. Darkweb/Deepweb (D2web) forums and marketplaces provide environments where hackers anonymously discuss existing vulnerabilities and commercialize malicious software to exploit those vulnerabilities. These platforms offer security practitioners a threat intelligence environment that allows mining for patterns related to organization-targeted cyber attacks. In this paper, we describe a system (called DARKMENTION) that learns association rules correlating indicators of attacks from D2web to real-world cyber incidents. Using the learned rules, DARKMENTION generates and submits warnings to a Security Operations Center (SOC) prior to attacks. Our goal was to design a system that automatically generates enterprise-targeted warnings that are timely, actionable, accurate, and transparent. We show that DARKMENTION meets our goal. In particular, we show that it outperforms baseline systems that attempt to generate warnings of cyber attacks related to two enterprises with an average increase in F1 score of about 45% and 57%. Additionally, DARKMENTION was deployed as part of a larger system that is built under a contract with the IARPA Cyber-attack Automated Unconventional Sensor Environment (CAUSE) program. It is actively producing warnings that precede attacks by an average of 3 days.

Submitted 29 October, 2018; originally announced October 2018.

arXiv:1809.09331 [pdf, other]

Early Identification of Pathogenic Social Media Accounts

Authors: Hamidreza Alvari, Elham Shaabani, Paulo Shakarian

Abstract: Pathogenic Social Media (PSM) accounts such as terrorist supporters exploit large communities of supporters for conducting attacks on social media. Early detection of these accounts is crucial as they are highly likely to be key users in making a harmful message "viral". In this paper, we make the first attempt at utilizing causal inference to identify PSMs within a short time frame around their activity. We propose a time-decay causality metric and incorporate it into a causal community detection-based algorithm. The proposed algorithm is applied to groups of accounts sharing similar causality features and is followed by a classification algorithm to classify accounts as PSM or not. Unlike existing techniques that take significant time to collect information such as network, cascade path, or content, our scheme relies solely on the action log of users. Results on a real-world dataset from Twitter demonstrate the effectiveness and efficiency of our approach. We achieved precision of 0.84 for detecting PSMs only based on their first 10 days of activity; the misclassified accounts were then detected 10 days later.

Submitted 26 September, 2018; v1 submitted 25 September, 2018; originally announced September 2018.

Comments: IEEE Intelligence and Security Informatics (ISI) 2018

arXiv:1809.06050 [pdf, other]

doi 10.1007/s13278-017-0475-9

Understanding and forecasting lifecycle events in information cascades

Authors: Soumajyoti Sarkar, Ruocheng Guo, Paulo Shakarian

Abstract: Most social network sites allow users to reshare a piece of information posted by a user. As time progresses, the cascade of reshares grows, eventually saturating after a certain time period. While previous studies have focused heavily on one aspect of the cascade phenomenon, specifically predicting when the cascade would go viral, in this paper, we take a more holistic approach by analyzing the occurrence of two events within the cascade lifecycle - the period of maximum growth in terms of surge in reshares and the period where the cascade starts declining in adoption. We address the challenges in identifying these periods and then proceed to make a comparative analysis of these periods from the perspective of network topology. We study the effect of several node-centric structural measures on the reshare responses using Granger causality which helps us quantify the significance of the network measures and understand the extent to which the network topology impacts the growth dynamics. This evaluation is performed on a dataset of 7407 cascades extracted from the Weibo social network. Using our causality framework, we found that an entropy measure based on nodal degree causally affects the occurrence of these events in 93.95% of cascades. Surprisingly, this outperformed clustering coefficient and PageRank which we hypothesized would be more indicative of the growth dynamics based on earlier studies. We also extend the Granger-causality Vector Autoregression (VAR) model to forecast the times at which the events occur in the cascade lifecycle.

Submitted 22 March, 2020; v1 submitted 17 September, 2018; originally announced September 2018.

Journal ref: Social Network Analysis and Mining 7.1 (2017): 55

arXiv:1806.09787 [pdf, ps, other]

Causal Inference for Early Detection of Pathogenic Social Media Accounts

Authors: Hamidreza Alvari, Paulo Shakarian

Abstract: Pathogenic social media (PSM) accounts such as terrorist supporters exploit communities of supporters for conducting attacks on social media. Early detection of PSM accounts is crucial as they are likely to be key users in making a harmful message "viral". This paper overviews my recent doctoral work on utilizing causal inference to identify PSM accounts within a short time frame around their activity. The proposed scheme (1) assigns time-decay causality scores to users, (2) applies a community detection-based algorithm to groups of users sharing similar causality scores and finally (3) deploys a classification algorithm to classify accounts. Unlike existing techniques that require network structure, cascade path, or content, our scheme relies solely on the action log of users.

Submitted 3 August, 2018; v1 submitted 26 June, 2018; originally announced June 2018.

Comments: Doctoral Consortium - 2018 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation

arXiv:1801.09781 [pdf, other]

Early Warnings of Cyber Threats in Online Discussions

Authors: Anna Sapienza, Alessandro Bessi, Saranya Damodaran, Paulo Shakarian, Kristina Lerman, Emilio Ferrara

Abstract: We introduce a system for automatically generating warnings of imminent or current cyber-threats. Our system leverages the communication of malicious actors on the darkweb, as well as activity of cyber security experts on social media platforms like Twitter. In a time period between September, 2016 and January, 2017, our method generated 661 alerts of which about 84% were relevant to current or imminent cyber-threats. In the paper, we first illustrate the rationale and workflow of our system, then we measure its performance. Our analysis is enriched by two case studies: the first shows how the method could predict DDoS attacks, and how it would have allowed organizations to prepare for the Mirai attacks that caused widespread disruption in October 2016. Second, we discuss the method's timely identification of various instances of data breaches.

Submitted 29 January, 2018; originally announced January 2018.

Journal ref: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp:667-674, 2017

arXiv:1712.09133 [pdf, other]

Strongly Hierarchical Factorization Machines and ANOVA Kernel Regression

Authors: Ruocheng Guo, Hamidreza Alvari, Paulo Shakarian

Abstract: High-order parametric models that include terms for feature interactions are applied to various data mining tasks, where ground truth depends on interactions of features. However, with sparse data, the high-dimensional parameters for feature interactions often face three issues: expensive computation, difficulty in parameter estimation and lack of structure. Previous work has proposed approaches which can partially resolve the three issues. In particular, models with factorized parameters (e.g. Factorization Machines) and sparse learning algorithms (e.g. FTRL-Proximal) can tackle the first two issues but fail to address the third. Regarding unstructured parameters, constraints or complicated regularization terms are applied such that hierarchical structures can be imposed. However, these methods make the optimization problem more challenging. In this work, we propose Strongly Hierarchical Factorization Machines and ANOVA kernel regression where all three issues can be addressed without making the optimization problem more difficult. Experimental results show the proposed models significantly outperform the state-of-the-art in two data mining tasks: cold-start user response time prediction and stock volatility prediction.

Submitted 5 January, 2018; v1 submitted 25 December, 2017; originally announced December 2017.

Comments: 9 pages, to appear in SDM'18

arXiv:1705.10786 [pdf, other]

Semi-Supervised Learning for Detecting Human Trafficking

Authors: Hamidreza Alvari, Paulo Shakarian, J. E. Kelly Snyder

Abstract: Human trafficking is one of the most atrocious crimes and among the challenging problems facing law enforcement which demands attention of global magnitude. In this study, we leverage textual data from the website "Backpage", used for classified advertisements, to discern potential patterns of human trafficking activities which manifest online and identify advertisements of high interest to law enforcement. Due to the lack of ground truth, we rely on a human analyst from law enforcement for hand-labeling a small portion of the crawled data. We extend the existing Laplacian SVM and present S3VM-R, by adding a regularization term to exploit exogenous information embedded in our feature space in favor of the task at hand. We train the proposed method using labeled and unlabeled data and evaluate it on a fraction of the unlabeled data, herein referred to as unseen data, with our expert's further verification. Results from comparisons between our method and other semi-supervised and supervised approaches on the labeled data demonstrate that our learner is effective in identifying advertisements of high interest to law enforcement.

Submitted 30 May, 2017; originally announced May 2017.

arXiv:1705.02399 [pdf, other]

Temporal Analysis of Influence to Predict Users' Adoption in Online Social Networks

Authors: Ericsson Marin, Ruocheng Guo, Paulo Shakarian

Abstract: Different measures have been proposed to predict whether individuals will adopt a new behavior in online social networks, given the influence produced by their neighbors. In this paper, we show one can achieve significant improvement over these standard measures, extending them to consider a pair of time constraints. These constraints provide a better proxy for social influence, showing a stronger correlation to the probability of influence as well as the ability to predict influence.

Submitted 5 May, 2017; originally announced May 2017.

Comments: 6 pages, 2 figures, 2017 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS 2017). July 5 - 8, 2017

arXiv:1608.02646 [pdf, other]

Toward Early and Order-of-Magnitude Cascade Prediction in Social Networks

Authors: Ruocheng Guo, Elham Shaabani, Abhinav Bhatnagar, Paulo Shakarian

Abstract: When a piece of information (microblog, photograph, video, link, etc.) starts to spread in a social network, an important question arises: will it spread to viral proportions, where viral can be defined as an order-of-magnitude increase? However, several previous studies have established that cascade size and frequency are related through a power-law, which leads to a severe imbalance in this classification problem. In this paper, we devise a suite of measurements based on structural diversity, the variety of social contexts (communities) in which individuals partaking in a given cascade engage. We demonstrate these measures are able to distinguish viral from non-viral cascades, despite the severe imbalance of the data for this problem. Further, we leverage these measurements as features in a classification approach, successfully predicting microblogs that grow from 50 to 500 reposts with precision of 0.69 and recall of 0.52 for the viral class, despite this class comprising under 2% of samples. This significantly outperforms our baseline approach as well as the current state-of-the-art. We also show this approach performs well for identifying if cascades observed for 60 minutes will grow to 500 reposts, and we demonstrate how we can trade off between precision and recall.

Submitted 8 August, 2016; originally announced August 2016.

Comments: 27 pages, 17 figures, accepted by SNAM (Social Network Analysis and Mining)

arXiv:1607.08691 [pdf, other]

A Non-Parametric Learning Approach to Identify Online Human Trafficking

Authors: Hamidreza Alvari, Paulo Shakarian, J. E. Kelly Snyder

Abstract: Human trafficking is among the most challenging law enforcement problems and demands a persistent fight from all over the globe. In this study, we leverage readily available data from the website "Backpage", used for classified advertisements, to discern potential patterns of human trafficking activities which manifest online and identify the most likely trafficking-related advertisements. Due to the lack of ground truth, we rely on two human analysts, one a human trafficking survivor and one from law enforcement, for hand-labeling a small portion of the crawled data. We then present a semi-supervised learning approach that is trained on the available labeled and unlabeled data and evaluated on unseen data with further verification by experts.

Submitted 1 August, 2016; v1 submitted 29 July, 2016; originally announced July 2016.

Comments: Accepted in IEEE Intelligence and Security Informatics 2016 Conference (ISI 2016)

arXiv:1607.08583 [pdf, other]

Darknet and Deepnet Mining for Proactive Cybersecurity Threat Intelligence

Authors: Eric Nunes, Ahmad Diab, Andrew Gunn, Ericsson Marin, Vineet Mishra, Vivin Paliath, John Robertson, Jana Shakarian, Amanda Thart, Paulo Shakarian

Abstract: In this paper, we present an operational system for cyber threat intelligence gathering from various social platforms on the Internet, particularly sites on the darknet and deepnet. We focus our attention on collecting information from hacker forum discussions and marketplaces offering products and services focusing on malicious hacking. We have developed an operational system for obtaining information from these sites for the purposes of identifying emerging cyber threats. Currently, this system collects on average 305 high-quality cyber threat warnings each week. These threat warnings include information on newly developed malware and exploits that have not yet been deployed in a cyber-attack. This provides a significant service to cyber-defenders. The system is significantly augmented through the use of various data mining and machine learning techniques. With the use of machine learning models, we are able to recall 92% of products in marketplaces and 80% of discussions on forums relating to malicious hacking with high precision. We perform preliminary analysis on the data collected, demonstrating its application to aid a security expert for better threat analysis.

Submitted 28 July, 2016; originally announced July 2016.

Comments: 6 page paper accepted to be presented at IEEE Intelligence and Security Informatics 2016 Tucson, Arizona USA September 27-30, 2016

arXiv:1607.08580 [pdf]

MIST: Missing Person Intelligence Synthesis Toolkit

Authors: Elham Shaabani, Hamidreza Alvari, Paulo Shakarian, J. E. Kelly Snyder

Abstract: Each day, approximately 500 missing persons cases occur that go unsolved/unresolved in the United States. The non-profit organization known as the Find Me Group (FMG), led by former law enforcement professionals, is dedicated to solving or resolving these cases. This paper introduces the Missing Person Intelligence Synthesis Toolkit (MIST) which leverages a data-driven variant of geospatial abductive inference. This system takes search locations provided by a group of experts and rank-orders them based on the probability assigned to areas based on the prior performance of the experts taken as a group. We evaluate our approach compared to the current practices employed by the Find Me Group and found it significantly reduces the search area - leading to a reduction of 31 square miles over 24 cases we examined in our experiments. Currently, we are using MIST to aid the Find Me Group in an active missing person case.

Submitted 29 August, 2016; v1 submitted 28 July, 2016; originally announced July 2016.

Comments: 10 pages, 12 figures, Accepted in CIKM 2016

ACM Class: I.2.1; J.4; G.1.6

arXiv:1607.07903 [pdf, other]

Product Offerings in Malicious Hacker Markets

Authors: Ericsson Marin, Ahmad Diab, Paulo Shakarian

Abstract: Marketplaces specializing in malicious hacking products - including malware and exploits - have recently become more prominent on the darkweb and deepweb. We scrape 17 such sites and collect information about such products in a unified database schema. Using a combination of manual labeling and unsupervised clustering, we examine a corpus of products in order to understand their various categories and how they become specialized with respect to vendor and marketplace. This initial study presents how we effectively employed unsupervised techniques to this data as well as the types of insights we gained on various categories of malicious hacking products.

Submitted 26 July, 2016; originally announced July 2016.

Comments: 3 pages, 1 figure, 3 tables. Accepted for publication in IEEE Intelligence and Security Informatics (ISI2016)

arXiv:1607.02171 [pdf, other]

Argumentation Models for Cyber Attribution

Authors: Eric Nunes, Paulo Shakarian, Gerardo I. Simari, Andrew Ruef

Abstract: A major challenge in cyber-threat analysis is combining information from different sources to find the person or the group responsible for the cyber-attack. It is one of the most important technical and policy challenges in cyber-security. The lack of ground truth for an individual responsible for an attack has limited previous studies. In this paper, we take a first step towards overcoming this limitation by building a dataset from the capture-the-flag event held at DEFCON, and propose an argumentation model based on a formal reasoning framework called DeLP (Defeasible Logic Programming) designed to aid an analyst in attributing a cyber-attack. We build models from latent variables to reduce the search space of culprits (attackers), and show that this reduction significantly improves the performance of classification-based approaches from 37% to 62% in identifying the attacker.

Submitted 7 July, 2016; originally announced July 2016.

Comments: 8 pages paper to be presented at International Symposium on Foundations of Open Source Intelligence and Security Informatics (FOSINT-SI) 2016 In conjunction with ASONAM 2016 San Francisco, CA, USA, August 19-20, 2016

arXiv:1607.00720 [pdf, other]

An Empirical Evaluation Of Social Influence Metrics

Authors: Nikhil Kumar, Ruocheng Guo, Ashkan Aleali, Paulo Shakarian

Abstract: Predicting when an individual will adopt a new behavior is an important problem in application domains such as marketing and public health. This paper examines the performance of a wide variety of social network based measurements proposed in the literature - which have not been previously compared directly. We study the probability of an individual becoming influenced based on measurements derived from neighborhood (i.e. number of influencers, personal network exposure), structural diversity, locality, temporal measures, cascade measures, and metadata. We also examine the ability to predict influence based on choice of classifier and how the ratio of positive to negative samples in both training and testing affect prediction results - further enabling practical use of these concepts for social influence applications.

Submitted 23 July, 2016; v1 submitted 3 July, 2016; originally announced July 2016.

Comments: 8 pages, 5 figures

arXiv:1606.05730 [pdf, other]

A Comparison of Methods for Cascade Prediction

Authors: Ruocheng Guo, Paulo Shakarian

Abstract: Information cascades exist in a wide variety of platforms on the Internet. A very important real-world problem is to identify which information cascades can go viral. A system addressing this problem can be used in a variety of applications including public health, marketing and counter-terrorism. A cascade can be considered a compound of the social network and the time series. However, in related literature where methods for solving the cascade prediction problem were proposed, the experimental settings were often limited to only a single metric for a specific problem formulation. Moreover, little attention was paid to the run time of those methods. In this paper, we first formulate the cascade prediction problem as both classification and regression. Then we compare three categories of cascade prediction methods: centrality based, feature based and point process based. We carry out the comparison through evaluation of the methods by both accuracy metrics and run time. The results show that feature based methods can outperform others in terms of prediction accuracy but suffer from heavy overhead, especially for large datasets. Point process based methods can also run into long run times when the model cannot adapt well to the data. This paper seeks to address these issues in order to allow developers of systems for social network analysis to select the most appropriate method for predicting viral information cascades.

Submitted 18 June, 2016; originally announced June 2016.

Comments: 8 pages, 29 figures, ASONAM 2016 (Industry Track)

arXiv:1508.03965 [pdf, other]

doi 10.1145/2783258.2788618

Early Identification of Violent Criminal Gang Members

Authors: Elham Shaabani, Ashkan Aleali, Paulo Shakarian, John Bertetto

Abstract: Gang violence is a major problem in the United States accounting for a large fraction of homicides and other violent crime. In this paper, we study the problem of early identification of violent gang members. Our approach relies on modified centrality measures that take into account additional data of the individuals in the social network of co-arrestees which together with other arrest metadata provide a rich set of features for a classification algorithm. We show our approach obtains high precision and recall (0.89 and 0.78 respectively) in the case where the entire network is known and out-performs current approaches used by law-enforcement to the problem in the case where the network is discovered over time by virtue of new arrests - mimicking real-world law-enforcement operations. Operational issues are also discussed as we are preparing to leverage this method in an operational environment.

Submitted 17 August, 2015; originally announced August 2015.

Comments: SIGKDD 2015

ACM Class: J.4

arXiv:1508.03371 [pdf, other]

Toward Order-of-Magnitude Cascade Prediction

Authors: Ruocheng Guo, Elham Shaabani, Abhinav Bhatnagar, Paulo Shakarian

Abstract: When a piece of information (microblog, photograph, video, link, etc.) starts to spread in a social network, an important question arises: will it spread to "viral" proportions -- where "viral" is defined as an order-of-magnitude increase. However, several previous studies have established that cascade size and frequency are related through a power-law - which leads to a severe imbalance in this classification problem. In this paper, we devise a suite of measurements based on "structural diversity" -- the variety of social contexts (communities) in which individuals partaking in a given cascade engage. We demonstrate these measures are able to distinguish viral from non-viral cascades, despite the severe imbalance of the data for this problem. Further, we leverage these measurements as features in a classification approach, successfully predicting microblogs that grow from 50 to 500 reposts with precision of 0.69 and recall of 0.52 for the viral class - despite this class comprising under 2% of samples. This significantly outperforms our baseline approach as well as the current state-of-the-art. Our work also demonstrates how we can trade off between precision and recall.

Submitted 13 August, 2015; originally announced August 2015.

Comments: 4 pages, 15 figures, ASONAM 2015 poster paper

arXiv:1508.01192 [pdf, other]

Mining for Causal Relationships: A Data-Driven Study of the Islamic State

Authors: Andrew Stanton, Amanda Thart, Ashish Jain, Priyank Vyas, Arpan Chatterjee, Paulo Shakarian

Abstract: The Islamic State of Iraq and al-Sham (ISIS) is a dominant insurgent group operating in Iraq and Syria that rose to prominence when it took over Mosul in June, 2014. In this paper, we present a data-driven approach to analyzing this group using a dataset consisting of 2200 incidents of military activity surrounding ISIS and the forces that oppose it (including Iraqi, Syrian, and the American-led coalition). We combine ideas from logic programming and causal reasoning to mine for association rules for which we present evidence of causality. We present relationships that link ISIS vehicle-borne improvised explosive device (VBIED) activity in Syria with military operations in Iraq, coalition air strikes, and ISIS IED activity, as well as rules that may serve as indicators of spikes in indirect fire, suicide attacks, and arrests.

Submitted 5 August, 2015; originally announced August 2015.

Journal ref: Final version presented at KDD 2015

arXiv:1507.01930 [pdf, other]

Malware Task Identification: A Data Driven Approach

Authors: Eric Nunes, Casey Buto, Paulo Shakarian, Christian Lebiere, Stefano Bennati, Robert Thomson, Holger Jaenisch

Abstract: Identifying the tasks a given piece of malware was designed to perform (e.g. logging keystrokes, recording video, establishing remote access, etc.) is a difficult and time-consuming operation that is largely human-driven in practice. In this paper, we present an automated method to identify malware tasks. Using two different malware collections, we explore various circumstances for each - including cases where the training data differs significantly from test; where the malware being evaluated employs packing to thwart analytical techniques; and conditions with sparse training data. We find that this approach consistently out-performs the current state-of-the-art software for malware task identification as well as standard machine learning approaches - often achieving an unbiased F1 score of over 0.9. In the near future, we look to deploy our approach for use by analysts in an operational cyber-security environment.

Submitted 7 July, 2015; originally announced July 2015.

Comments: 8 pages full paper, accepted FOSINT-SI (2015)

arXiv:1507.01922 [pdf, other]

Cyber-Deception and Attribution in Capture-the-Flag Exercises

Authors: Eric Nunes, Nimish Kulkarni, Paulo Shakarian, Andrew Ruef, Jay Little

Abstract: Attributing the culprit of a cyber-attack is widely considered one of the major technical and policy challenges of cyber-security. The lack of ground truth for an individual responsible for a given attack has limited previous studies. Here, we overcome this limitation by leveraging DEFCON capture-the-flag (CTF) exercise data where the actual ground-truth is known. In this work, we use various classification techniques to identify the culprit in a cyberattack and find that deceptive activities account for the majority of misclassified samples. We also explore several heuristics to alleviate some of the misclassification caused by deception.

Submitted 7 July, 2015; originally announced July 2015.

Comments: 4 pages Short name accepted to FOSINT-SI 2015

arXiv:1501.05990 [pdf]

Cyber Attacks and Public Embarrassment: A Survey of Some Notable Hacks

Authors: Jana Shakarian, Paulo Shakarian, Andrew Ruef

Abstract: We hear it all too often in the media: an organization is attacked, its data, often containing personally identifying information, is made public, and a hacking group emerges to claim credit. In this excerpt, we discuss how such groups operate and describe the details of a few major cyber-attacks of this sort in the wider context of how they occurred. We feel that understanding how such groups have operated in the past will give organizations ideas of how to defend against them in the future.

Submitted 23 January, 2015; originally announced January 2015.

arXiv:1404.6699 [pdf, ps, other]

An Argumentation-Based Framework to Address the Attribution Problem in Cyber-Warfare

Authors: Paulo Shakarian, Gerardo I. Simari, Geoffrey Moores, Simon Parsons, Marcelo A. Falappa

Abstract: Attributing a cyber-operation through the use of multiple pieces of technical evidence (i.e., malware reverse-engineering and source tracking) and conventional intelligence sources (i.e., human or signals intelligence) is a difficult problem not only due to the effort required to obtain evidence, but the ease with which an adversary can plant false evidence. In this paper, we introduce a formal reasoning system called the InCA (Intelligent Cyber Attribution) framework that is designed to aid an analyst in the attribution of a cyber-operation even when the available information is conflicting and/or uncertain. Our approach combines argumentation-based reasoning, logic programming, and probabilistic models to not only attribute an operation but also explain to the analyst why the system reaches its conclusions.

Submitted 26 April, 2014; originally announced April 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1401.1475

arXiv:1401.1475 [pdf, ps, other]

Belief Revision in Structured Probabilistic Argumentation

Authors: Paulo Shakarian, Gerardo I. Simari, Marcelo A. Falappa

Abstract: In real-world applications, knowledge bases consisting of all the information at hand for a specific domain, along with the current state of affairs, are bound to contain contradictory data coming from different sources, as well as data with varying degrees of uncertainty attached. Likewise, an important aspect of the effort associated with maintaining knowledge bases is deciding what information is no longer useful; pieces of information (such as intelligence reports) may be outdated, may come from sources that have recently been discovered to be of low quality, or abundant evidence may be available that contradicts them. In this paper, we propose a probabilistic structured argumentation framework that arises from the extension of Presumptive Defeasible Logic Programming (PreDeLP) with probabilistic models, and argue that this formalism is capable of addressing the basic issues of handling contradictory and uncertain data. Then, to address the last issue, we focus on the study of non-prioritized belief revision operations over probabilistic PreDeLP programs. We propose a set of rationality postulates -- based on well-known ones developed for classical knowledge bases -- that characterize how such operations should behave, and study a class of operators along with theoretical relationships with the proposed postulates, including a representation theorem stating the equivalence between this class and the class of operators characterized by the postulates.

Submitted 7 January, 2014; originally announced January 2014.

arXiv:1401.1086 [pdf, other]

Power Grid Defense Against Malicious Cascading Failure

Authors: Paulo Shakarian, Hansheng Lei, Roy Lindelauf

Abstract: An adversary looking to disrupt a power grid may look to target certain substations and sources of power generation to initiate a cascading failure that maximizes the number of customers without electricity. This is particularly an important concern when the enemy has the capability to launch cyber-attacks as practical concerns (i.e. avoiding disruption of service, presence of legacy systems, etc.) may hinder security. Hence, a defender can harden the security posture at certain power stations but may lack the time and resources to do this for the entire power grid. We model a power grid as a graph and introduce the cascading failure game in which both the defender and attacker choose a subset of power stations such as to minimize (maximize) the number of consumers having access to producers of power. We formalize problems for identifying both mixed and deterministic strategies for both players, prove complexity results under a variety of different scenarios, identify tractable cases, and develop algorithms for these problems. We also perform an experimental evaluation of the model and game on a real-world power grid network. Empirically, we noted that the game favors the attacker as he benefits more from increased resources than the defender. Further, the minimax defense produces roughly the same expected payoff as an easy-to-compute deterministic load based (DLB) defense when played against a minimax attack strategy. However, DLB performs more poorly than minimax defense when faced with the attacker's best response to DLB. This is likely due to the presence of low-load yet high-payoff nodes, which we also found in our empirical analysis.

Submitted 6 January, 2014; originally announced January 2014.

arXiv:1309.6450 [pdf]

The Dragon and the Computer: Why Intellectual Property Theft is Compatible with Chinese Cyber-Warfare Doctrine

Authors: Paulo Shakarian, Jana Shakarian, Andrew Ruef

Abstract: Along with the USA and Russia, China is often considered one of the leading cyber-powers in the world. In this excerpt, we explore how Chinese military thought, developed in the 1990s, influenced their cyber-operations in the early 2000s. In particular, we examine the ideas of "Unrestricted Warfare" and "Active Offense" and discuss how they can permit for the theft of intellectual property. We then specifically look at how the case study of Operation Aurora, a cyber-operation directed against many major U.S. technology and defense firms, reflects some of these ideas.

Submitted 25 September, 2013; originally announced September 2013.

Comments: This is an excerpt from the upcoming book Introduction to Cyber-Warfare: A Multidisciplinary Approach published by Syngress (ISBN: 978-0124078147)

arXiv:1309.2963 [pdf, other]

A Scalable Heuristic for Viral Marketing Under the Tipping Model

Authors: Paulo Shakarian, Sean Eyre, Damon Paulo

Abstract: In a "tipping" model, each node in a social network, representing an individual, adopts a property or behavior if a certain number of his incoming neighbors currently exhibit the same. In viral marketing, a key problem is to select an initial "seed" set from the network such that the entire network adopts any behavior given to the seed. Here we introduce a method for quickly finding seed sets that scales to very large networks. Our approach finds a set of nodes that guarantees spreading to the entire network under the tipping model. After experimentally evaluating 31 real-world networks, we found that our approach often finds seed sets that are several orders of magnitude smaller than the population size and outperform nodal centrality measures in most cases. In addition, our approach scales well - on a Friendster social network consisting of 5.6 million nodes and 28 million edges we found a seed set in under 3.6 hours. Our experiments also indicate that our algorithm provides small seed sets even if high-degree nodes are removed. Lastly, we find that highly clustered local neighborhoods, together with dense network-wide community structures, suppress a trend's ability to spread under the tipping model.

Submitted 11 September, 2013; originally announced September 2013.

Comments: arXiv admin note: substantial text overlap with arXiv:1205.4431

Showing 1–50 of 59 results for author: Shakarian, P