Optimizing Cyber Defense in Dynamic Active Directories through Reinforcement Learning

Diksha Goel1,2, Kristen Moore1,2, Mingyu Guo3, Derui Wang1,2, Minjune Kim1,2, Seyit Camtepe1,2
Abstract

This paper addresses a significant gap in Autonomous Cyber Operations (ACO) literature: the absence of effective edge-blocking ACO strategies in dynamic, real-world networks. It specifically targets the cybersecurity vulnerabilities of organizational Active Directory (AD) systems. Unlike the existing literature on edge-blocking defenses, which treats AD systems as static entities, our study recognizes their dynamic nature and develops advanced edge-blocking defenses through a Stackelberg game model between attacker and defender. We devise a Reinforcement Learning (RL)-based attack strategy and an RL-assisted Evolutionary Diversity Optimization-based defense strategy, where the attacker and defender improve each other’s strategy via parallel gameplay. To address the computational challenges of training attacker-defender strategies on numerous dynamic AD graphs, we propose an RL Training Facilitator that prunes environments and neural networks to eliminate irrelevant elements, enabling efficient and scalable training for large graphs. We extensively train the attacker strategy, as a sophisticated attacker model is essential for a robust defense. Our empirical results demonstrate that our proposed approach enhances the defender’s proficiency in hardening dynamic AD graphs while ensuring scalability for large-scale AD. The manuscript has been accepted as a full paper at the European Symposium on Research in Computer Security (ESORICS) 2024. This work has been supported by the Cyber Security Research Centre Limited whose activities are partially funded by the Australian Government’s Cooperative Research Centres Programme.

1CSIRO’s Data61, Australia

2Cyber Security Cooperative Research Centre (CSCRC), Australia

3University of Adelaide, Australia

{diksha.goel, kristen.moore, derek.wang, minjune.kim, seyit.camtepe}@data61.csiro.au

[email protected]

Keywords Active Directory · Network Security · Attack Graph · Reinforcement Learning · Stackelberg Game

1 Introduction

In the rapidly evolving digital world, organizations are strengthening their cybersecurity in response to the increasing frequency and severity of cyber attacks [1, 2]. Despite these efforts, traditional security operations centre analysts often face large volumes of alerts, which lead to alert fatigue and increase the chance that critical warnings are overlooked. This has motivated research into leveraging advances in Artificial Intelligence (AI) to scale and extend the capabilities of human operators to defend networks. One such emerging direction is Autonomous Cyber Operations (ACO), which involves the development of blue team (defender) and red team (attacker) decision-making agents in adversarial scenarios. Reinforcement Learning (RL) based solutions [3, 4] have demonstrated promising results in this domain, where the agents learn optimal cyber defense policies by exploring environmental dynamics. Several platforms, such as FARLAND [5], CybORG [6], and CyberBattleSim [7], have been developed to test and validate RL-based approaches in simulated cybersecurity environments. MITRE developed the FARLAND platform, which employs generative programs to model diverse network environment distributions, facilitating the development of RL-based defense mechanisms against evolving adversarial tactics. The CybORG platform, developed by the Australian Government’s Department of Defense, offers a wide range of simulation environments for ACO research. It spans scenarios from safeguarding autonomous drone networks to defending defense industry enterprises. CyberBattleSim, developed by Microsoft, simulates automated red team activities in networks, emphasizing the offensive side of cybersecurity operations. Although many studies use these platforms to advance ACO, their effectiveness is limited due to the significant difference in scale and complexity between the simulated environments and real-world networks.

Another body of work in the ACO literature [8, 9, 10, 11] focuses specifically on defending Microsoft Active Directory (AD). AD serves as a primary security management tool for Windows Domain Networks, enabling administrators to manage and control access to network resources. Given the widespread adoption of Microsoft Domain Networks among small as well as large organizations, AD has become a prime target for cyber attackers. Reports indicate that 1.2 million Azure AD accounts are compromised every month, with 80% of intrusions targeting administrative accounts [12]. These statistics highlight the importance of developing autonomous cyber defense agents to help harden AD environments. AD environments provide insights into real-world cyber defense scenarios and enable the development and validation of autonomous cyber defense strategies. By simulating attacks and defense mechanisms within the context of AD, we can explore the complexities of real-world cyber ecosystems and develop strategies tailored to mitigate threats at a large scale, thereby enhancing the overall resilience of organizational IT infrastructures against cyber attacks.

Figure 1: AD attack graph containing 500 computers.

An AD graph can be represented as an attack graph, where nodes symbolize computers, accounts, or security groups, while directed edges $(i, j)$ depict trust relationships that an attacker can exploit to escalate privileges or move laterally from node $i$ to node $j$. BloodHound is a widely used tool to discover attack paths in AD graphs. It employs an identity snowball attack, starting from a low-privilege account and progressing towards a high-privilege account ($\text{Computer A} \xrightarrow{\text{HasSession}} \text{User B} \xrightarrow{\text{AdminTo}} \text{Computer C}$). Figure 1 shows an AD attack graph generated using DBCreator, a tool devised by the BloodHound team to generate synthetic AD graphs. Before BloodHound, attackers relied on trial and error in AD attack graphs to reach Domain Admin (DA). BloodHound improves this by mapping shorter, less detectable attack paths. This ease of access raises security concerns. In response, defenders have started exploring strategies like edge blocking in AD attack graphs. BloodHound draws inspiration from academic research [13], where researchers developed a heuristic to block edges in attack graphs. This approach aims to disconnect the graph and hinder attackers from reaching DA. In AD environments, edge blocking involves actions like access revocation or enhanced surveillance to prevent unauthorized access to DA.

Existing solutions [10, 11, 9, 8, 14] for hardening AD graphs overlook their dynamic nature and assume AD graphs to be static. However, real-world AD graphs constantly change, primarily due to user activities like logging in and out of computers. In this paper, we develop a framework for training autonomous agents to defend large-scale, dynamic AD graphs. We model a Stackelberg game where the attacker aims to infiltrate and maximize their chances of reaching the highest-privilege Domain Admin (DA) account in a dynamic AD graph. In contrast, the defender aims to thwart the attacker’s attempts by devising an effective edge-blocking defensive policy. The dynamic nature of AD is characterized by the On/Off presence of HasSession edges (a particular edge type in AD). These edges are added to the graph when users log into the system, and the edges remain online until the session ends. We assume each HasSession edge is active with a 50% probability. We define a graph snapshot at time $t$ as a specific state of the dynamic AD graph, with all nodes and only the HasSession edges that are active at time $t$.
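The snapshot definition above can be illustrated with a minimal sketch. It assumes a toy edge-list representation; the function name `sample_snapshot`, the argument names, and the example nodes are hypothetical and are not taken from the paper's implementation.

```python
import random

def sample_snapshot(static_edges, has_session_edges, p_active=0.5, rng=None):
    """Return the edge list of one snapshot.

    Every non-HasSession edge is kept; each HasSession edge is kept
    independently with probability p_active (0.5 in the paper's assumption).
    """
    rng = rng or random.Random()
    active = [e for e in has_session_edges if rng.random() < p_active]
    return list(static_edges) + active

# Hypothetical toy graph: edges are (source, target) pairs.
static_edges = [("User B", "Computer C"), ("Computer C", "DA")]
has_session_edges = [("Computer A", "User B"), ("Computer D", "User B")]

snapshot = sample_snapshot(static_edges, has_session_edges, rng=random.Random(0))
print(snapshot)
```

Because the node set never changes, a snapshot is fully described by which HasSession edges are active, which is why only the edge list is sampled here.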

We address the attacker-defender problem in dynamic AD graphs by devising a Generalized Reinforcement Learning (GenRL) attacking policy and a Reinforcement Learning-assisted Evolutionary Diversity Optimization (RL-EDO)-based defensive policy. The defender’s RL-EDO generates multiple diverse defenses, while the attacker’s RL agent is trained in parallel across numerous RL environments. In each environment, the attacker faces various AD graph snapshots and one of the defender’s defense strategies. We train the attacker’s RL agent to optimize its success in reaching the DA in each environment. The dynamic nature of AD graphs presents a challenge due to the exponential number of potential snapshots, each representing a possible starting point for the attacker. This results in an exponential number of RL environments when we use RL to train the attacker. However, many of these environments are highly similar because the snapshots themselves are highly similar, which allows the knowledge learned from one environment to transfer to others. To train the agent to learn a shared policy and maximize rewards, we train the RL agent across multiple environments in parallel. During this parallel training, the RL agent is exposed to numerous different snapshots in each environment, broadening its exposure to a wider range of possibilities. This helps the attacker learn generalized knowledge applicable to dynamic graph settings, enhancing its ability to navigate the complexities of dynamic AD graphs effectively.

To address the computational challenge of training the attacker policy against an exponential number of RL environments, we propose an RL Training Facilitator (TrnF) that performs environment and Neural Network (NN) pruning to streamline the attacker’s RL training process. Environment pruning involves simplifying the RL environment by eliminating elements irrelevant to the attacker’s goals. For instance, if the attacker never utilizes certain edges, then there is no need to track their dynamic changes; thus, we can effectively disregard them. Similarly, Neural Network pruning optimizes NN architectures by reducing the weight of less significant dimensions. The proposed RL training facilitation technique accelerates the attacker’s training while also enhancing the performance of the RL agent. Notably, we extensively train the attacker’s policy, as a well-trained attacking policy is essential for developing an effective defense policy.

Existing literature consists of solutions for defending dynamic AD graphs via node-blocking strategy [15, 16]. However, the research problem they have considered is different from ours as we focus on defending dynamic AD graphs via edge-blocking defense. Moreover, [15, 16] studied a trivial attacker model where the attacker aims to reach DA via the shortest path, and if this path does not lead directly to the DA, the attack ends. This simplistic approach makes their attacker policy easy to predict, in turn, making it easier for the defender to defend. In contrast, our model presents a challenging planning problem for both attacker and defender. In our model, the attacker encounters a novel game during training that the attacker has never experienced before. Likewise, the defender is unaware of the attacker’s strategies, adding uncertainty to the defense and making our problem more difficult. Consequently, a gap exists in the literature, i.e., the absence of effective edge-blocking strategies for dynamic AD networks, and our work is the first attempt to address this issue.

Our main contributions are summarized as follows:

  • Attacker policy. We propose a Generalized RL attacking policy, trained across multiple RL environments concurrently. This approach accelerates convergence and enhances performance through shared learning experiences.

  • Defender policy. We design an RL-EDO based defensive strategy that generates and optimizes defense mechanisms. Unlike traditional defenses, RL-EDO adapts to sophisticated attack strategies by dynamically replacing ineffective defenses with more robust alternatives.

  • RL training facilitator. To address the scalability challenges in RL training for large AD, we design an innovative RL training facilitator. It optimizes the training process by pruning irrelevant elements from the environment and neural network architectures, ensuring efficient learning without compromising defense effectiveness.

  • Experimental analysis. We perform experiments on varying sizes of AD graphs, i.e., r1000, r2000, and r4000 (r1000 represents an AD graph containing 1000 computers). Our results demonstrate that 1) Our proposed attacker-defender approach generates highly effective defense; 2) Our approach accurately models the attacker problem in dynamic AD graphs; 3) Our approach is scalable to very large-scale dynamic AD graphs.

2 Related Work

Defending Active Directories. Guo et al. [8] proposed an FPT algorithm and a graph neural network-based approach for defending AD graphs. In another study, Guo et al. [9] developed a dynamic program and an RL-based approach for hardening large AD graphs. Goel et al. [10] introduced a neural network and EDO-based approach to address the attacker-defender problem, aiming to formulate an effective defensive policy. Goel et al. [11] developed an RL-based attacker policy for hardening large-scale AD graphs. Guo et al. [17] investigated the optimal edge-blocking problem, focusing on strategies that require minimal human intervention. Zhang et al. [14] devised a dual oracle solution for defending AD and evaluated it against industrial solutions. The aforementioned approaches are designed for static graphs and do not effectively address the challenges associated with dynamic AD graphs. Ngo et al. [18] proposed a defensive strategy for placing honeypots on network nodes to defend dynamic AD graphs. In another study, Ngo et al. [15] proposed an EDO-based decoy placement solution for time-varying AD graphs. However, both studies [18, 15] focused on node-blocking strategies for defending dynamic AD. Our objective is to intercept edges rather than nodes, making their solutions inapplicable to our problem.

Autonomous Cyber Defense. CybORG offers simulated environments for autonomous cyber defense through its 4 CAGE challenges [19, 20, 21, 22]. These challenges aim to enhance the blue agent’s capabilities to defend against red team attacks in various scenarios, such as autonomous drone networks and adversarial cyber-physical systems. Various RL approaches [23, 24, 25, 26, 27] have been developed to advance autonomous cyber defense abilities. However, while CybORG is useful for exploring cyber defense, its small scale and simple structure limit its applicability to real-world networks, which are comparatively larger and more complex. Consequently, solutions designed for CybORG may not be able to handle the scalability and complexity issues associated with AD graphs.

Evolutionary Diversity Optimization. Hebrard et al. [28] devised a strategy for discovering diverse solutions in constraint programming. Do et al. [29] examined various EDO techniques for permutation problems. Neumann et al. [30] developed EDO algorithms to address the stochastic version of the knapsack problem. Neumann et al. [31] proposed a coevolutionary Pareto diversity optimization approach for enhancing constrained single-objective problems. Nikfarjam et al. [32] investigated the integration of EDO algorithms with SAT solvers to maximize diversity in heavily constrained Boolean satisfiability problems.

3 Problem Description

We investigate a Stackelberg game involving a single attacker and a defender in a directed dynamic AD graph $G = (V, E)$, where $V$ represents the nodes and $E$ represents the edges. The node set $V$ remains constant, while the edge set $E$ dynamically changes due to user activities. This dynamism is primarily influenced by the presence or absence of HasSession edges ($H \subseteq E$), which are the key reasons for changes in real-world AD graphs. We assume that each HasSession edge is present with a 50% probability. Let $C$ and $U$ denote the set of computers and users in the AD graph, respectively. Authentication data for modelling user activities in the AD graph can be denoted as $\langle t_{\text{start}}, t_{\text{end}}, u_i, c_j \rangle$, where $t_{\text{start}}$, $t_{\text{end}}$, $u_i$, and $c_j$ represent the sign-in time, sign-off time, user, and computer, respectively. In the real world, attackers may employ tools like SharpHound to extract sign-in and sign-off times from Windows logs.

A graph snapshot at time $t$ can be represented as $G_t = (V, E_t)$, where $E_t = \langle e_{0,t}, e_{1,t}, \ldots, e_{m-1,t}, e_{m,t} \rangle$, with $m$ denoting the total number of HasSession edges. Each $e_{i,t}$ indicates whether the HasSession edge is active at time $t$, with 1 representing active and 0 representing inactive. Graph snapshots are represented as $G_s = \{G_1, G_2, \ldots, G_l\}$, where $l$ denotes the possible number of snapshots. The AD graph comprises $s$ entry nodes, enabling the attacker to initiate an attack from any of these nodes, and there is a single Domain Admin (DA). The attacker aims to devise an attacking policy to maximize their probability of reaching DA across all possible snapshots. On the other hand, the defender’s goal is to minimize the attacker’s success probability by selectively blocking $k$ edges, where $k$ is the defensive budget. Edge blocking in AD is a costly security measure, as it necessitates extensive auditing of access logs to remove edges safely. Consequently, budgets allocated for this process are typically low. Notably, not all edges are blockable; only specific edges labelled as ‘blockable’ can be blocked. Each edge $e$ in the AD graph is associated with a detection probability, failure probability, and success probability. The detection probability $p_{d(e)}$ represents the likelihood that an attacker traversing edge $e$ is detected, after which the attack is terminated. The failure probability $p_{f(e)}$ indicates the chance that an attacker fails to traverse edge $e$, for reasons such as being unable to crack a password. In such instances, the attack is not terminated, allowing the attacker to explore other unexplored edges. The success probability $p_{s(e)}$ denotes the chance of the attacker successfully traversing an edge and is calculated as $(1 - p_{d(e)} - p_{f(e)})$. The attacker starts an attack from a starting node and systematically explores unexplored edges to reach the DA until detection, exhaustion of all options, or successfully accessing DA. Additionally, the attacker maintains a Checkpoint set, which records the nodes under their control. This set serves as an alternate plan for continued attack upon failure, enhancing the attacker’s strategic approach.
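The three per-edge outcomes described above can be sketched as a single random draw. This is a minimal illustration under the stated probabilities; the function name `traverse_edge` and the example values 0.1 and 0.2 are assumptions made for the sketch, not values from the paper.

```python
import random

def traverse_edge(p_detect, p_fail, rng):
    """Sample the outcome of one traversal attempt on an edge.

    Returns "detected" (the attack terminates), "failed" (the attacker may try
    another unexplored edge), or "success" (the attacker gains the target node).
    The success probability is implicitly 1 - p_detect - p_fail.
    """
    if p_detect < 0 or p_fail < 0 or p_detect + p_fail > 1.0 + 1e-12:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    u = rng.random()
    if u < p_detect:
        return "detected"
    if u < p_detect + p_fail:
        return "failed"
    return "success"

# Hypothetical edge with p_d = 0.1 and p_f = 0.2, so p_s = 0.7.
rng = random.Random(1)
outcomes = [traverse_edge(0.1, 0.2, rng) for _ in range(10000)]
frac_success = outcomes.count("success") / len(outcomes)
print(round(frac_success, 2))
```

Separating "detected" from "failed" matters for the attacker's planning: only the former ends the episode, while the latter sends the attacker back to the nodes recorded in its Checkpoint set.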

4 Proposed Attacker-Defender Approach

This section first presents our proposed AD graph optimization technique, followed by the RL-based attacking policy, the RL training facilitator, and the RL-assisted evolutionary diversity optimization-based defensive approach. Finally, we discuss our overall attacker-defender strategy.

4.1 Proposed AD Graph Optimization Technique

The original AD graph is highly complex, making it difficult to process in its original state. To address this, we propose a graph optimization technique that utilizes structural features to create a more condensed representation. Below are some terminologies used in creating this condensed graph.

Definition 1.

Splitting nodes are the nodes that have more than one outgoing edge. Split denotes the set of splitting nodes.

Definition 2.

Entry nodes are the starting points from where an attacker can initiate an attack. Entry represents the set of entry nodes.

Definition 3.

Non-Splitting Path (NSP) from node $x$ to $y$ is a path that begins at node $x$ and solely reaches node $y$, where $y$ is the only successor of $x$. From node $y$, the path extends to its only successor until it reaches the DA or another splitting node [8].

$$NSP = \{\text{NSP}(x, y)\}$$

where $x \in \textsc{Split} \cup \textsc{Entry}$ and $y \in \textsc{Successors}(x)$.

Definition 4.

Block-Worthy edge (BW). A block-worthy edge $bw(x, y)$ is an edge on path $\text{NSP}(x, y)$ that can be blocked and is located farthest away from node $x$. The set of block-worthy edges is denoted as:

$$BW = \{bw(x, y)\}$$

where $x \in \textsc{Split} \cup \textsc{Entry}$ and $y \in \textsc{Successors}(x)$.

A block-worthy edge may be shared among two or more NSPs. For each NSP, we allocate a single unit of budget for blocking purposes. Our AD graph optimization approach reduces the initial AD graph with $n$ nodes and $m$ edges to a graph containing $(|\textsc{Entry}| + |\textsc{Split}| + 1)$ nodes and $|\text{NSP}|$ edges.
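Definitions 1 to 4 can be sketched as a short condensation routine. This is an illustrative reading of the definitions, not the authors' implementation: the function name `condense`, the edge-list input, and the handling of dead ends and cycles (the walk simply stops) are assumptions.

```python
def condense(edges, entry, da, blockable):
    """Condense a directed graph into non-splitting paths (NSPs).

    edges: iterable of (u, v) pairs; entry: set of entry nodes;
    da: the Domain Admin node; blockable: set of edges that may be blocked.
    Returns (nsp, bw): nsp maps (x, y) to the list of edges on NSP(x, y);
    bw maps (x, y) to its block-worthy edge, or None if nothing is blockable.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    # Definition 1: splitting nodes have more than one outgoing edge.
    split = {u for u, vs in succ.items() if len(vs) > 1}

    nsp, bw = {}, {}
    for x in split | set(entry):
        for y in succ.get(x, []):
            path, cur, seen = [(x, y)], y, {x}
            # Follow sole successors until DA or a splitting node is reached.
            # Assumption: a dead end or a revisited node also ends the walk.
            while cur != da and cur not in split and cur not in seen and succ.get(cur):
                seen.add(cur)
                nxt = succ[cur][0]
                path.append((cur, nxt))
                cur = nxt
            nsp[(x, y)] = path
            # Definition 4: the blockable edge farthest from x.
            bw[(x, y)] = next((e for e in reversed(path) if e in blockable), None)
    return nsp, bw

# Hypothetical toy graph with one entry node, one splitting node, and DA.
edges = [("E1", "A"), ("A", "S"), ("S", "B"), ("S", "C"), ("B", "DA"), ("C", "DA")]
nsp, bw = condense(edges, entry={"E1"}, da="DA",
                   blockable={("E1", "A"), ("A", "S"), ("S", "B")})
print(len(nsp), bw)
```

In this toy graph the seven original nodes collapse to three condensed nodes (E1, S, DA), matching $|\textsc{Entry}| + |\textsc{Split}| + 1$, and to three NSP edges.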

4.2 Attacker Approach: Reinforcement Learning

The attacker aims to develop an attacking strategy to optimize their chances of successfully reaching the DA in any given snapshot of the AD graph. We design a Generalized Reinforcement Learning (GenRL) based attacking strategy. We concurrently train the RL agent across numerous defensive plans implemented across multiple graph snapshots in separate RL training environments.

Attacker’s Environment. The attacker’s goal of reaching DA can be formalized as a Markov Decision Process (MDP), $M = (S, A, R, T)$, where $S$, $A$, $R$, and $T$ represent the state space, action space, reward function, and transition function, respectively. The MDP serves as the attacker’s environment and is described below.

  • State space (S). The state space represents the potential states of the attacker, where each state $s \in S$ is a vector of length $|\text{NSP}|$, and each element in the state corresponds to an NSP in the AD graph. The attacker’s state $s$ is denoted as:

    $$s = \underbrace{\langle F, S, \ldots, ?, S, ?, S, F \rangle}_{\text{Length} = \#\text{NSP}} \qquad (1)$$

    Here, ‘$S$’ denotes that the attacker tried this NSP and successfully made it to the other end of the NSP, ‘$F$’ indicates a failed attempt, and ‘$?$’ represents that the corresponding NSP has not been tried yet. For a given state $s$, the attacker selects an NSP marked as ‘$?$’, attempts to traverse it, and updates its status to ‘$S$’ or ‘$F$’ according to the outcome. The process continues until the attacker reaches DA, gets detected, or exhausts all options. Throughout the attack, the attacker’s current state $s$ serves as a knowledge base and provides information about NSPs under the attacker’s control, failed NSPs, and unexplored NSPs. In this way, our sophisticated attacker keeps track of past failed attempts and avoids wasting time on those attempts again, rendering the attacker’s strategy more effective. There are two terminating states: 1) If the attacker reaches DA, the attack succeeds and ends. 2) If the attacker fails to reach DA due to detection or exhaustion of all options, the attack fails and terminates.

  • Action space (A). For a given state $s$, the action space $A$ represents the available actions from that state, i.e., NSPs outgoing from successful NSPs in $s$. An action $a \in A$ represents an NSP and indicates that the attacker may attempt to traverse the selected NSP to reach DA.

  • Reward function (R). The reward $r(s, a)$ for taking action $a$ in state $s$ is 1 if the attacker successfully reaches DA. Otherwise, the reward is 0.

  • Transition function (T). For any state-action pair $(s, a)$, the transition function executes action $a$ on state $s$, leading to a set of potential future states. Each potential state is linked to a transition probability, which determines the likelihood of transitioning to the specific state.
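The MDP components listed above can be sketched as a small environment class. This is a toy illustration rather than the paper's environment: the class name `AttackEnv`, the per-NSP tuples `(source, target, p_detect, p_fail)`, the choice to mark a detected NSP as ‘F’, and the rule that NSPs leaving entry nodes are available from the start are all assumptions.

```python
import random

UNTRIED, SUCCESS, FAIL = "?", "S", "F"

class AttackEnv:
    """Toy attacker MDP over NSPs; each NSP is (source, target, p_detect, p_fail)."""

    def __init__(self, nsps, entry, da, seed=0):
        self.nsps, self.entry, self.da = list(nsps), set(entry), da
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.state = [UNTRIED] * len(self.nsps)   # one entry per NSP
        self.controlled = set(self.entry)         # the attacker's checkpoint set
        return tuple(self.state)

    def actions(self):
        """Indices of untried NSPs that start at a node the attacker controls."""
        return [i for i, nsp in enumerate(self.nsps)
                if self.state[i] == UNTRIED and nsp[0] in self.controlled]

    def step(self, a):
        """Attempt NSP a and return (next_state, reward, done)."""
        if a not in self.actions():
            raise ValueError("NSP is not available in the current state")
        _, target, p_detect, p_fail = self.nsps[a]
        u = self.rng.random()
        if u < p_detect:                  # detected: the attack terminates
            self.state[a] = FAIL          # assumption: recorded as a failed NSP
            return tuple(self.state), 0.0, True
        if u < p_detect + p_fail:         # failed: fall back to other NSPs
            self.state[a] = FAIL
            return tuple(self.state), 0.0, not self.actions()
        self.state[a] = SUCCESS
        self.controlled.add(target)
        if target == self.da:             # reward 1 only when DA is reached
            return tuple(self.state), 1.0, True
        return tuple(self.state), 0.0, not self.actions()

# Hypothetical two-NSP chain with no detection or failure risk.
env = AttackEnv([("E1", "S", 0.0, 0.0), ("S", "DA", 0.0, 0.0)], entry={"E1"}, da="DA")
state, reward, done = env.step(env.actions()[0])
print(state, reward, done)
```

Keeping the full ‘S’/‘F’/‘?’ vector as the observation is what lets a policy avoid retrying failed NSPs, at the cost of a state space that grows with the number of NSPs.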

Training Procedure. We utilize the actor-critic-based Proximal Policy Optimization (PPO) algorithm to train the attacker’s strategy. We chose the PPO algorithm for its actor-critic framework suited to our attacker-defender RL policy and its efficient handling of discrete action spaces crucial for our AD network simulation. The actor network proposes actions to maximize rewards, while the critic network assesses the attacker’s success rate for each state. To ensure robust training on dynamic AD graphs, we utilize 50 graph snapshots per environment, each containing a specific defense strategy devised by the defender. The attacker’s initial state is determined by implementing the defense in one of these snapshots. Subsequently, the RL agent undergoes training against this snapshot with the implemented defense. After each episode, a new snapshot is selected from the pool of 50, enabling training against diverse graph scenarios. The RL agent operates concurrently across multiple environments to gather data. At each step within an episode, the agent observes a state $s_t$, selects an action $a_t$ using the actor network, transitions to a new state $s_{t+1}$ based on the action, and receives a reward $r_{t+1}$. This process continues until the agent reaches DA or is detected. The training objective is to maximize cumulative rewards and learn a shared policy adaptable across different defense strategies and varying graph configurations. Initially, games may vary across environments, but as the agent learns, it refines its policy to accommodate decreasing differences between environments.
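For reference, PPO is commonly trained with the clipped surrogate objective shown below. The paper does not state which PPO variant or hyperparameters are used, so this is the standard formulation from the PPO literature rather than a description of the authors' exact loss; the probability ratio $\rho_t(\theta)$, the advantage estimate $\hat{A}_t$ (typically derived from the critic), and the clip range $\epsilon$ are symbols introduced here and do not appear in the original text.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big(
        \rho_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
      \Big)
    \right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

The clipping keeps each policy update close to the policy that collected the data, which is one plausible reason PPO behaves stably when experience is gathered from many parallel environments.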

Training Challenge. Training the attacker’s policy for dynamic AD graphs presents a significant challenge due to the exponential number of distinct snapshots available as starting points. The number of these snapshots can grow exponentially, reaching up to $2^{|NSPs|}$, assuming a 50% probability for each HasSession edge to be active. While not every NSP necessarily includes a HasSession edge, we consider the worst-case scenario where at least one HasSession edge is present in each NSP. However, training the RL agent against each of the $2^{|NSPs|}$ possible snapshots is impractical due to the large number of NSPs present in real-world AD graphs. To address this challenge, we propose an RL training facilitator designed to streamline and optimize the RL agent’s training process.

4.3 RL Training Facilitator: Pruning Approaches

To address the challenge of training a generalized attacker policy across numerous graph snapshots, we propose the RL Training Facilitator (TrnF) to optimize the efficiency and effectiveness of the RL agent’s training. Specifically, we introduce two pruning approaches aimed at streamlining the training process by removing elements irrelevant to the attacker. This reduces computational overhead and enhances the agent’s capacity to learn effectively.

Environment Pruning via Simplification Agent. We introduce a simplification agent designed to optimize the environment by identifying and removing unnecessary NSPs (which can be labelled as noise) that attackers do not utilize. This agent prunes irrelevant NSPs, thereby reducing the number of potential graph snapshots that the attacker’s policy needs to learn. Initially, we train the RL agent using the attacker’s policy (Section 4.2) and subsequently deploy the simplification agent for environment pruning. The simplification agent analyzes (state, action) pairs across episodes, and if the RL agent consistently takes specific actions (NSPs) for certain states, then the simplification agent eliminates the unused NSPs. Universally irrelevant NSPs, i.e., NSPs irrelevant across all environments, are identified using the trained RL critic network. From this set, a subset of NSPs is randomly selected, and if their removal does not impact the critic value, they are discarded; if the state value changes, the attempt is repeated with a different NSP subset. We remove irrelevant NSPs from only half of the environments in order to expose the RL agent to diverse scenarios. After the iterative process, previously removed NSPs are reintroduced to confirm their irrelevance. If the RL agent still does not utilize them, their irrelevance is confirmed. This reduction in NSPs limits the starting points for the attacker’s policy learning. Removing $x$ irrelevant NSPs reduces the number of starting points by a factor of $2^{x}$, thereby accelerating training and optimizing resource allocation.
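The subset-removal loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `critic_value` is a hypothetical callable standing in for the trained critic evaluated on an environment restricted to the given NSPs, and `subset_size`, `max_tries`, and `tol` are illustrative parameters that do not come from the paper. The reintroduction check and the "half of the environments" rule are omitted.

```python
import random

def prune_environment(nsps, candidates, critic_value, subset_size=2,
                      max_tries=20, tol=1e-6, seed=0):
    """Drop candidate NSPs whose removal leaves the critic value unchanged.

    nsps: set of NSP ids in the environment.
    candidates: NSP ids the attacker was never observed to use.
    critic_value: hypothetical callable mapping a set of NSP ids to the
        critic's value estimate for that environment's initial state.
    """
    rng = random.Random(seed)
    kept, pool = set(nsps), set(candidates)
    baseline = critic_value(kept)
    for _ in range(max_tries):
        if not pool:
            break
        subset = set(rng.sample(sorted(pool), min(subset_size, len(pool))))
        if abs(critic_value(kept - subset) - baseline) <= tol:
            kept -= subset      # removal is harmless, so discard these NSPs
            pool -= subset
        # otherwise retry with a different random subset
    return kept

# Stand-in critic for the sketch: only NSPs 0 and 1 matter to the attacker.
toy_critic = lambda active: 0.6 if {0, 1} <= active else 0.0
kept = prune_environment(set(range(6)), candidates={2, 3, 4, 5}, critic_value=toy_critic)
print(sorted(kept), 2 ** (6 - len(kept)))
```

Under the worst-case assumption that every NSP carries a HasSession edge, removing the four irrelevant NSPs in this toy case shrinks the snapshot count from $2^6 = 64$ to $2^2 = 4$, a factor of $2^4 = 16$.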

Neural Network Pruning via Weight Reduction by Fixed Ratio. We propose a NN pruning technique to optimize NN architectures for enhancing RL agent training. This technique selectively reduces weights of less critical dimensions within the NN that correspond to less influential actions. The goal is to streamline the learning process, enabling faster convergence towards optimal weights. By ignoring unnecessary dimensions in input data, the NN reduces noise and expedites training. During training, the NN evaluates the importance of dimensions for actions at split nodes and adjusts weights accordingly. We iteratively block each dimension of a split node and monitor the critic value’s stability. Stable values prompt us to prune the dimension’s weight by a fixed ratio, guiding the NN towards minimizing irrelevant dimension weights. This proactive approach self-corrects weight reduction errors by adjusting weights in subsequent iterations, ensuring reliability. This method actively adjusts weights rather than relying solely on training. Similar to our environment pruning technique, we leverage domain knowledge to identify and reduce unnecessary dimensions, optimizing NN architecture for faster convergence towards optimal weights.
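A minimal sketch of the fixed-ratio weight reduction follows, under simplifying assumptions: the network is reduced to one weight per input dimension, and the critic is a hypothetical callable `critic_value(blocked_dim)` that reports the value estimate when one dimension is blocked. The names `prune_weights_by_ratio` and `toy_critic` and the tolerance `tol` are illustrative only.

```python
def prune_weights_by_ratio(weights, critic_value, ratio=0.02, tol=1e-6):
    """Shrink the weights of input dimensions whose blocking leaves the critic stable.

    weights      : list of per-dimension input weights (a simplified stand-in for a layer)
    critic_value : callable(blocked_dim or None) -> float, hypothetical critic probe
    """
    baseline = critic_value(None)
    new_weights = list(weights)
    for dim in range(len(weights)):
        # Block one dimension at a time and compare the critic output with the baseline.
        if abs(critic_value(dim) - baseline) <= tol:
            # Stable value: reduce this dimension's weight by the fixed ratio.
            new_weights[dim] = weights[dim] * (1.0 - ratio)
    return new_weights


def toy_critic(blocked_dim):
    # Toy stand-in: only dimension 1 influences the value estimate.
    return 0.5 if blocked_dim == 1 else 0.7


w = prune_weights_by_ratio([1.0, 1.0, 1.0], toy_critic, ratio=0.02)
```

Because each call only scales a weight rather than zeroing it, a dimension that was reduced by mistake can still recover through subsequent gradient updates, which matches the self-correcting behaviour described above.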

Our RL training facilitator provides several advantages. 1) By prioritizing important NSPs, it accelerates RL policy training and optimizes resource efficiency by focusing on relevant snapshots. 2) Removing irrelevant NSPs filters out noise, improving the accuracy of attacker behaviour modelling for precise decision-making. 3) Integrating the training facilitator enhances the scalability of attacker policies by simplifying both the environment and NN, thereby reducing complexity and enabling quicker convergence. These improvements collectively enhance the performance of the RL agent significantly.

4.4 Defender’s Approach: Reinforcement Learning Assisted Evolutionary Diversity Optimization

To defend dynamic AD graphs, the defensive approach must minimize the attacker's success rate across all potential AD graph snapshots. However, designing individual defensive strategies for each snapshot is impractical due to the exponential number of possible graph snapshots ($2^{|NSPs|}$). Therefore, our goal is to devise a generalized defensive policy that minimizes the attacker's success probability across any conceivable snapshot. To address this, we propose a Reinforcement Learning assisted Evolutionary Diversity Optimization (RL-EDO) policy. Our approach involves extracting a set of static graph snapshots, denoted as $G_s$, from the dynamic AD graph and strategically blocking a subset of edges in the AD graph to minimize the attacker's average success rate across all instances in $G_s$. The RL-EDO approach uses EDO to generate multiple diverse, high-quality defenses and allows the attacker to play against these defenses across multiple environments. After training the attacker's policy, the defender employs the trained RL critic network to evaluate the attacker's performance against each defense. Defenses that are advantageous for the attacker are replaced with better alternatives, avoiding the computational effort required to train the policy against these defenses. The defender is constrained to block a maximum of $k$ edges (only block-worthy edges can be blocked, as not all edges are blockable). The defender uses the attacker's trained RL critic network as a fitness metric for assessing individual defensive strategies. The fitness of a defensive plan indicates the attacker's success rate when facing that specific defense. The defensive strategy can be depicted as follows:

$$\text{Defense plan vector} = \underbrace{\langle B, N, \ldots, N, B, B \rangle}_{\text{Length of vector} \,=\, \#\text{Block-worthy edges}} \tag{2}$$

Here, $B$ and $N$ represent blocked and non-blocked edges, respectively. The defender creates an initial defense population $P$, in which each individual is represented as a vector of size $|BW|$ (see Eq. 2). Each coordinate in the vector is either $B$ or $N$, and the count of $B$'s in the vector equals the defensive budget $k$. To generate offspring (new defenses), the defender performs either a mutation or a crossover operation, each with probability $0.5$, on randomly chosen individuals from $P$. These operations ensure that the total number of blocked edges remains within the budget $k$. Randomness is introduced into the operations by sampling a value $x$ from a Poisson distribution with a mean of 1. The mutation and crossover operations are performed as follows:

Mutation Operation. A randomly selected individual defense $p'$ from $P$ undergoes mutation by swapping $x$ occurrences of $N$ with $B$ and $x$ occurrences of $B$ with $N$ to generate a new offspring.
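The mutation step can be illustrated with the following sketch. It assumes plans are lists of the characters `"B"` and `"N"`; the helper names `sample_poisson` and `mutate` are hypothetical, the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler, and capping $x$ at the number of available entries is an assumption of this sketch rather than something stated in the text.

```python
import math
import random

def sample_poisson(rng, mean=1.0):
    """Knuth's multiplication method for a Poisson draw (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def mutate(plan, rng):
    """Flip x 'N' entries to 'B' and x 'B' entries to 'N', keeping the number of 'B's fixed."""
    blocked = [i for i, s in enumerate(plan) if s == "B"]
    free = [i for i, s in enumerate(plan) if s == "N"]
    # Cap x so that enough entries of each kind exist (assumption of this sketch).
    x = min(sample_poisson(rng), len(blocked), len(free))
    child = list(plan)
    for i in rng.sample(free, x):
        child[i] = "B"
    for i in rng.sample(blocked, x):
        child[i] = "N"
    return child

rng = random.Random(0)
parent = ["B", "N", "N", "N", "B", "B", "N", "N"]
child = mutate(parent, rng)
```

Since the same number of entries is flipped in each direction, the offspring always blocks exactly as many edges as its parent and therefore stays within the budget $k$.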

Crossover Operation. Two individuals, $p'$ and $p''$, are randomly selected from $P$. We then identify $x$ coordinates where $p'$ has $N$ and $p''$ has $B$. We swap the values at these coordinates, replacing $N$ in $p'$ with $B$ and $B$ in $p''$ with $N$. Similarly, we identify another set of $x$ coordinates where $p'$ has $B$ and $p''$ has $N$ and perform the swap again, substituting $B$ with $N$ and vice versa.
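A corresponding sketch of the crossover step is given below. The function name `crossover` and the list-of-characters encoding are illustrative assumptions; $x$ is passed in directly (in the paper it is drawn from a Poisson distribution) and is capped at the number of coordinates that actually differ.

```python
import random

def crossover(p1, p2, x, rng):
    """Exchange up to x (N, B) coordinates and up to x (B, N) coordinates between two plans."""
    # Coordinates where p1 has 'N' and p2 has 'B', and the reverse.
    nb = [i for i, (a, b) in enumerate(zip(p1, p2)) if a == "N" and b == "B"]
    bn = [i for i, (a, b) in enumerate(zip(p1, p2)) if a == "B" and b == "N"]
    x = min(x, len(nb), len(bn))
    c1, c2 = list(p1), list(p2)
    for i in rng.sample(nb, x) + rng.sample(bn, x):
        # Swapping the two values flips N<->B in both offspring at coordinate i.
        c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

rng = random.Random(1)
p1 = list("BBNNNB")
p2 = list("NBBNBN")
c1, c2 = crossover(p1, p2, x=1, rng=rng)
```

Each offspring gains $x$ blocked edges and loses $x$ blocked edges, so both remain within the budget $k$.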

Diversity Optimization in Population. After generating offspring, we evaluate its fitness score and selectively incorporate it into $P$ only if its fitness falls within $(BEST \pm 0.1)$. If the offspring fails to meet this criterion, we reject it, even if it may bring potential diversity benefits to the population. This selective process balances the introduction of new defenses against maintaining the population's superior fitness. Our goal upon adding an individual to $P$ is to optimize population diversity by removing the individual that contributes the least to diversity. We define diversity as blocking all edges deemed block-worthy, with the objective of enhancing the diversity of blocked edges across the defense plan population. This metric calculates the frequency with which each block-worthy edge is blocked across the population and aims to achieve an even distribution. In population $P$ of $\mu$ individuals, each individual $p_i$ can be represented as follows:

$$p_i = \big((B/N, bw_1), (B/N, bw_2), \ldots, (B/N, bw_{|BW|})\big)$$

Here, $B$ denotes the blocked status of the block-worthy edge, $N$ denotes the non-blocked status, and $i \in \{1, \ldots, \mu\}$. We compute the count of individuals who have blocked each block-worthy edge $bw_j$, where $j \in \{1, \ldots, |BW|\}$. This count is represented by the vector $C(bw)$ and is calculated as:

$$C(bw) = \big(c(bw_1), c(bw_2), \ldots, c(bw_{|BW|})\big)$$

Here, $c(bw_1)$ represents the count of individuals out of $\mu$ that have blocked the $bw_1$ edge. The diversity of $P$ without including individual $p_i$ is represented by the vector $D(C(bw) \setminus p_i)$, and is calculated as:

$$D(C(bw) \setminus p_i) = C(bw) - p_i$$

The defender aims to maximize the diversity of blocked edges and computes $\text{SortedD}(C(bw) \setminus p_i)$ as:

$$\text{SortedD}(C(bw) \setminus p_i) = \text{sort}\big(D(C(bw) \setminus p_i)\big)$$

To optimize the population diversity, we minimize $\text{SortedD}(C(bw) \setminus p_i)$ in descending lexicographic order. We achieve this by eliminating the individual $h$ with the lowest $\text{SortedD}(C(bw) \setminus p_h)$ score to maximize the population's diversity. If removing individual $h$ increases diversity and its fitness value is far from the best, we remove it from the population. Conversely, if newly generated offspring attain the best fitness value, they are included in $P$ regardless of their diversity, while the individual with the worst fitness value is eliminated. This approach enables the defender to create a diverse yet high-quality blocking plan.
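The diversity bookkeeping above can be sketched as follows. The sketch reads "descending lexicographic order" as: sort the count vector in descending order, then compare vectors lexicographically and prefer the smallest. This reading, the treatment of $B$ as 1 and $N$ as 0 when subtracting $p_i$, and the function names are assumptions of the sketch.

```python
def blocked_counts(population):
    """C(bw): number of individuals blocking each block-worthy edge."""
    return [sum(1 for p in population if p[j] == "B") for j in range(len(population[0]))]

def least_diverse_index(population):
    """Index h whose removal yields the lexicographically smallest descending-sorted count vector."""
    counts = blocked_counts(population)
    best_index, best_key = None, None
    for i, p in enumerate(population):
        # D(C(bw) \ p_i) = C(bw) - p_i, with B counted as 1 and N as 0.
        remaining = [c - (1 if s == "B" else 0) for c, s in zip(counts, p)]
        key = sorted(remaining, reverse=True)
        if best_key is None or key < best_key:
            best_index, best_key = i, key
    return best_index, best_key

population = [
    list("BBNN"),
    list("BBNN"),
    list("NNBB"),
]
counts = blocked_counts(population)
h, sorted_d = least_diverse_index(population)
```

In this toy population, removing one of the two identical plans leaves every edge blocked exactly once, which is the most even distribution, so one of the duplicates is selected for removal.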

Figure 2: Proposed RL-based Attacker-Defender Approach for Dynamic Networks.

4.5 Overall Attacker-Defender Approach

The defender uses RL-EDO to create multiple diverse defense plans, one for each of the attacker's RL environments. Each RL environment thus contains one defense plan from the defender, applied to numerous graph snapshots, and our RL agent is trained in parallel across these environments. We adopt a rotation mechanism that switches the graph snapshot the RL agent faces after each training episode. Our goal is to train a generalized attacking policy capable of achieving high success rates regardless of the AD snapshot. To address the computational challenge of training the attacker policy against numerous snapshots, we design an RL training facilitator that performs environment and NN pruning to simplify the RL training process. In environment pruning, we eliminate universally irrelevant NSPs, and in NN pruning, we reduce the weights of less important dimensions. We first train the RL agent to learn the optimal actions, and once it is trained, we perform the pruning steps. After pruning, we retrain the RL agent to optimize actions on the reduced graph. This iterative process of pruning and training ensures more efficient learning across multiple snapshots. After this iterative process, we reintroduce the previously removed NSPs and retrain the RL agent on the restored graph to confirm that they are indeed irrelevant. The defender continuously evaluates the current set of defensive plans, replacing those that are advantageous for the attacker with superior alternatives. Notably, the attacker's RL critic network serves two purposes: the attacker's policy uses it to implement the pruning techniques, and the defender's policy uses it to assess defensive strategies. Additionally, the diversity of the defense plans helps the RL agent avoid local optima and facilitates more precise learning of the attacking policy. Overall, the parallel gameplay between the attacker and defender helps both policies improve. Figure 2 illustrates our overall proposed approach.

5 Experimental Results

We evaluate the effectiveness of our proposed attacker-defender strategy on synthetic AD graphs of varying sizes. We conduct the experiments on a high-performance cluster server, dedicating one CPU and 20 cores per trial. We perform experiments on five distinct AD graphs (using 5 different seeds ranging from 0 to 4, reporting an average over five seeds), each with varying blockable edges and entry nodes. We implemented the code in PyTorch.

5.1 Synthetic AD Graph Dataset

In this study, we employ synthetic AD graph datasets to evaluate the effectiveness of our proposed attacker-defender strategies. Given the sensitive nature and limited accessibility of real-world AD data, synthetic datasets provide a crucial advantage in enabling controlled experiments and systematic exploration of cybersecurity strategies. We use the DBCreator tool, a popular generator of synthetic AD graphs, to produce graphs of three different sizes: r1000, r2000, and r4000, containing 1000, 2000, and 4000 computers, respectively. Details of the dataset are provided in Table 1. We consider three primary edge types present in BloodHound: AdminTo, HasSession, and MemberOf. To simulate attacker behaviour, we randomly select 20 starting nodes from a set of 40 nodes located at the maximum distance from the DA, ensuring coverage across a wide spectrum of potential attack paths. The probability of blocking an edge is determined by its distance from the DA, i.e., edges farther away are more likely to be blocked. This probability is calculated as the ratio of the minimum number of hops between edge $e$ and the DA to the maximum number of hops between any edge and the DA. We preprocess the AD graph by consolidating multiple DA nodes into one and by removing all incoming edges to starting nodes, all outgoing edges from the DA node, and all inaccessible nodes. Each NSP is treated as a single edge.
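The hop-ratio rule for the edge-blocking probability can be sketched on a toy graph as follows. The text does not define how the hop count between an edge and the DA is measured; this sketch assumes it is one (for the edge itself) plus the shortest-path distance from the edge's head node to the DA. The function name, node names, and toy edges are hypothetical.

```python
from collections import deque

def edge_block_probabilities(edges, da):
    """Map each edge (u, v) to min_hops(edge, DA) / max_hops over all edges."""
    reverse_adj = {}
    for u, v in edges:
        reverse_adj.setdefault(v, []).append(u)
    # BFS on reversed edges gives each node's hop distance to the DA node.
    dist = {da: 0}
    queue = deque([da])
    while queue:
        node = queue.popleft()
        for pred in reverse_adj.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    # Assumed convention: an edge's hop count is 1 (the edge itself) plus its head's distance.
    hops = {(u, v): 1 + dist[v] for u, v in edges if v in dist}
    max_hops = max(hops.values())
    return {e: h / max_hops for e, h in hops.items()}

toy_edges = [("a", "b"), ("b", "c"), ("c", "DA"), ("a", "c")]
probs = edge_block_probabilities(toy_edges, "DA")
```

On this toy graph the edge adjacent to the DA receives the lowest probability and the farthest edge receives probability 1, consistent with farther edges being more likely to be blocked.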

Table 1: Description of AD dataset.
AD graph |V| |E|
r1000 2996 8814
r2000 5997 18795
r4000 12001 45780

Correlation Distributions. We analyze the impact of the correlation between each edge's detection probability ($p_{d(e)}$) and failure probability ($p_{f(e)}$) on the attacker's success probability under three distributions: Independent (Ind), Positive correlation (Pos), and Negative correlation (Neg). In the Independent distribution, $p_{d(e)}$ and $p_{f(e)}$ are uniformly distributed between 0 and 0.2. In the Positive correlation distribution, $p_{d(e)}$ and $p_{f(e)}$ follow a multivariate normal distribution with mean $\mu = [0.1, 0.1]$ and covariance matrix $\Sigma = [[0.052, 0.5 \times 0.052], [0.5 \times 0.052, 0.052]]$. For the Negative correlation distribution, $p_{d(e)}$ and $p_{f(e)}$ follow a multivariate normal distribution with mean $\mu = [0.1, 0.1]$ and covariance matrix $\Sigma = [[0.052, -0.5 \times 0.052], [-0.5 \times 0.052, 0.052]]$.
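As an illustration of the two correlated settings, the sketch below draws pairs from a bivariate normal distribution with equal variances and correlation $\pm 0.5$, using a hand-written Cholesky factor so that only the standard library is needed. The variance is taken as printed in the text (0.052); how values falling outside $[0, 1]$ are handled is not stated in the text and is not addressed here. The function names are hypothetical.

```python
import math
import random

def sample_correlated_pair(rng, mean=0.1, variance=0.052, rho=0.5):
    """Draw (p_d, p_f) from a bivariate normal with equal variances and correlation rho."""
    std = math.sqrt(variance)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Cholesky factor of [[1, rho], [rho, 1]] applied to independent standard normals.
    p_d = mean + std * z1
    p_f = mean + std * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return p_d, p_f

def empirical_correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    vx = sum((a - mx) ** 2 for a, _ in pairs)
    vy = sum((b - my) ** 2 for _, b in pairs)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
pos = [sample_correlated_pair(rng, rho=0.5) for _ in range(20000)]
neg = [sample_correlated_pair(rng, rho=-0.5) for _ in range(20000)]
corr_pos = empirical_correlation(pos)
corr_neg = empirical_correlation(neg)
```

The off-diagonal entries $\pm 0.5 \times 0.052$ divided by the diagonal entry 0.052 give a correlation coefficient of $\pm 0.5$, which the empirical estimates should approximate.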

5.2 Training Parameters

For the defender, the edge-blocking budget is set to 5; edge blocking is expensive because access logs must be securely audited before an edge can be deleted, so the budget is generally low. The defender generates a population of 20 blocking plans over 20,000 iterations. We set the crossover and mutation probabilities to 0.5. For training the attacker's policy, we implement RL environments using OpenAI Gym [33] and employ the PPO algorithm for training the RL agent. We implement the actor and critic networks as multi-layer NNs. The model is optimized with the Adam optimizer using a learning rate of 0.0005, a batch size of 800 states, and a hidden layer size of 128. For PPO-specific hyperparameters, we follow the standard settings from the original paper [34]. We train the RL policy concurrently across 20 RL environments. Upon reaching the termination criterion, the defender selects the defense with the lowest attacker success rate as the Best Defense. To simulate the dynamic behaviour of the AD graph, each HasSession edge is randomly added to the graph with a probability of 0.5.
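The snapshot sampling rule at the end of this paragraph can be sketched as follows. The edge tuples, node names, and the function `sample_snapshot` are hypothetical; the sketch only assumes that AdminTo and MemberOf edges are always present and that each HasSession edge is included independently with probability 0.5.

```python
import random

def sample_snapshot(static_edges, session_edges, rng, p=0.5):
    """Return one graph snapshot: all static edges plus each HasSession edge with probability p."""
    present = [e for e in session_edges if rng.random() < p]
    return list(static_edges) + present

rng = random.Random(0)
static_edges = [("u1", "g1", "MemberOf"), ("g1", "c1", "AdminTo")]
session_edges = [
    ("c1", "u2", "HasSession"),
    ("c2", "u3", "HasSession"),
    ("c3", "u1", "HasSession"),
]
# A pool of 50 snapshots, as used per RL environment in the experiments.
snapshots = [sample_snapshot(static_edges, session_edges, rng) for _ in range(50)]
# Number of distinct snapshots that this toy graph can produce.
max_distinct = 2 ** len(session_edges)
```

Even this toy graph with three session edges yields eight distinct snapshots, which illustrates why the number of snapshots grows exponentially on realistic graphs.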

5.3 Attacker-Defender Policy Training

We train the attacker-defender approach for 1200 epochs, which takes approximately 4-5 days. During this process, the defender generates 20 diverse defensive plans, against which the attacker plays concurrently. Each of the 20 RL environments contains a specific defense from the defender and a pool of 50 graph snapshots; to train the policy against diverse scenarios, a new snapshot is loaded from this pool during training. The attacker's policy undergoes continuous training, while the defender evaluates and resets the defensive environments after every 50 epochs. Upon reaching the termination condition, the defender evaluates the performance of the 20 defensive plans using the RL critic network and selects the plan with the lowest attacker success rate. For our proposed training facilitator, within every 50 epochs before the defender evaluates and resets the environments, we perform environment pruning in the 30th and 40th epochs. In the 45th epoch, we reintroduce all removed edges, and training continues for 5 more epochs to confirm the irrelevance of the removed edges. Furthermore, we conduct environment pruning in only 10 of the 20 environments to expose the RL agent to both pruned and original environments. Additionally, our NN pruning technique gradually reduces the weights of less important dimensions by 2% every 10 minutes. We train two attacker-defender approaches in this way:

  1. GenRL-TrnF+RL-EDO (Proposed). GenRL is employed as the attacker's policy, with a training facilitator supporting the RL agent's training process. Meanwhile, the defender employs the RL-EDO approach to generate defense strategies.

  2. GenRL+C-EDO [11]. GenRL is employed as the attacker's policy, while EDO is utilized as the defender's policy. Notably, the attacker's GenRL policy in this approach operates without the support of the training facilitator.
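The within-cycle schedule described in this section (environment pruning in the 30th and 40th epochs, reintroduction of removed edges in the 45th, and defender evaluation after every 50 epochs) can be summarized with a small helper. The function and action names are hypothetical, and the sketch assumes 1-indexed epochs with positions counted relative to each 50-epoch cycle.

```python
def cycle_actions(epoch, cycle_length=50):
    """Return the facilitator/defender actions for a 1-indexed training epoch."""
    position = (epoch - 1) % cycle_length + 1
    actions = []
    if position in (30, 40):
        actions.append("prune_environment")
    if position == 45:
        actions.append("reintroduce_removed_edges")
    if position == cycle_length:
        actions.append("defender_evaluate_and_reset")
    return actions

# Epochs with at least one scheduled action over the full 1200-epoch run.
schedule = {e: cycle_actions(e) for e in range(1, 1201) if cycle_actions(e)}
```

Over 1200 epochs this gives 24 cycles, each with two pruning steps, one reintroduction step, and one defender evaluation.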

5.4 Evaluating Attacker’s Policy

In this setup, our objective is to evaluate the performance of our proposed generalized attacking policy, GenRL-TrnF, and assess the impact of our training facilitator on the RL agent’s learning capacity for dynamic AD graphs.

Table 2: Comparison of various attacker policies with 50 specialized trained RL agents (attacker success rates closer to those of 50 SpecRL Agents indicate superior policy performance).

| Graph | Attacker policy | Success rate (%), Ind | Success rate (%), Pos | Success rate (%), Neg | Success rate (%), Avg | Time (hour), Ind | Time (hour), Pos | Time (hour), Neg |
|---|---|---|---|---|---|---|---|---|
| r1000 | 50 SpecRL Agents | 51.83 | 53.76 | 51.76 | 52.45 | 63.73 | 60.02 | 58.09 |
| r1000 | GenRL-TrnF (Proposed) | 54.69 | 49.63 | 53.31 | 52.54 | 18.56 | 19.29 | 21.33 |
| r1000 | GenRL | 46.27 | 45.94 | 45.97 | 46.06 | 15.24 | 13.51 | 16.26 |
| r2000 | 50 SpecRL Agents | 42.31 | 45.52 | 39.97 | 42.60 | 71.54 | 64.68 | 63.59 |
| r2000 | GenRL-TrnF (Proposed) | 40.45 | 41.91 | 42.62 | 41.66 | 23.81 | 25.55 | 26.34 |
| r2000 | GenRL | 35.89 | 35.74 | 36.55 | 36.06 | 18.49 | 19.92 | 18.85 |
| r4000 | 50 SpecRL Agents | 29.04 | 31.37 | 29.14 | 29.85 | 78.25 | 73.91 | 75.03 |
| r4000 | GenRL-TrnF (Proposed) | 32.51 | 25.83 | 25.95 | 28.09 | 26.02 | 28.43 | 28.29 |
| r4000 | GenRL | 23.48 | 20.29 | 18.78 | 20.85 | 23.66 | 22.72 | 23.37 |

Baselines. The comparative attacker’s policies are:

  • GenRL-TrnF (Proposed). A single generalized RL agent serves as the attacker's policy, trained to adapt to 50 distinct graph snapshots with the support of a training facilitator to enhance its training process.

  • GenRL. A single generalized RL agent serves as the attacker's policy, learning from 50 graph snapshots without using any training facilitator.

  • 50 SpecRL Agents. RL is employed as the attacker's policy without a training facilitator. Instead of using a single generalized agent, 50 distinct RL agents are trained, each dedicated to a specific snapshot. This approach aims to develop a more sophisticated attack strategy tailored to each scenario.

Results. To assess the performance of the attacker's policy, we deploy the best defense derived from our GenRL-TrnF+RL-EDO approach across 50 random graph snapshots. Both the GenRL-TrnF and GenRL attacking policies are trained on these snapshots against the best defense for 200 epochs, and we evaluate their performance over 5000 episodes. GenRL-TrnF performs environment pruning every 30 epochs and NN pruning of 2% every 5 minutes. We compare a single generalized RL agent trained across all snapshots against 50 specialized RL agents (50 SpecRL Agents), aiming to quantify performance differences and identify the best attacker strategies. Results averaged over five seeds (0 to 4) of AD graphs are presented in Table 2. Our proposed GenRL-TrnF consistently outperforms the GenRL attacking policy and closely matches the performance of 50 SpecRL Agents across all graph scales. For example, on the r1000 AD graph (Ind distribution), GenRL-TrnF achieves a 54.69% success rate, deviating by only 2.86 percentage points from 50 SpecRL Agents, while GenRL achieves 46.27%, a notably larger deviation of 5.56 percentage points. Similarly, on the r2000 AD graph (Ind distribution), GenRL-TrnF achieves a 40.45% success rate with a deviation of 1.86 percentage points, compared to GenRL's 35.89% success rate with a deviation of 6.42 percentage points from 50 SpecRL Agents. The deviation of GenRL-TrnF and GenRL from 50 SpecRL Agents is illustrated in Figure 3. Our findings demonstrate that integrating a training facilitator into a generalized attacker policy enables GenRL-TrnF to perform competitively with 50 specialized RL agents, showing only slight deviations in success rate. Conversely, GenRL struggles to generalize across graph snapshots, underscoring the crucial role of the training facilitator in enhancing RL policy efficacy by accurately modelling dynamic attacker behaviours.
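The deviations quoted above follow directly from the Ind columns of Table 2, as the short check below shows. The dictionary layout and function name are illustrative; the numbers are copied from Table 2.

```python
# Ind-distribution success rates (%) copied from Table 2.
success_ind = {
    "r1000": {"SpecRL": 51.83, "GenRL-TrnF": 54.69, "GenRL": 46.27},
    "r2000": {"SpecRL": 42.31, "GenRL-TrnF": 40.45, "GenRL": 35.89},
    "r4000": {"SpecRL": 29.04, "GenRL-TrnF": 32.51, "GenRL": 23.48},
}

def deviation(graph, policy):
    """Absolute gap, in percentage points, to the 50 SpecRL Agents reference."""
    row = success_ind[graph]
    return round(abs(row[policy] - row["SpecRL"]), 2)

deviations = {
    g: {p: deviation(g, p) for p in ("GenRL-TrnF", "GenRL")}
    for g in success_ind
}
```

For every graph size, the gap for GenRL-TrnF is smaller than the gap for GenRL under the Ind distribution.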

Figure 3: Comparison of deviation from 50 specialized agents across various attacker policies (smaller deviations indicate superior performance).

5.5 Evaluating Defender’s Policy

This section assesses the performance of our proposed defensive strategy.

Baseline. We compare the best defense from our GenRL-TrnF+RL-EDO approach with the GenRL+C-EDO approach [11]. In the GenRL-TrnF+RL-EDO approach, GenRL serves as the attacker’s policy with the support of a training facilitator, while the defender utilizes RL-EDO. In contrast, the GenRL+C-EDO approach employs RL alone for the attacker without any training facilitator, coupled with C-EDO for the defender.

Results. We train both approaches, GenRL-TrnF+RL-EDO and GenRL+C-EDO, using the attacker-defender approach discussed in Section 5.3 to obtain the best defense. Subsequently, we generate 50 random AD graph snapshots. For each snapshot, we train one specialized RL agent integrated with the training facilitator (GenRL-TrnF) to play against the best defense obtained. We train each specialized RL agent for 200 epochs and evaluate the GenRL-TrnF policy's performance against the best defense through simulations over 5000 episodes. The average success rate across the 50 trained agents is reported in Table 3. The defense yielding the minimal attacker success rate is identified as the best-generalized defense. Our results consistently show that the defense from GenRL-TrnF+RL-EDO outperforms the defense from GenRL+C-EDO in reducing the attacker's success rate across all graph instances. For instance, on the r2000 graph (Ind distribution), the attacker success rate against the GenRL-TrnF+RL-EDO defense is 42.37%, lower than the 46.05% success rate against the GenRL+C-EDO defense. Similarly, for the r1000 and r4000 AD graphs, the defense from the GenRL-TrnF+RL-EDO approach reduces the attacker's success rate compared to the GenRL+C-EDO approach. Our results demonstrate that the proposed GenRL-TrnF+RL-EDO approach consistently generates a superior defense for dynamic AD graphs compared to the baseline approach. Note that the dynamic nature of the AD graph problem poses significant challenges to achieving substantial reductions in attacker success rates; even marginal decreases in success rates can yield significant benefits, considering the substantial costs associated with security breaches for organizations.

Table 3: Comparative analysis of best defense from various attacker-defender approaches (smaller values indicate superior performance).

| Graph | Best defense from policy | Attacker success rate (%), Ind | Attacker success rate (%), Pos | Attacker success rate (%), Neg | Attacker success rate (%), Avg |
|---|---|---|---|---|---|
| r1000 | GenRL-TrnF+RL-EDO (Proposed) | 56.24 | 51.02 | 55.97 | 54.41 |
| r1000 | GenRL+C-EDO | 59.03 | 56.13 | 57.21 | 57.45 |
| r2000 | GenRL-TrnF+RL-EDO (Proposed) | 42.37 | 42.43 | 44.65 | 43.15 |
| r2000 | GenRL+C-EDO | 46.05 | 43.51 | 45.80 | 45.12 |
| r4000 | GenRL-TrnF+RL-EDO (Proposed) | 33.96 | 27.45 | 27.01 | 29.47 |
| r4000 | GenRL+C-EDO | 35.72 | 28.59 | 28.43 | 30.91 |

5.6 Discussion

Our empirical findings underscore the superior performance of our GenRL-TrnF attacker policy compared to the GenRL baseline (Table 2). This improvement is primarily attributed to our training facilitator, which enhances the efficiency of attacker training by systematically pruning irrelevant elements, guided by extensively trained RL agents. We further validated the irrelevance of these elements using a trained RL critic network, ensuring that the critic value before and after removal remains the same. As a result, our generalized GenRL-TrnF attacker policy achieves performance levels comparable to specialized agents without compromising critical network dynamics. By focusing on relevant elements identified through RL agent training, we reduce computational load and strengthen learning capacity. Furthermore, our defensive strategy, obtained through extensive training augmented by the training facilitator, lowers the attacker's average success rate by roughly 1.4 to 3.0 percentage points relative to GenRL+C-EDO (Table 3). Our integrated attacker-defender approaches reinforce each other, as a more robust attacking policy contributes to a more resilient defense. Concurrent training of the RL attacker policy across multiple environments enables quicker learning of shared policies. Our proposed defense strategy could be applied to enhance network security across various sectors: preventing unauthorized lateral movement in enterprise networks, safeguarding resources and data integrity in cloud environments, mitigating cyber-physical risks in IoT, and supporting operational resilience in critical sectors such as utilities, healthcare, and financial systems. Our model currently includes three edge types: AdminTo, HasSession, and MemberOf. However, real-world AD environments feature a broader range of edge types, which limits our ability to fully capture their complexities and vulnerabilities. Our future research aims to expand the model to encompass additional edge types, thereby enhancing the accuracy of our simulations and defense strategies for AD environments.
Although synthetic AD graphs effectively simulate key aspects of real-world environments, they have inherent limitations in replicating complex dynamics and vulnerabilities. Validation against real AD datasets is essential to ensure the generalizability of our findings to practical cybersecurity scenarios. In future research, we will focus on validating our results using real-world AD datasets.

6 Conclusion

In this study, we proposed a dual RL-based strategy for both attacker and defender within dynamic AD graphs. Our innovative training facilitator simplifies the AD graph and neural network structures, enhancing the overall efficacy of our training policy and ensuring scalability to large AD graphs. We conducted experiments on dynamic AD graphs of three different scales: r1000, r2000, and r4000. The empirical evidence demonstrates the superior performance of our approach compared to the baseline, significantly improving both the attacker’s and defender’s performance in dynamic network settings.

References

  • Nandi et al. [2016] Apurba K Nandi, Hugh R Medal, and Satish Vadlamani. Interdicting attack graphs to protect organizations from cyber attacks: A bi-level defender–attacker model. Computers & Operations Research, 75:118–131, 2016.
  • Ahmad et al. [2023] Hussain Ahmad, Isuru Dharmadasa, Faheem Ullah, and Muhammad Ali Babar. A review on C3I systems' security: Vulnerabilities, attacks, and countermeasures. ACM Computing Surveys, 55(9):1–38, 2023.
  • Applebaum et al. [2022] Andy Applebaum, Camron Dennler, Patrick Dwyer, Marina Moskowitz, Harold Nguyen, Nicole Nichols, Nicole Park, Paul Rachwalski, Frank Rau, Adrian Webster, et al. Bridging automated to autonomous cyber defense: Foundational analysis of tabular q-learning. In Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security, pages 149–159, 2022.
  • Nguyen and Reddi [2021] Thanh Thi Nguyen and Vijay Janapa Reddi. Deep reinforcement learning for cyber security. IEEE Transactions on Neural Networks and Learning Systems, 34(8):3779–3795, 2021.
  • Molina-Markham et al. [2021] Andres Molina-Markham, Ransom K Winder, and Ahmad Ridley. Network defense is not a game. arXiv preprint arXiv:2104.10262, 2021.
  • cag [2022a] Cyber operations research gym. https://github.com/cage-challenge/CybORG, 2022a. Created by Maxwell Standen, David Bowman, Son Hoang, Toby Richer, Martin Lucas, Richard Van Tassel, Phillip Vu, Mitchell Kiely, KC C., Natalie Konschnik, Joshua Collyer.
  • Team [2021] Microsoft Defender Research Team. Cyberbattlesim. https://github.com/microsoft/cyberbattlesim, 2021. Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei.
  • Guo et al. [2022] Mingyu Guo, Jialiang Li, Aneta Neumann, Frank Neumann, and Hung Nguyen. Practical fixed-parameter algorithms for defending active directory style attack graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9360–9367, 2022.
  • Guo et al. [2023] Mingyu Guo, Max Ward, Aneta Neumann, Frank Neumann, and Hung Nguyen. Scalable edge blocking algorithms for defending active directory style attack graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • Goel et al. [2022] Diksha Goel, Max Hector Ward-Graham, Aneta Neumann, Frank Neumann, Hung Nguyen, and Mingyu Guo. Defending active directory by combining neural network based dynamic program and evolutionary diversity optimisation. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’22, pages 1191–1199, 2022.
  • Goel et al. [2023] Diksha Goel, Aneta Neumann, Frank Neumann, Hung Nguyen, and Mingyu Guo. Evolving reinforcement learning environment to minimize learner’s achievable reward: An application on hardening active directory systems. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’23, pages 1348–1356, 2023.
  • Microsoft [2023] Microsoft. Microsoft digital defense report, 2023. URL https://www.microsoft.com/en/security/security-insider/microsoft-digital-defense-report-2023/.
  • Dunagan et al. [2009] John Dunagan, Alice X Zheng, and Daniel R Simon. Heat-ray: combating identity snowball attacks using machine learning, combinatorial optimization and attack graphs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 305–320, 2009.
  • Zhang et al. [2023] Yumeng Zhang, Max Ward, Mingyu Guo, and Hung Nguyen. A scalable double oracle algorithm for hardening large active directory systems. In The 18th ACM ASIA Conference on Computer and Communications Security (ACM ASIACCS), Melbourne, Australia, 2023.
  • Ngo et al. [2024] Huy Q Ngo, Mingyu Guo, and Hung Nguyen. Optimizing cyber response time on temporal active directory networks using decoys. arXiv preprint arXiv:2403.18162, 2024.
  • Huang et al. [2023] Chen Huang, Xiangbing Zhou, Xiaojuan Ran, Yi Liu, Wuquan Deng, and Wu Deng. Co-evolutionary competitive swarm optimizer with three-phase for large-scale complex optimization problem. Information Sciences, 619:2–18, 2023.
  • Guo et al. [2024] Mingyu Guo, Jialiang Li, Aneta Neumann, Frank Neumann, and Hung Nguyen. Limited query graph connectivity test. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20718–20725, 2024.
  • Ngo et al. [2023] Quang Huy Ngo, Mingyu Guo, and Hung Nguyen. Near optimal strategies for honeypots placement in dynamic and large active directory networks. In The 22nd International Conference on Autonomous Agents and Multiagent Systems, 2023. Extended Abstract.
  • cag [2021] CAGE Challenge 1, 2021. arXiv.
  • cag [2022b] TTCP CAGE Challenge 2, 2022b.
  • Group [2022] TTCP CAGE Working Group. TTCP CAGE Challenge 3. https://github.com/cage-challenge/cage-challenge-3, 2022.
  • Group [2023] TTCP CAGE Working Group. TTCP CAGE Challenge 4. https://github.com/cage-challenge/cage-challenge-4, 2023.
  • Foley et al. [2022] Myles Foley, Chris Hicks, Kate Highnam, and Vasilios Mavroudis. Autonomous network defence using reinforcement learning. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, pages 1252–1254, 2022.
  • Heckel [2023] Kade Heckel. Neuroevolution for autonomous cyber defense. In Proceedings of Companion Conference on Genetic and Evolutionary Computation, pages 651–654, 2023.
  • Hicks et al. [2023] Chris Hicks, Vasilios Mavroudis, Myles Foley, Thomas Davies, Kate Highnam, and Tim Watson. Canaries and whistles: Resilient drone communication networks with (or without) deep reinforcement learning. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 91–101, 2023.
  • Bates et al. [2023] Elizabeth Bates, Vasilios Mavroudis, and Chris Hicks. Reward shaping for happier autonomous cyber security agents. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 221–232, 2023.
  • Prébot et al. [2022] Baptiste Prébot, Yinuo Du, Xiaoli Xi, and Cleotilde Gonzalez. Cognitive models of dynamic decision in autonomous intelligent cyber defense. In International Conference on Autonomous Intelligent Cyber-defense Agents, Bordeaux, France, 2022.
  • Hebrard et al. [2005] Emmanuel Hebrard, Brahim Hnich, Barry O’Sullivan, and Toby Walsh. Finding diverse and similar solutions in constraint programming. In AAAI, volume 5, pages 372–377, 2005.
  • Do et al. [2022] Anh Do, Mingyu Guo, Aneta Neumann, and Frank Neumann. Analysis of evolutionary diversity optimization for permutation problems. ACM Transactions on Evolutionary Learning, 2(3):1–27, 2022.
  • Neumann et al. [2022a] Aneta Neumann, Yue Xie, and Frank Neumann. Evolutionary algorithms for limiting the effect of uncertainty for the knapsack problem with stochastic profits. International Conference on Parallel Problem Solving from Nature, 2022a.
  • Neumann et al. [2022b] Aneta Neumann, Denis Antipov, and Frank Neumann. Coevolutionary pareto diversity optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 832–839, 2022b.
  • Nikfarjam et al. [2023] Adel Nikfarjam, Ralf Rothenberger, Frank Neumann, and Tobias Friedrich. Evolutionary diversity optimisation in constructing satisfying assignments. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 938–945, 2023.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.