Anomaly-based Framework for Detecting Power Overloading Cyberattacks in Smart Grid AMI

Abdelaziz Amara Korba Nouredine Tamani Yacine Ghamri-Doudane Nour El Islem karabadji Networks and Systems Laboratory (LRS), Badji Mokhtar-Annaba University, Annaba, Algeria Higher School of industrial technologies, Annaba, P.O. Box 218, 23000, Algeria. Electronic Document Management Laboratory (LabGED), Badji Mokhtar-Annaba University, Annaba, Algeria L3i, Univ. of la Rochelle, La Rochelle, France
Abstract

The Advanced Metering Infrastructure (AMI) is one of the key components of the smart grid. It provides interactive services for managing billing and electricity consumption, but it also introduces new vectors for cyberattacks. Despite the severe and potentially devastating impact of power overloading cyberattacks on the smart grid AMI, few studies in the literature have addressed them. In this paper, we propose a two-level anomaly detection framework based on regression decision trees. The approach leverages the regularity and predictability of energy consumption to build reference consumption patterns for the whole neighborhood and for each household within it. Using a reference consumption pattern enables the detection of power overloading cyberattacks regardless of the attacker’s strategy, since such attacks cause a drastic change in the consumption pattern. Continuous two-level monitoring of the energy load allows efficient and early detection of cyberattacks. We carried out extensive experiments on a real-world, publicly available energy consumption dataset of 500 customers in Ireland. From the raw data, we extracted the attributes relevant for training the energy consumption patterns. The evaluation shows that our approach achieves a high detection rate, a low false alarm rate, and superior performance compared to existing solutions.

keywords:
Smart grid, Advanced Metering Infrastructure (AMI), Overloading cyberattacks, Anomaly detection.

1 Introduction

Information and communication technologies have played a crucial role in the growth and performance of smart grids. The advanced metering infrastructure provides a two-way communication network between smart meters and utility systems, offering interactive services for managing billing and electricity consumption. However, interconnecting the distributed elements of the smart grid also introduces new vectors for cyberattacks. The first successful cyberattack on a power grid was recorded in December 2015: it struck the Ukrainian power grid, causing power outages that left more than 100 cities in the dark. The hackers exploited vulnerable points in the infrastructure using a piece of malware known as Black Energy. Several other cyberattacks have followed, showing how a hacker with a piece of malware can take control of a power plant’s circuit breaker and damage generators.

Power overloading is one of the most severe cyberattacks. It aims at increasing the energy load to disrupt the load balance on the local power grid, cause a blackout, and damage the grid infrastructure. An attacker with low-cost equipment could exploit security vulnerabilities at some points of the smart grid communication infrastructure, particularly within smart meters. The exploit may grant the attacker command and control of thousands of smart meters, which can subsequently be used to dramatically increase the demand for electricity and disrupt the load balance on the local power grid. The attacker can also compromise the communication infrastructure or hack the substation, and then send fake pricing information to the local community. By exploiting the vulnerability of load control systems, an attacker can modify the consumption profile of the customers and of the whole neighborhood (more details are provided in Section 4). Despite the severe impact of power overloading cyberattacks, few works [1, 2, 3, 4] in the literature have addressed them.

Traditional anomaly detection systems based on network features do not consider the attack scenarios and inherent characteristics of the smart grid AMI. Current smart grid AMI anomaly detection systems consider fault detection [5] and address mainly two types of cyberattacks: electricity theft and pricing cyberattacks [6, 7, 8]. The goal of energy theft is to pay less than the actual price for the consumed energy; in this case, the attacker can physically tamper with the smart meter or compromise the communication infrastructure [9]. Two common pricing cyberattacks on smart home systems, which manipulate the guideline pricing, have been studied in the literature [2, 3, 4]. In the first one, named cyberattack for bill reduction, the attacker fakes the guideline pricing curve to reduce his own bill at the cost of a bill increase for other customers. The goal of the second cyberattack is to create a peak energy load in the local community.

Most of the existing anomaly detection systems for smart grid AMI in the literature have been proposed for energy theft detection [10, 11, 12, 13, 14, 15, 16] and pricing cyberattacks [2, 3, 4]. However, few studies have addressed overloading cyberattacks that aim at a grid blackout. Although some works [1, 2, 3, 4] addressed grid overloading cyberattacks, they only considered the short-term load increase resulting from pricing manipulation. In this case, the attacker’s goal is to make a profit, not to shut down the power grid. Anomaly detection systems that monitor the guideline pricing curve are only effective if the attacker overloads the grid by manipulating that curve; if the attacker changes strategy, the grid overloading cyberattack goes undetected.

Some existing anomaly detection systems proposed in the literature deal with energy theft and grid overloading in the same way. Although energy theft and grid overloading both correspond to an abnormality in the consumption pattern, the two attacks have their own subtleties and differ in several respects, such as the attacker’s operating mode, the detection delay, and the impact on the AMI. Unlike energy theft, a grid overloading cyberattack causes immediate damage, so the detection delay is critical. The impact of grid overloading cyberattacks extends beyond the smart home; monitoring the energy load at the neighborhood level is therefore needed to avoid cascading failures.

Most of the existing anomaly detection systems in the literature use classification algorithms. One issue with classification-based anomaly detection is the unavailability of malicious samples. Using synthetic malicious samples can solve this issue; however, the classifier would not detect unseen attacks that deviate significantly from the synthetic malicious samples used to train the system.

In this paper, we tackle the issue of grid overloading cyberattacks against the smart grid AMI. We propose a consumption pattern-based anomaly detection framework (CPADF) to detect and prevent grid overloading cyberattacks. CPADF applies feature engineering and regression-based learning algorithms to historical consumption data to generate normal consumption patterns for the whole neighborhood and for each customer within it. The trained models are then used in a decision-making process that we developed to detect anomalies in consumption patterns. To do so, CPADF continuously monitors electricity consumption at both the home and neighborhood levels and aggregates the anomaly alerts received from customers. An abnormal consumption rise is detected at a given level if the observed consumption does not match the corresponding normal consumption pattern.
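The two-level idea can be illustrated with a short sketch. It is not the CPADF decision process itself: the relative-deviation rule, the threshold values, the `min_alert_ratio` aggregation rule, and all function names are assumptions introduced here for illustration only.

```python
def is_abnormal_rise(observed, predicted, threshold):
    """Flag a reading that exceeds the predicted consumption by more than
    `threshold` in relative terms. The rule itself is an assumption."""
    if predicted <= 0:
        return observed > 0
    return (observed - predicted) / predicted > threshold


def two_level_check(sh_observed, sh_predicted, nbh_observed, nbh_predicted,
                    sh_threshold=0.5, nbh_threshold=0.2, min_alert_ratio=0.1):
    """Combine home-level alerts with a neighborhood-level check."""
    # Home level: one alert per smart home whose load exceeds its own pattern.
    alerts = [
        home for home in sh_observed
        if is_abnormal_rise(sh_observed[home], sh_predicted[home], sh_threshold)
    ]
    # Neighborhood level: compare the aggregate load with the NBH pattern.
    nbh_alert = is_abnormal_rise(nbh_observed, nbh_predicted, nbh_threshold)
    # Hypothetical aggregation: many simultaneous home alerts are also suspicious.
    many_alerts = len(alerts) / len(sh_observed) >= min_alert_ratio
    return {"home_alerts": alerts, "attack_suspected": nbh_alert or many_alerts}


# Illustrative readings in kWh per half hour (made-up values).
predicted = {"sh1": 1.0, "sh2": 1.0, "sh3": 1.0}
attacked = two_level_check({"sh1": 0.9, "sh2": 2.4, "sh3": 1.0}, predicted,
                           nbh_observed=4.3, nbh_predicted=3.0)
normal = two_level_check({"sh1": 1.0, "sh2": 1.0, "sh3": 1.0}, predicted,
                         nbh_observed=3.0, nbh_predicted=3.0)
```

In this toy run, only `sh2` raises a home-level alert, and the neighborhood load is also well above its predicted value, so the first call reports a suspected attack while the second does not.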

We carried out experiments on a real-world, publicly available energy consumption dataset of 500 customers in Ireland. We first cleaned the raw data and extracted features from it. Initially, the dataset provides only three attributes (smart meter identifier, timestamp, and energy consumption), from which we extracted attributes relevant for training the energy consumption patterns, such as day time, day type, month, and season. We then generated labelled datasets at both the home and neighborhood levels for model training and testing (the generated labelled datasets are available for free upon request). The evaluation shows that our approach achieves a high detection rate, a low false alarm rate, and superior performance compared to existing solutions, while keeping training time and memory requirements low.

Furthermore, CPADF outperforms existing approaches in covering and detecting sophisticated scenarios of power overloading cyberattacks against the smart grid AMI. Indeed, consumption pattern-based anomaly detection makes CPADF able to detect grid overloading cyberattacks regardless of the attacker’s strategy, whereas most existing solutions are tied to a specific attack strategy and may fail to detect cyberattacks if the attacker changes it.

The remainder of this paper is organized as follows. In Section 2, we summarize the state of the art in the field of anomaly detection systems developed to protect smart grid systems. In Section 3, we present the AMI network architecture. In Section 4, different types of power overloading cyberattacks are studied. Section 5 describes the CPADF framework from data collection to anomaly detection process. We evaluate the performance of CPADF in Section 6. Section 7 concludes the paper and draws some lines for future work.

2 Related work

Most of the existing works in the literature are related to fraud detection, such as electricity theft and pricing cyberattacks. In [2] and [3], the authors considered two smart home pricing cyberattacks: the cyberattack for bill reduction and the cyberattack for forming a peak energy load. In the first cyberattack, the hacker fakes the guideline pricing curve to reduce his own bill at the cost of a bill increase for other customers. The goal of the second cyberattack is to create a peak in energy usage by faking the guideline pricing curve. A countermeasure technique that uses support vector regression and impact difference for detecting pricing manipulation has been proposed in [2]. The proposed system leverages the interdependence between the electricity pricing and the energy load in the power system. It detects the peak energy load by monitoring changes in the guideline pricing curve. To improve the accuracy of the detection system, the authors proposed in [3] a partially observable Markov decision process for modeling the long-term impact of pricing cyberattacks. In [4], the authors introduced a new type of pricing cyberattack, which creates a sharp increase or decrease of the energy load, resulting in a dramatic drop of the generation frequency. To tackle the scalability limitation of the system proposed in [3] and address the new pricing cyberattack, Liu et al. [4] proposed a new hierarchical framework, which models the attacking state of each smart meter in a distributed fashion. The proposed framework employs a global policy optimization algorithm to take a centralized decision on checking and repairing the compromised smart meters. In [2, 3, 4], the attacker’s objective is to make a profit, not to cause a blackout by overloading the grid. Although [2, 3, 4] address grid overloading cyberattacks, they only consider the short-term load increase resulting from pricing manipulation. In contrast, in this paper we consider different types of long-term grid overloading cyberattacks. The anomaly detection systems proposed in [2, 3, 4] are only effective if the attacker creates a peak energy load by manipulating the guideline pricing curve. In this paper, we focus on detecting grid overloading cyberattacks based on consumption pattern changes, regardless of the attacker’s strategy.

Jokar et al. [1] addressed grid overloading as well as energy theft. They considered the scenario in which the attacker increases the energy load by manipulating prices or compromising the direct load control system. The authors of [1] proposed two anomaly detection algorithms based on the predictability of customers’ consumption patterns. In [10], Jokar et al. extended and adapted their proposal to detect energy theft only. They used transformer meters and anomaly detectors, as well as appropriate classification and clustering techniques, to improve the performance and robustness of the algorithm against nonmalicious changes in the consumption pattern. Classification-based methods need malicious samples to train the classifier, which might not be available, since malicious behavior might never or seldom occur for a given customer. Using synthetic malicious samples can solve the problem. However, the classifier would not detect attacks that deviate significantly from the synthetic malicious samples used to train the system. In this paper, we use regression decision trees to predict the consumption profile during a particular time slot, and then we compare the expected profile with the actual one. Our approach is capable of detecting different attack types because it does not rely on a classifier built from a particular type of synthetic malicious samples.

Ford et al. [17] and Cody et al. [18] also addressed grid overloading as well as energy theft. They used artificial neural networks and decision trees, respectively, to model the normal profile of a customer’s energy consumption. Real historical data from the Irish smart energy trial [19] were used to generate the regression models and predict future energy consumption. The anomaly detection systems then compare the predicted value with the actual consumption to detect malicious behaviors. Although the approaches proposed in [17] and [18] overcome the limitations of classification-based methods, they consider only one type of grid overloading cyberattack, and they do not monitor the consumption pattern at the neighborhood level. Monitoring pattern changes at the neighborhood level improves the detection accuracy and reduces the detection delay, since the load increase is more noticeable at the neighborhood level than for a single customer or group of customers, particularly at the beginning of the attack. Moreover, important factors such as memory requirements and processing time have not been considered. By using more attributes to model the consumption pattern, CPADF provides better predictions with lower error rates, and it performs well in terms of memory requirements and processing time. Our tests show that CPADF outperforms the anomaly detection systems proposed in [17] and [18].

Faisal et al. [20] proposed a new intrusion detection system (IDS) architecture for the whole AMI system, at the levels of the smart meter, data concentrator, and headend. A feasibility analysis of several data stream mining algorithms was conducted to select the best algorithm for each AMI component. In [21], Zhang et al. proposed a distributed intrusion detection system for smart grids (SGDIDS) with a hierarchical three-layer structure. The proposed IDS analyzes communication traffic using classification algorithms such as support vector machine (SVM) and artificial immune system (AIS). The systems proposed in [20] and [21] have been validated on the widely used public KDD Cup 1999 dataset [22]. However, this dataset was designed for intrusion detection in computer networks, and the attacks it considers are based on communication scenarios. The dataset does not consider the characteristics inherent to the smart grid infrastructure or attack scenarios against AMI transactions. Furthermore, the KDD dataset [22] has a huge number of redundant records and a biased distribution of attacks.

In [23], an optimal strategy of on-site investigation and monitoring verification for potential anomalies and malware is proposed. Using a Markov decision process framework, and based on the observations from the deployed anomaly detectors, the proposed framework determines the best inspection strategies. Alcaraz et al. [24] examined key security aspects of the Open Charge Point Protocol (OCPP) for communication between electric vehicles, charging points, and the central management system. The paper shows how a hacker can exploit OCPP vulnerabilities to carry out attacks that burden resource reservation related to electric vehicles, steal energy, or overload the grid. For instance, an attacker might inject forged OCPP transactions to destabilize the network or affect its functioning. In [25], the authors analyzed a set of existing anomaly detection approaches based on machine learning, knowledge-based and statistical detection techniques, and information and spectral theory. The authors investigated the functionalities of the detection approaches for context-awareness in smart grid environments. The paper provides a guideline regarding the choice of the most suitable schemes and detection modes. The suitability is examined based on the restrictions of the context and the functional characteristics of the technologies and communication systems. In Section 6.4, we show the suitability of CPADF for the smart grid context according to the set of requirements specified in [25].

3 AMI network architecture

The smart home (SH) constitutes an integral part of the smart grid AMI. It leverages sensors and networking technologies to be in continuous interaction with its internal and external environments. The Energy Services Interface (ESI) is the interface connecting the SH to the smart grid. Although there is a logical separation between the smart meter and the ESI, their functionalities are generally integrated into one physical device (generally the smart meter) for cost effectiveness. The ESI has diverse functionalities, such as remote control of devices, transmission of consumption data to the utility, supervision of Distributed Energy Resources such as wind turbines, management of demand response programs, and charging of Plug-in Electric / Plug-in Hybrid Electric Vehicles (PEV/PHEV). The Energy Management System (EMS) is the entity responsible for managing the diverse appliances and systems within the SH. It enables the SH to adjust its energy consumption to suit the grid’s capacities. The EMS enables the management of high-consumption appliances such as air conditioning systems, and offers remote configuration of the smart home devices [26]. Figure 1 shows the different entities of the AMI network architecture. The connections to the ESI are represented by the green dot-dashed lines, whereas the red dotted lines represent the connections to the EMS. The communication between the smart home and the AMI infrastructure is represented by the blue dashed line. The EMS and ESI are in constant two-way communication to manage the internal environment in coherence with the requirements and capabilities of the external environment [26]. The Home Area Network (HAN) interconnects appliances with the ESI/smart meters and the EMS. The Neighborhood Area Network (NAN) interconnects the smart meters with the data concentrator. The Wide Area Network (WAN) interconnects multiple NANs to the utility headend.

Figure 1: AMI network architecture

4 Power overloading cyberattacks

In this section, we present three types of power overloading cyberattacks, which exploit the vulnerability of load control systems (such as smart home scheduling systems) and the vulnerability of the OCPP protocol. The goal of load control systems is to balance supply and demand to ensure reliable grid operation. Indirect load control (ILC) mechanisms use dynamic pricing to incite customers to adapt their consumption profiles to suit the grid capabilities. There are two dynamic pricing models, which are usually used together. The first model is real-time pricing, where the price is set based on the energy consumption in the local community. The second one is guideline pricing, where the utility predicts the future load, sets a predictive pricing curve, and uses it to guide customers in energy scheduling. Direct load control (DLC) mechanisms allow the utility to directly control the customers’ loads by sending control signals, such as turn on/off, through the AMI. The OCPP is an application protocol for communication between electric vehicles, charging points, and a central management system. One advantage of introducing electric vehicles into smart grids is their bidirectional charging, which allows local and global smoothing of imbalances and load peaks. Alcaraz et al. [24] studied attacks that misuse the OCPP protocol to destabilize power networks and interfere with resource reservation initiated with the electric vehicle. Although that paper provides diverse threat scenarios related to the logical functionality of the OCPP at different stages, here we consider only the power overloading scenario at the transactions and control stage.

Figure 2: Power overloading cyberattacks

4.1 Cyberattacks against ILC

By manipulating the pricing curve, an attacker can modify the consumption profile of the customers and of the whole neighborhood. The attacker can either compromise the communication infrastructure or hack the substation, and then send fake pricing information to the local community. The attacker can also use malware to compromise the smart meter and then modify the received pricing information (see Figure 2). The attack can then be scaled up, potentially to thousands of smart homes, depending on the propagation of the malware [23]. In this paper, we consider the following two cyberattacks against ILC mechanisms, called pricing cyberattacks [2].

4.1.1 Cyberattack for bill reduction

The attacker manipulates the guideline pricing curve so that the electricity price appears high during a particular time slot. This dissuades the other customers from scheduling energy consumption during this time slot, which reduces the local community energy load and, in turn, the real-time electricity price during that slot. The attacker can then schedule his own energy consumption during this time slot and profit by reducing his own bill at the cost of a bill increase for other customers [2].

4.1.2 Cyberattack for forming a peak energy load

The attacker first identifies the peak energy consumption hours and then manipulates the guideline pricing curve so that the price is very low during these hours. Customers will therefore schedule their large, controllable, high-consumption appliances during peak usage hours. This forms a peak in energy consumption, leading to a significant disturbance in the power system. Increasing the energy load fluctuation could also significantly impact the power system dynamics and change the generation frequency dramatically. The attacker could increase the energy load fluctuation by manipulating the guideline prices so that they are very high during one time slot and very low during the next one; the shorter the time slot, the higher the load fluctuation [2] [4].

4.2 Cyberattacks against DLC

The attacker compromises the EMS to send fake “turn on/off” signals ordering a large number of appliances within the premises to be switched on [1]. For instance, the attacker can create a surge by turning air conditioners on during peak usage periods, such as periods of extreme cold or heat, or during the peak usage hours of the day. In this case as well, the attacker can increase the energy load fluctuation by repeatedly sending turn on/off signals to a large number of appliances, particularly high-consumption ones such as air conditioning. This creates disturbances and imbalances in the grid that could trip breakers beyond the targeted neighborhood and cause a large-area blackout. Table 1 summarizes the characteristics of power overloading cyberattacks against load control mechanisms and shows the anomalous consumption pattern changes.

4.3 Cyberattack against OCPP

It has been shown in [24] that an attacker may compromise energy safety if the communication channels are intercepted and the security credentials of an OCPP user/object are known. A hacker might carry out several attacks, such as denial of power resources and services, energy theft, and power overload. As mentioned previously, in this paper we are interested in the power overloading scenario. In smart grids, the majority of charging points are configured to provide bidirectional interfaces for power charging/discharging, so that batteries discharge during peak periods and charge during off-peak times. The central management system defines the charging profiles, which specify, together with their charging schedules, the amount of power that can be supplied per time interval to one or multiple points of charge. To increase power demand at peak periods, the attacker alters the charging profiles so that the energy (in Wh) delivered at peak hours is greater than or equal to that delivered in off-peak periods. The fake charging profiles are then used so that multiple compromised points of charge inject energy into electric vehicles during peak periods.

Table 1: Characteristics of power overloading cyberattacks

| Cyberattacks | Load control mechanisms | Time slots | Usage patterns |
|---|---|---|---|
| Bill reduction | ILC | Random | Decrease of the whole neighborhood consumption; increase of N compromised smart home consumptions |
| Forming peak energy load | ILC/DLC | Peak hours | Increase of the whole neighborhood consumption |
| Increasing load fluctuation | ILC/DLC | Random and short | Succession of energy consumption increase and decrease |

5 CPADF Framework

In this section, we describe CPADF. First, data collection and attribute extraction are described. Next, the regression algorithms used for consumption pattern modeling are presented. Lastly, the anomaly detection processes at the smart home and neighborhood levels are described in detail. Hereinafter, we use the abbreviation SH to refer to smart home and NBH to refer to neighborhood.

5.1 Data collection and attributes extraction

The data collection and training process of the consumption prediction models for the SH and NBH anomaly detectors is illustrated in Figure 3. First, metering data are collected from each SH and from the transformer meter. Then, a dataset is generated for each SH, together with an NBH dataset containing the half-hourly consumption of the whole neighborhood. Each data vector within the SH and NBH datasets includes the electricity consumption along with a set of time- and season-related attributes extracted from the raw data (timestamp and consumption). We consider the following attributes: time, day period (day/night), day type (weekday/weekend), month, and season. These attributes are used to predict electricity consumption. For instance, regarding the day period attribute, electricity consumption most often tends to decrease at night due to the decline of human activities. The day type attribute captures legitimate consumption pattern changes related to the customer’s activity. For instance, consumption on the weekend may drop considerably if the customer usually leaves home for a short vacation. In contrast, a customer who stays at home may consume more electricity than usual by spending more time using entertainment devices such as video games, TV, PC, etc. It is important to underline that electricity usage is directly or indirectly affected by external conditions, particularly seasonal conditions such as weather and temperature. In winter, the weather is cold and dark, people tend to stay at home, and thus consume more electricity on lighting and heating. In summer, the weather is sunny and hot, people tend to be out to enjoy it, and their electricity consumption decreases. The month attribute also needs to be considered, because even within the same season two months can have different consumption patterns, such as September/December or January/March. For instance, the consumption pattern of the 1st of October differs from that of the 21st of December, due to several factors such as the number of daylight hours and the temperature. The NBH prediction model is trained using the periodic NBH global consumption calculated and sent by the transformer meter. The SH/NBH historical electricity consumption data are used to model and predict future electricity consumption. The machine learning algorithms used to model the consumption pattern are described in Section 5.2. The SH prediction models are trained within the data concentrator to overcome the resource limitations of the ESI. Abnormal consumption samples flagged as suspect by the SH anomaly detector are transferred to the data concentrator.
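As an illustration of the attribute extraction step, the sketch below derives the listed attributes from one raw reading. The function name, the day/night boundary hours (07:00 to 19:00), and the month-based season mapping are assumptions made here for brevity; the text above does not specify how day periods or seasons are delimited.

```python
from datetime import datetime

# Month-based season mapping (an assumption, not taken from the paper).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def extract_features(timestamp, consumption_kwh):
    """Derive time and seasonal attributes from one raw meter reading."""
    return {
        "time": timestamp.strftime("%H:%M"),
        # Day/night boundary hours are illustrative assumptions.
        "day_period": "day" if 7 <= timestamp.hour < 19 else "night",
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "day_type": "weekend" if timestamp.weekday() >= 5 else "weekday",
        "month": timestamp.month,
        "season": SEASONS[timestamp.month],
        "consumption": consumption_kwh,
    }


# Hypothetical half-hourly reading taken on Saturday 16 January 2010 at 21:30.
row = extract_features(datetime(2010, 1, 16, 21, 30), 0.42)
```

Each resulting dictionary corresponds to one data vector; the `consumption` field is the regression target and the other fields are the predictors.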

Figure 3: Data collection, training and anomaly detection process

5.2 Modeling consumption patterns

To model the SH/NBH electricity consumption pattern, so that consumption can be predicted at any time of the day, five supervised machine learning algorithms have been used. These algorithms were selected for their known performance and low prediction error rate. The following gives a brief description of the machine learning algorithms used in this paper.

5.2.1 REPTree

REPTree (Reduced Error Pruning Tree) is a fast decision tree learner that uses the information gain ratio (Formula (2)) as splitting criterion, where $D$ is the whole dataset, $m$ is the number of classes, $p_i$ is the frequency of class $i$ in the dataset, and $K$ is the number of subsets $D_j$ generated by the split [27].

$$info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{1}$$

$$GainRatio(A) = \frac{info(D) - \sum\limits_{j=1}^{K} \frac{|D_j|}{|D|} \times info(D_j)}{-\sum\limits_{j=1}^{K} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)} \tag{2}$$
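A small worked example may help in reading Formulas (1) and (2). The sketch below computes both for a toy set of class labels. The function names and the labels are illustrative assumptions, and the split information in the denominator is taken with its conventional leading minus sign so that the ratio is non-negative.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a list of class labels, as in Formula (1)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(labels, subsets):
    """Gain ratio of splitting `labels` into `subsets`, as in Formula (2)."""
    total = len(labels)
    # Weighted entropy remaining after the split.
    weighted = sum(len(s) / total * info(s) for s in subsets)
    # Split information, with the conventional minus sign.
    split_info = -sum(len(s) / total * log2(len(s) / total) for s in subsets)
    return (info(labels) - weighted) / split_info


labels = ["low", "low", "high", "high"]
# A split that separates the classes perfectly versus one that does not help.
perfect = gain_ratio(labels, [["low", "low"], ["high", "high"]])
useless = gain_ratio(labels, [["low", "high"], ["low", "high"]])
```

Here the perfect split removes all of the one bit of entropy and yields a gain ratio of 1, whereas the uninformative split leaves the entropy unchanged and yields 0.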

5.2.2 M5P

M5P combines a decision tree with linear regression. It uses the standard deviation (SD) to determine the best attribute for splitting the dataset at each node [27]. The attribute to be chosen is the one that maximizes the error reduction (Formula (3)), where $D$ is the set of samples reaching the node and $D_i$ is the $i$-th of the $m$ subsets produced by the split.

$$\Delta error = SD(D) - \sum_{i=1}^{m} \left( \frac{|D_i|}{|D|} SD(D_i) \right) \tag{3}$$
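The error reduction of Formula (3) can be checked numerically with a minimal sketch. The function name and the consumption values are made up for illustration, and the population standard deviation (`statistics.pstdev`) is used as an assumption, since the text does not say whether the population or sample form is intended.

```python
from statistics import pstdev


def sd_reduction(values, subsets):
    """Error reduction of Formula (3) for a candidate split of `values`."""
    total = len(values)
    return pstdev(values) - sum(len(s) / total * pstdev(s) for s in subsets)


# Hypothetical half-hourly consumptions (kWh): two night and two evening readings.
loads = [0.2, 0.3, 1.8, 2.0]
# Splitting on an attribute that separates low from high loads...
good_split = sd_reduction(loads, [[0.2, 0.3], [1.8, 2.0]])
# ...versus an attribute that mixes them.
poor_split = sd_reduction(loads, [[0.2, 1.8], [0.3, 2.0]])
```

The first split leaves two homogeneous subsets and therefore a much larger reduction than the second, so it is the one that would be chosen at this node.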

5.2.3 Random Forest

Random forest [28] is an ensemble of unpruned regression trees that uses random feature selection during tree induction. The forest averages the predictions returned by the individual trees.

5.2.4 Artificial Neural Network

An artificial neural network is a computational system consisting of interconnected simple elements called neurons, each of which produces an output from one or more inputs and an activation function (e.g., sigmoid or hyperbolic tangent). In Formula (4), $\varphi$ is the activation function that determines the output value $o$ from the entries $e_i$ and their weights $w_i$ [27].

$$o=\varphi\left(\sum_{i=0}^{P} w_i e_i\right) \tag{4}$$
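The output of a single neuron in Formula (4) can be sketched as follows. The function name `neuron_output` and the default sigmoid activation are illustrative assumptions.

```python
import math


def neuron_output(weights, entries, phi=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """Apply the activation function phi to the weighted sum of the entries."""
    return phi(sum(w * e for w, e in zip(weights, entries)))


# With all weights at zero the weighted sum is 0, so the sigmoid returns 0.5.
print(neuron_output([0.0, 0.0], [1.0, 2.0]))
```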

5.2.5 SVM

SVM is a supervised learning model used for classification and regression problems. For regression, SVM uses the $\varepsilon$-insensitive loss function, which penalizes an error only if it is greater than $\varepsilon$ [29]. The loss $|\xi|_{\varepsilon}$ is therefore defined as:

$$|\xi|_{\varepsilon}=\begin{cases}0 & \text{if } |\xi|\le\varepsilon\\ |\xi|-\varepsilon & \text{otherwise.}\end{cases}$$
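The $\varepsilon$-insensitive loss can be written in one line. The sketch below is only illustrative, and the name `eps_insensitive` is an assumption.

```python
def eps_insensitive(residual, eps):
    """Return |residual|_eps: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(residual) - eps)


print(eps_insensitive(0.05, 0.1))  # inside the tube, not penalized
print(eps_insensitive(-0.3, 0.1))  # outside the tube, penalized by the excess
```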

Using (non-negative) slack variables $\xi_i$ and $\xi_i^{*}$, the final optimization problem to be solved can be formulated as follows:

$$\begin{aligned}
\text{Minimize}\quad & \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\left(\xi_i+\xi_i^{*}\right)\\
\text{subject to:}\quad & y_i-f(x_i,w)\le\varepsilon+\xi_i^{*}\\
& f(x_i,w)-y_i\le\varepsilon+\xi_i\\
& \xi_i,\ \xi_i^{*}\ge 0,\quad i=1,\dots,n
\end{aligned} \tag{5}$$

Here, $x_i$ is an n-dimensional input vector, $y_i$ is the target, $w$ is the weight vector, and $C$ is the penalty for the error term. SVM regression performs linear regression in the high-dimensional feature space using the $\varepsilon$-insensitive loss, while reducing the model complexity by minimizing $\|w\|^{2}$.

The performance of these algorithms is evaluated in the next section (Tables 3 and 4).

Figure 4: CPADF workflow

5.3 Anomaly detection

The CPADF two-level monitoring architecture operates within the two main components of the AMI, the ESI and the data concentrator, as illustrated in Figure 4. Periodic metering data from the SH are fed into the SH preprocessing module, which is responsible for attribute extraction and data preprocessing. Since the NBH consumption depends on the consumption of all SHs in the neighborhood, the same set of time-related and seasonal attributes is extracted from the NBH metering data. The expected consumption calculated by the prediction model is then compared with the received consumption value.

After the SH metering data are preprocessed into the format of the SH training set, the SH regression model predicts the SH electricity consumption for the given attribute vector. If, within $N$ successive time intervals, the number of times an anomaly (consumption increase) is detected exceeds a certain threshold ($Nbr_{incr}$), then an anomaly is reported and the suspect sample is sent to the data concentrator. Otherwise, the processed sample is added to the benign dataset, which is periodically transferred to the data concentrator for retraining. The threshold $Nbr_{incr}$ specifies the number of tolerable successive abnormal consumption increases. It is used to mitigate false alerts caused by occasional legitimate consumption increases, which may result from operating one or more high-consumption appliances (washing machine, dishwasher, vacuum cleaner, oven, etc.) outside their usual operating time. On average, such appliances are used for between 40 and 120 minutes. For instance, a washing machine cycle lasts on average between 20 and 60 minutes, and most dishwasher cycles last about 2 hours. Thus, the threshold is set to 2 (which corresponds to two successive time intervals of 1 hour). The threshold is set and updated by the system administrator.

$$RMSE=\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j-\hat{y}_j\right)^{2}} \tag{6}$$

The prediction root mean squared error (RMSE) is used as the prediction error (PE) for both the SH and NBH anomaly detection algorithms. RMSE is the square root of the average of squared differences between the prediction ($\hat{y}$) and the actual observation ($y$) (see Equation (6)). Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors, so its value increases with the variance of the frequency distribution of error magnitudes. Using the RMSE as the prediction error when calculating the threshold reduces the number of legitimate consumption increases that are flagged as anomalous. The prediction models are retrained when the benign datasets contain enough instances. Periodic retraining allows CPADF to adapt to consumption pattern changes related to non-malicious factors such as changes of residents or appliances. The pseudocode of the SH anomaly detection algorithm is provided in Algorithm 1.

BEGIN
Input: $SH_{meter}$ (SH metering data);
Output: anomaly (boolean);
Variables: t (time interval), $Pred_{SH}$ (SH prediction model), $SH_{Cons}$ (SH observed consumption), counter (number of times an increase is detected), $SH_{PE}$ (prediction error), $Nbr_{incr}$ (threshold of successive consumption increases), $BD_{SH}$ (SH benign dataset);

for $t \in \{1, \dots, 24\}$ do
    Calculate attributes vector $NS_{SH}$ from $SH_{meter}$;
end for

$SH_{PC} = Pred_{SH}(NS_{SH})$;

if ($SH_{Cons} > SH_{PC} + SH_{PE}$) then
    if ($counter > Nbr_{incr}$ during the last N time intervals) then
        anomaly = true;
        Send alert to decision maker;
        Transfer the suspect sample to the data concentrator;
    else
        counter++;
        Add $NS_{SH}$ to $BD_{SH}$;
    end if
else
    Add $NS_{SH}$ to $BD_{SH}$;
end if

Periodic transfer of $BD_{SH}$ to the data concentrator;
END
Algorithm 1 SH Anomaly detection algorithm
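The household-level check of Algorithm 1 can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: the class name `SHDetector`, the sliding window of the last `n` intervals, and the default `n=4` are assumptions (the text does not give a value for N). As in Algorithm 1, every sample that is not reported is added to the benign dataset.

```python
from collections import deque


class SHDetector:
    """Reports an anomaly when more than `nbr_incr` of the last `n` hourly
    readings exceed the predicted consumption plus the prediction error."""

    def __init__(self, predict, pred_error, nbr_incr=2, n=4):
        self.predict = predict         # callable: attribute vector -> expected kWh
        self.pred_error = pred_error   # RMSE of the model on validation data
        self.nbr_incr = nbr_incr       # tolerated number of abnormal increases
        self.recent = deque(maxlen=n)  # 1 = abnormal increase, 0 = normal
        self.benign = []               # samples kept for periodic retraining

    def observe(self, attrs, consumption):
        increase = consumption > self.predict(attrs) + self.pred_error
        self.recent.append(1 if increase else 0)
        if increase and sum(self.recent) > self.nbr_incr:
            return True                # caller alerts the decision maker
        self.benign.append((attrs, consumption))
        return False


# Hypothetical model that always expects 1.0 kWh, with an RMSE of 0.5.
detector = SHDetector(predict=lambda attrs: 1.0, pred_error=0.5)
readings = [1.2, 3.0, 3.0, 3.0, 1.0]
print([detector.observe({"hour": h}, c) for h, c in enumerate(readings)])
```

With the threshold set to 2, the first two abnormal readings are tolerated and the third one within the window is reported.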

At the SH level, the consumption monitoring period is set to 1 hour: because humans typically operate on hourly intervals, it is difficult to notice a pattern change over a smaller time interval. However, consumption pattern changes across a large number of SHs, even over a shorter period of time, result in a drastic change in the consumption pattern of the whole NBH. Therefore, at the NBH level, the monitoring period is set to the smart meter data collection frequency (30 minutes) to minimize the detection delay. The total NBH electricity consumption is measured by the transformer meter. After the NBH attribute vector is calculated and preprocessed, it is given to the NBH regression model to predict consumption. If the received consumption is larger than the sum of the predicted consumption and the prediction error (RMSE), then a Neighborhood Abnormal Consumption Raise (NACR) is detected, the suspect sample is stored, and an alert is sent to the decision maker. Otherwise, the processed sample is added to the benign dataset ($BD_{NBH}$) for the periodic retraining of the NBH consumption profile. The pseudocode of the NBH anomaly detection algorithm is provided in Algorithm 2.

BEGIN
Input: $NBH_{meter}$ (NBH metering data)
Output: NACR (boolean)
Variables: t (time interval), $Pred_{NBH}$ (NBH prediction model), $NBH_{PE}$ (prediction error), $NBH_{Cons}$ (NBH observed consumption), $BD_{NBH}$ (NBH benign dataset)
for $t \in \{1, \dots, 48\}$ do
    Calculate attributes vector $NS_{NBH}$ from $NBH_{meter}$;
    $NBH_{PC} = Pred_{NBH}(NS_{NBH})$;
    if ($NBH_{Cons} > NBH_{PC} + NBH_{PE}$) then
        NACR = true;
        Send alert to the decision maker;
        Store the suspect sample;
    else
        NACR = false;
        Add $NS_{NBH}$ to $BD_{NBH}$;
    end if
end for
Periodic training of $Pred_{NBH}$;
END
Algorithm 2 NBH Anomaly Detection Algorithm
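A minimal Python sketch of the neighborhood-level check in Algorithm 2 is given below. The function names `nacr` and `monitor_nbh` and the constant predictor used in the example are illustrative assumptions; in CPADF the prediction would come from the trained NBH regression model.

```python
def nacr(predict, pred_error, attrs, consumption):
    """True when the observed neighborhood load exceeds the predicted
    load by more than the prediction error (RMSE)."""
    return consumption > predict(attrs) + pred_error


def monitor_nbh(predict, pred_error, readings):
    """Split half-hourly (attrs, consumption) readings into suspect samples
    and benign samples kept for periodic retraining."""
    suspects, benign = [], []
    for attrs, consumption in readings:
        if nacr(predict, pred_error, attrs, consumption):
            suspects.append((attrs, consumption))  # alert the decision maker
        else:
            benign.append((attrs, consumption))
    return suspects, benign


# Hypothetical model expecting 100 kWh per slot, with an RMSE of 10 kWh.
expected = lambda attrs: 100.0
readings = [({"slot": 1}, 105.0), ({"slot": 2}, 130.0), ({"slot": 3}, 110.0)]
suspects, benign = monitor_nbh(expected, 10.0, readings)
print(len(suspects), len(benign))
```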

The decision maker module confirms the anomaly and notifies the operator in two cases: 1) more than half of the SH anomaly detection systems in the NBH report an anomaly; 2) a neighborhood abnormal consumption raise is reported. The first case corresponds to a bill reduction cyberattack, and the second to a peak energy load cyberattack. The utility headend then checks whether the detected anomaly is caused by a cyberattack or by a temporary pattern change such as a special occasion. Once the decision is made, the appropriate response is triggered and the samples are stored in either the attack or the benign datasets. Initially, the attack datasets are empty unless external sources are used; malicious samples classified by the decision maker are added to them over time. Once the two datasets are large enough, they are used to build new classifiers for SH and NBH. These classifiers constitute a second detection level and a decision support system for the utility headend. This approach avoids the issues related to training classifiers on synthetic malicious samples. If an NACR is detected but no SH anomalies have been reported, an attack might be occurring that the SH anomaly detection system cannot recognize. In this situation, the SH dataset is analyzed for signs of a gradual overloading cyberattack, in which the attacker gradually increases the consumption data to mislead the learning algorithm into treating a malicious pattern as a normal one. The long-term tendency in the daily consumption (historical data) of the smart home is analyzed: a gradual overloading can be characterized by an ascending slope in the long-term consumption curve. The pseudocode of the decision-making algorithm is provided in Algorithm 3.

BEGIN
Input: NACR (boolean), $Nb_{alert}$
Output: attack (boolean)
Variables: $Nb_{SH}$ (number of SH in the neighborhood)
for $t \in \{1, \dots, 48\}$ do
    if (NACR == true) || ($Nb_{alert} > \frac{1}{2} * Nb_{SH}$) then
        attack = true;
        Send an alert to the operator;
    else
        Add $NS_{NBH}$ / $NS_{SH}$ to $BD_{NBH}$ / $BD_{SH}$;
    end if
end for

if (attack is confirmed) then
    Add $NS_{NBH}$ / $NS_{SH}$ to the attack datasets;
    Initiate a response;
else
    Add $NS_{NBH}$ / $NS_{SH}$ to $BD_{NBH}$ / $BD_{SH}$;
end if

END
Algorithm 3 Decision Making Algorithm
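The decision rule of Algorithm 3 and the long-term trend analysis described in the text can be sketched as follows. All three function names are hypothetical, and the least-squares slope is only one possible way to quantify an "ascending slope"; the source does not specify how the trend is measured.

```python
def confirm_anomaly(nacr, nb_alert, nb_sh):
    """Alert the operator if an NACR is raised or if more than half of the
    smart homes in the neighborhood report an anomaly."""
    return bool(nacr or nb_alert > nb_sh / 2)


def needs_gradual_check(nacr, nb_alert):
    """An NACR without any SH alert suggests a gradual overloading attack."""
    return bool(nacr and nb_alert == 0)


def trend_slope(daily_totals):
    """Least-squares slope of daily consumption against the day index
    (requires at least two days of data)."""
    n = len(daily_totals)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_totals) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_totals))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


print(confirm_anomaly(False, 6, 10))         # 6 of 10 homes alerting
print(trend_slope([10.0, 10.5, 11.1, 11.4])) # positive slope: rising trend
```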
Table 2: Raw data file format
Meter ID | Encoded date/time | Energy consumption (kWh)
1392 | 19535 | 0.256
1392 | 19538 | 0.265
1951 | 19604 | 0.042
1951 | 19605 | 0.021

6 Experimental results

In our experiments, we used the smart meter energy consumption data from the Irish Smart Energy Trial [19], released by SEAI in January 2012. The dataset was created within the Smart Metering Electricity Customer Behaviour Trials (CBTs), which took place from 2009 to 2010. The purpose of the trials was to assess the impact on consumers' electricity consumption in order to inform the cost-benefit analysis for a national rollout. The dataset contains the energy consumption data of over 5000 residential households and businesses [19]. It consists of six data files with millions of entries per file. Each data file contains three columns. The first column gives the smart meter ID, which identifies a particular resident or business. The second column is a timestamp encoding the date and time of the meter reading: digits 1-3 represent the day code (day 1 = 1 January 2009), and digits 4-5 the time code (1-48, one per 30-minute interval, with 1 = 00:00:00 – 00:29:59). The third column gives the energy consumption value in kilowatt-hours (kWh). Table 2 shows a small sample of the raw data.
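To make the timestamp encoding concrete, the following sketch decodes a five-digit code into the start of its half-hour slot. The function name `decode_timestamp` is hypothetical; the day and time conventions are those stated above.

```python
from datetime import datetime, timedelta


def decode_timestamp(code):
    """Decode a five-digit code: digits 1-3 are the day (day 1 = 1 January
    2009) and digits 4-5 the half-hour slot (1 = 00:00:00 to 00:29:59)."""
    text = str(code).zfill(5)
    day, slot = int(text[:3]), int(text[3:])
    return datetime(2009, 1, 1) + timedelta(days=day - 1, minutes=30 * (slot - 1))


# First row of Table 2: day 195, slot 35.
print(decode_timestamp(19535))  # 2009-07-14 17:00:00
```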

6.1 Datasets generation and preprocessing

The raw dataset includes the energy consumption data of all customers. To model each customer's consumption pattern separately, the raw consumption data are split by meter ID into a collection of consumption datasets. For each customer dataset, a set of attributes is generated. Each vector in the new dataset includes the following attributes: SH consumption per hour; hour (1, …, 24); day type (weekday or weekend); month; and season. Out of every four consecutive weeks, one week is randomly chosen for the validation set and the other three weeks for the training set. Thus, 75% of the dataset is used for training and 25% for validation. The NBH dataset is generated from the raw data in the same way. Each vector in the NBH dataset consists of the half-hourly consumption of the whole NBH (the sum of the meter readings of all customers within the neighborhood) and the same attributes used for the SH datasets: hour, day type, day period, month and season. Again, out of every four consecutive weeks, one week is randomly chosen for the validation set and the other three weeks for the training set.
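The attribute generation and the weekly split described above might look as follows in Python. This is a sketch under stated assumptions: the function names, the dictionary layout, the mapping of months to seasons, and the convention that hour 1 covers 00:00 to 00:59 are all illustrative choices that the text does not specify.

```python
import random
from datetime import datetime

# Assumed mapping of months to seasons (not specified in the text).
SEASONS = {12: "winter", 1: "winter", 2: "winter", 3: "spring", 4: "spring",
           5: "spring", 6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def sh_attributes(ts, consumption_kwh):
    """Build one SH attribute vector from a reading time and its consumption."""
    return {
        "hour": ts.hour + 1,  # 1..24
        "day_type": "weekend" if ts.weekday() >= 5 else "weekday",
        "month": ts.month,
        "season": SEASONS[ts.month],
        "consumption": consumption_kwh,
    }


def validation_weeks(week_ids, seed=0):
    """Pick one random week out of every four consecutive weeks."""
    rng = random.Random(seed)
    weeks = sorted(set(week_ids))
    return {rng.choice(weeks[i:i + 4]) for i in range(0, len(weeks), 4)}


print(sh_attributes(datetime(2009, 7, 14, 17, 0), 0.256))
print(validation_weeks(range(8)))
```

Samples whose week index falls in the returned set go to the validation set, and the remaining weeks form the training set.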

Figure 5: Datasets generation process

Data preprocessing includes operations such as cleaning and normalization. The cleaning task consists of identifying missing values and eliminating outliers and extreme values. Outliers and extreme values such as peak energy consumption may correspond to unusual activities such as holidays or special occasions. The mean and the standard deviation $\sigma$ are calculated for each time interval within each month, and all consumption values that do not lie within three $\sigma$ of the mean are removed from the dataset. Figure 5 summarizes the dataset generation process. As indicated previously, different time intervals are used for the SH and NBH datasets: at the SH level it is difficult to notice a pattern change over a small time interval because humans typically operate on hourly intervals, whereas at the NBH level such a change can be noticed.
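The three-$\sigma$ cleaning rule can be sketched as below. The function name `remove_outliers`, the tuple layout of a sample, and the use of the population standard deviation are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def remove_outliers(samples):
    """Keep the (month, interval, kwh) samples whose consumption lies within
    three standard deviations of the mean of their (month, interval) group."""
    groups = defaultdict(list)
    for month, interval, kwh in samples:
        groups[(month, interval)].append(kwh)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    kept = []
    for month, interval, kwh in samples:
        mu, sigma = stats[(month, interval)]
        if abs(kwh - mu) <= 3 * sigma:
            kept.append((month, interval, kwh))
    return kept


# Twenty ordinary readings and one extreme value for the same month and interval.
samples = [(1, 5, 0.5)] * 20 + [(1, 5, 50.0)]
print(len(remove_outliers(samples)))
```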

6.2 Energy consumption prediction

We used Weka [30], a collection of open-source machine learning algorithms for data mining tasks, in our experiments. The performance of each of the five algorithms discussed in the previous section is measured in terms of the following metrics:

  1. Mean Absolute Error (MAE): measures the average absolute difference between the predicted value and the actual value in the validation dataset (see Equation (7)).

  2. Root Mean Squared Error (RMSE): measures the square root of the average of squared differences between the predicted value and the actual value in the validation dataset (see Equation (6)).

  3. Running time (in seconds): the time taken to build the model.

  4. Model size (KB): the size of the prediction model in kilobytes.

$$MAE=\frac{1}{n}\sum_{j=1}^{n}\left|y_j-\hat{y}_j\right| \tag{7}$$
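Equations (6) and (7) translate directly into code. The sketch below uses hypothetical function names and toy values; it also shows why RMSE reacts more strongly than MAE to a single large error.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error between observations and predictions."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between observations and predictions."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]  # one large error of 4
print(mae(actual, predicted), rmse(actual, predicted))  # 1.0 2.0
```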

The results of the energy consumption prediction for 500 customers in Table 3 show that the M5P algorithm gives the smallest average errors (MAE and RMSE) within a reasonable running time and with a low memory requirement. Therefore, M5P is the best algorithm for SH energy prediction. The MLP algorithm shows the highest error rates and the longest running time, and random forest has by far the largest model size. Concerning NBH energy prediction, Table 4 shows that the REPTree algorithm provides the best trade-off between error rates, running time and memory requirement. Therefore, we choose the REPTree algorithm for NBH energy prediction and the M5P algorithm for SH energy prediction.

Table 3: Performances of regression algorithms on SHs data
Algorithm | MAE | RMSE | Running Time (s) | Model Size (KB)
REPTree | 0.395 | 0.534 | 0.123 | 9.979
M5P | 0.329 | 0.453 | 0.724 | 13.273
RandomForest | 0.403 | 0.549 | 3.366 | 3,810.904
SVM | 0.372 | 0.508 | 6.132 | 110.735
MLP | 0.426 | 0.562 | 17.263 | 16.286
Table 4: Performances of regression algorithms on NBH data
Algorithm | MAE | RMSE | Running Time (s) | Model Size (KB)
REPTree | 351,803 | 478,770 | 5,000 | 113,000
M5P | 353,312 | 480,546 | 21,000 | 86,000
Random Forest | 350,087 | 476,013 | 59,000 | 11386,000
SVM | 585,336 | 773,824 | 12832,000 | 2709,000
MLP | 451,850 | 582,765 | 246,000 | 15,000

6.3 Overloading cyberattacks detection

To the best of our knowledge, no real smart grid AMI transaction dataset including overloading cyberattack data is publicly available. Thus, we simulate the power overloading cyberattacks against ILC/DLC discussed in Section 5 for 500 customers. We implement these attacks on top of the datasets of normal samples: for each instance of the dataset, we generate four types of malicious samples as follows (refer to Table 5 for the variable descriptions):

  1. Attack of type 1: this attack simulates forming a peak energy load, where the attacker attempts to overload the grid during times of high demand, when the grid is already under pressure:

    $$\begin{aligned}
    M_1(e) &= e+\alpha_t\\
    \alpha_t &= \begin{cases}random(0.8,4), & Peak_{start}\le t\le Peak_{end}\\ 1 & \text{otherwise}\end{cases}\\
    \text{Peak hours} &: \{7\text{-}9\},\ \{19\text{-}22\}
    \end{aligned}$$

    where $e$ is the normal consumption value and $M$ is the modified consumption value.

  2. Attack of type 2: this attack simulates a bill reduction cyberattack, where the attacker manipulates the guideline price by setting a low price from $time_{start}$ to $time_{end}$, to urge customers in the community to schedule their energy use during this period, and a high price at the other time slots, during which the attacker can schedule his own energy load. Thus, the energy load increases from $time_{start}$ to $time_{end}$.

    $$\begin{aligned}
    M_2(e) &= \beta_t+e\\
    \beta_t &= \begin{cases}random(0.8,4), & time_{start}\le t\le time_{end}\\ 1 & \text{otherwise}\end{cases}\\
    time_{start} &= random(0,\ 23-minOffTime)\\
    duration &= random(minOffTime,\ 24)\\
    time_{end} &= time_{start}+duration\\
    minOffTime &= 4
    \end{aligned}$$

  3. Attack of type 3: to ensure a higher impact, the attacker may attempt to create a sharp energy increase. This attack simulates a variant of the peak energy load cyberattack in which the amount of energy increase is greater than in the attack of type 1.

  4. Attack of type 4: this attack simulates an increasing load fluctuation cyberattack. The attacker alternates repeatedly between normal behavior and grid overloading to disturb the grid.
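The generation of malicious samples for attacks of types 1 and 2 can be sketched as follows. This is a literal reading of the formulas under several assumptions: the function names are hypothetical, `random` is taken to be a uniform draw (integer-valued for the hours), the offset is 1 outside the attack window as the formulas state, and the end of the type 2 window is taken to be the start time plus the duration.

```python
import random

PEAK_HOURS = set(range(7, 10)) | set(range(19, 23))  # hours 7-9 and 19-22


def attack_type1(e, hour, rng):
    """Type 1: add a random offset to the consumption during peak hours."""
    alpha = rng.uniform(0.8, 4) if hour in PEAK_HOURS else 1
    return e + alpha


def attack_window(rng, min_off_time=4):
    """Draw the start and end hours of a type 2 attack window."""
    start = rng.randint(0, 23 - min_off_time)
    duration = rng.randint(min_off_time, 24)
    return start, start + duration


def attack_type2(e, hour, window, rng):
    """Type 2: add a random offset to the consumption inside the window."""
    start, end = window
    beta = rng.uniform(0.8, 4) if start <= hour <= end else 1
    return e + beta


rng = random.Random(1)
window = attack_window(rng)
print(attack_type1(0.5, 8, rng), attack_type2(0.5, window[0], window, rng))
```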

Figure 6: A sample one-day period of energy consumption. (a) Normal consumption pattern. (b) Attack consumption patterns.
Table 5: Variables description

| Variable | Description |
|---|---|
| M1 | Malicious consumption pattern generated through attack of type 1 |
| M2 | Malicious consumption pattern generated through attack of type 2 |
| $e$ | Normal consumption |
| $Peak_{start}$ | Starting time of peak hours, e.g., 7 am |
| $Peak_{end}$ | Ending time of peak hours, e.g., 10 pm |
| $\alpha_t$ | Amount of electricity increase in attack of type 1 |
| $\beta_t$ | Amount of electricity increase in attack of type 2 |
| $time_{start}$ | Starting time of attack of type 2 |
| $time_{end}$ | Ending time of attack of type 2 |
| minOffTime | Minimal duration of attack of type 2, e.g., 4 hours |

Figure 6(a) shows an example of the energy consumption of a particular customer during a single day. Figure 6(b) illustrates the corresponding attack patterns.

We simulated the same four attack types on the neighborhood (NBH) dataset, using its normal samples as the basis. For each instance of the NBH dataset, we generated four types of malicious samples in the same way as described previously, adjusting the amount of energy consumption increase to the neighborhood average consumption.

The performance of the anomaly detection algorithms is measured in terms of the following metrics: Accuracy (AC), True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR). As Table 6 shows, the average RMSE on SH attack samples (RMSE-A) deviates considerably from the average RMSE on SH normal samples (RMSE) regardless of the attack type. The deviation is larger for attacks of types 3 and 4 (RMSE-A of 2.625 and 2.617, against 0.965 and 0.942 for types 1 and 2) because the amount of energy increase is larger. The SH anomaly detection algorithm delivers high accuracy and detection rate with low false positive and false negative rates. We observe the best detection rate on attack of type 4 (TPR of 0.991), and the lowest false positive rate on attack of type 1 (FPR of 0.104).
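For reference, the five metrics above follow directly from the confusion counts of a detector. The short sketch below uses the standard definitions; the function name `detection_metrics` and the counts in the usage example are hypothetical and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute AC, TPR, FPR, TNR and FNR from confusion counts.

    tp: attacks flagged as attacks      fn: attacks classified as normal
    tn: normal samples left unflagged   fp: normal samples flagged as attacks
    """
    positives = tp + fn  # all attack samples
    negatives = tn + fp  # all normal samples
    return {
        "AC": (tp + tn) / (positives + negatives),
        "TPR": tp / positives,
        "FPR": fp / negatives,
        "TNR": tn / negatives,
        "FNR": fn / positives,
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, tn=90, fn=10)
```

By construction TPR + FNR = 1 and FPR + TNR = 1, which is why each TPR/FNR pair and each FPR/TNR pair in Tables 6 and 7 sums to one.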

To highlight the trade-off between TPR and FPR, we relied on the Receiver Operating Characteristic (ROC) curve, which plots the TPR (y-axis) against the FPR (x-axis). Figure 7 shows the ROC curves of three customers with the best, intermediate, and worst performances; panels (a) to (e) correspond to attacks of types 1, 2, 3, and 4, and to all the attack types combined, respectively. The curves lie close to the top-left corner, indicating good performance in detecting the four attack types, combined or separately. The ROC curve in Figure 7(d) confirms that CPADF delivers the best detection performance on attack of type 4. Figure 7(e) shows the capacity of CPADF to maintain a high detection rate with a low false positive rate against all attack types combined.

A summary of the NBH anomaly detection results is listed in Table 7. The NBH detection algorithm also shows high detection performance: we observe the best detection rate on attack of type 3 (TPR of 0.998), and the lowest false positive rate on attack of type 1 (FPR of 0.108).
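A ROC curve of the kind discussed above can be obtained by sweeping a decision threshold over per-sample anomaly scores and recording the (FPR, TPR) pair at each step. The sketch below is a generic construction, assuming that a higher score means a more suspicious sample (for instance a prediction error); the scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    scores: anomaly score per sample (higher means more suspicious).
    labels: 1 for attack, 0 for normal.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: three normal samples and three attack samples.
scores = [0.3, 0.4, 0.9, 1.0, 2.6, 2.7]
labels = [0, 0, 1, 0, 1, 1]
curve = roc_points(scores, labels)
area = auc(curve)
```

A curve that hugs the top-left corner yields an area close to 1, which is the visual property the ROC figures are read for.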

Figure 7: ROC curves of SH anomaly detection system. (a) Attack of type 1. (b) Attack of type 2. (c) Attack of type 3. (d) Attack of type 4. (e) All attacks.
Table 6: SHs Anomaly Detection.

| Attack | RMSE | RMSE-A | AC | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Type 1 | 0.332 | 0.965 | 0.901 | 0.918 | 0.104 | 0.896 | 0.082 |
| Type 2 | 0.332 | 0.942 | 0.903 | 0.920 | 0.111 | 0.889 | 0.080 |
| Type 3 | 0.332 | 2.625 | 0.924 | 0.989 | 0.120 | 0.880 | 0.011 |
| Type 4 | 0.332 | 2.617 | 0.947 | 0.991 | 0.120 | 0.880 | 0.009 |
| All | 0.332 | 2.673 | 0.907 | 0.965 | 0.161 | 0.839 | 0.035 |
Table 7: NBH Anomaly Detection

| Attack | RMSE | RMSE-A | AC | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Type 1 | 478.770 | 1168.923 | 0.890 | 0.882 | 0.108 | 0.892 | 0.118 |
| Type 2 | 478.770 | 1130.237 | 0.893 | 0.909 | 0.121 | 0.879 | 0.091 |
| Type 3 | 478.770 | 2678.249 | 0.899 | 0.998 | 0.168 | 0.832 | 0.002 |
| Type 4 | 478.770 | 2687.301 | 0.931 | 0.993 | 0.168 | 0.832 | 0.007 |
| All | 478.770 | 2923.112 | 0.901 | 0.958 | 0.166 | 0.834 | 0.042 |

6.4 Discussion and comparison

CPADF shows high accuracy in detecting the four attack types, combined or separately, at both SH and NBH levels. However, in the context of smart grids, the two classes (attack and normal) are not equally important. TPR is the metric to prioritize when a false negative (an attack classified as normal) has a high impact: it is safer for the system to tolerate a false positive (a normal consumption change detected as an attack) than a false negative. The impact of a false negative would be extremely high if the target system is connected to other systems (cascading failures). Against all attacks combined, the TPR is higher than 96% and the FNR is lower than 4% at the SH level (Table 6), while the NBH level reaches a TPR of 95.8% and an FNR of 4.2% (Table 7). The superior TPR on attack types 3 and 4 shows the effectiveness of CPADF when there are drastic changes in consumption patterns, as illustrated in Figure 8. The highest FNR is observed on attack of type 1, because during peak hours, differentiating between legitimate and malicious consumption increases is more challenging. This is also in part because the random generation of the amount of energy increase can in some cases return consumption values that are close to normal consumption values. For the same reasons, a slight drop in TPR at the NBH level can be observed for attack types 1 and 2 (see Figure 8). Overall, the results show the effectiveness and high performance of CPADF in detecting different types of overloading cyberattacks at the SH and NBH levels.

According to [25], an anomaly-based detection system must fulfil a set of requirements to be suitable for the smart grid context: operational performance ([R1]); reliability and integrity in the control ([R2]); resilience ([R3]); security ([R4]); and privacy ([R5]). CPADF complies with the security and resilience requirements ([R3, R4, R5]) thanks to periodic retraining, which ensures incremental learning that updates the knowledge of the system with new legitimate consumption patterns. Using RMSE in the threshold calculation allows controlling subtle changes, while two-level monitoring (home and neighborhood) of the consumption load enables controlling drastic changes in electricity consumption and load demand. The low computational complexity and fast learning of decision trees, along with their human-comprehensible outputs [25], make CPADF meet the operational requirement ([R1]). Furthermore, CPADF's two-level monitoring, accuracy, and low false positive and false negative rates allow understanding the electricity consumption changes so as to act accordingly ([R2]).
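The interplay between RMSE-based thresholds and two-level monitoring can be pictured with a small sketch. It is not CPADF's actual decision rule: the function `two_level_check`, the threshold values, and the readings are hypothetical, and the predictions are assumed to come from some previously trained consumption model.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted consumption."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def two_level_check(home_obs, home_pred, nbh_obs, nbh_pred,
                    home_threshold, nbh_threshold):
    """Compare the prediction error at each level with its own threshold.

    Both thresholds are hypothetical inputs; how they are derived is
    outside the scope of this sketch.
    """
    home_error = rmse(home_obs, home_pred)
    nbh_error = rmse(nbh_obs, nbh_pred)
    return {
        "home_rmse": home_error,
        "nbh_rmse": nbh_error,
        "home_alert": home_error > home_threshold,
        "nbh_alert": nbh_error > nbh_threshold,
    }


# Made-up readings: one household deviates sharply from its predicted
# pattern while the neighborhood aggregate stays close to its prediction.
result = two_level_check(
    home_obs=[0.5, 0.6, 3.0, 3.2], home_pred=[0.5, 0.5, 0.6, 0.6],
    nbh_obs=[900, 950, 1000, 980], nbh_pred=[880, 940, 1010, 990],
    home_threshold=0.5, nbh_threshold=500.0,
)
```

Keeping a separate threshold per level lets a single abnormal household raise an alert even when the aggregate load still looks ordinary, and vice versa.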

Figure 8: Comparison of TPR among attack types
Table 8: Comparison among anomaly detection systems

| Metric | Jokar et al. [10] | Ford et al. [17] | Cody et al. [18] | CPADF |
|---|---|---|---|---|
| HD (%) | 70 | 68.75 | NA | 79.4 |
| DR (%) | 86 | 93.75 | NA | 96 |
| FPR (%) | 16 | 25 | NA | 16.6 |
| RMSE | NA | 0.33 | 0.47 | 0.29 |
| Anomaly Type | Energy theft | Grid overloading, Energy theft | Grid overloading, Energy theft | Grid overloading, Load fluctuation |

To the best of our knowledge, only three papers [17, 18, 10] have used the same dataset [19] for AMI anomaly detection. Ford et al. [17] and Cody et al. [18] used a neural network and a decision tree, respectively, to detect two types of energy fraud, whereas Jokar et al. [10] used SVM-based classification to detect energy theft. Since the performance of anomaly detection depends on the accuracy of energy prediction, we first compare the energy prediction performance of CPADF with Ford et al. [17] and Cody et al. [18]. For the sake of fairness, we consider the same experiments used in [17, 18]. The first experiment evaluates to what extent the regression model can predict electricity consumption for the same month one year after the training set: as in [17, 18], we used August 2009 for training and August 2010 for validation. Experiment 2 examines the ability to predict electricity consumption for the week following several training weeks; as in [17, 18], we considered weeks from September. To evaluate the ability to predict electricity consumption within the same weather season, experiment 3 uses electricity consumption from June 2010 for training and July of the same year for validation. The results of the three experiments are presented in Figure 9; CPADF provides the lowest root mean squared error in all three experiments.
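The protocol of the first experiment (train on one month, validate on the same month a year later, report RMSE) can be sketched as follows. The predictor here is a deliberately simple hour-of-day mean used as a stand-in, not the regression decision tree of CPADF nor the models of [17, 18]; the helper names and the readings are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime


def select_month(readings, year, month):
    """Keep the (timestamp, kWh) readings that fall in the given month."""
    return [(ts, kwh) for ts, kwh in readings
            if ts.year == year and ts.month == month]


def fit_hourly_mean(train):
    """Stand-in model: mean consumption for each hour of the day."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, kwh in train:
        sums[ts.hour] += kwh
        counts[ts.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


def rmse_on(model, validation):
    """RMSE of the model on validation readings.

    Every validation hour must also appear in the training data.
    """
    squared = [(kwh - model[ts.hour]) ** 2 for ts, kwh in validation]
    return math.sqrt(sum(squared) / len(squared))


# Made-up readings for a single customer.
readings = [
    (datetime(2009, 8, 1, 7), 1.0),
    (datetime(2009, 8, 2, 7), 2.0),
    (datetime(2010, 7, 1, 7), 9.0),  # ignored: outside both months
    (datetime(2010, 8, 1, 7), 2.5),
]
model = fit_hourly_mean(select_month(readings, 2009, 8))
error = rmse_on(model, select_month(readings, 2010, 8))
```

Swapping the month arguments reproduces the shape of experiment 3 (June 2010 for training, July 2010 for validation) without changing the evaluation code.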

Figure 9: Comparison of prediction error between CPADF and the state of the art.

Table 8 compares the overall anomaly detection performance of CPADF with that of [17], [18], and [10]. CPADF provides the best detection rate and the lowest prediction error (RMSE); however, its FPR is 0.6 percentage points higher than that of [10]. The system proposed in [10] is built using synthetic malicious samples, which may cause the FNR to increase when the malicious pattern changes, because the classifier would not detect attack types that deviate significantly from the synthetic malicious samples used for training.

7 Conclusion

In this paper, we considered grid overloading cyberattacks in the context of smart grid AMI. These cyberattacks aim at increasing energy usage and load fluctuation to disturb the power grid and cause a large-area blackout. After analyzing them, we proposed CPADF, a distributed anomaly detection system based on regression decision trees. CPADF relies on the predictability of smart home and neighborhood consumption patterns. We showed that CPADF can detect grid overloading cyberattacks regardless of the strategy employed by the attacker and with an optimal detection delay. The simulation results on a real dataset of 500 customers demonstrate that CPADF provides a high detection rate and a low false positive rate with a short running time and low memory requirements. As future work, we plan to explore more cyberattacks and to improve the anomaly detection algorithm using more sophisticated machine learning methods.

References