-
Gravitational wave: generation and detection techniques
Authors:
Saibal Ray,
R. Bhattacharya,
Sanjay K. Sahay,
Abdul Aziz,
Amit Das
Abstract:
In this paper, we review the theoretical basis for the generation of gravitational waves and the techniques used to detect them. To this end, we first present the mathematical background of general relativity, from which Einstein conceived the idea of gravitational waves. We then give the classification scheme of gravitational waves: (i) continuous gravitational waves, (ii) compact binary inspiral gravitational waves, and (iii) stochastic gravitational waves. The necessary mathematical treatment of gravitational waves from binaries is also given, followed by the detection of gravitational waves based on the frequency classification. Ground-based observatories as well as space-borne gravitational wave detectors are discussed at length. We provide an overview of inflationary gravitational waves. In connection with data analysis by matched filtering, we highlight a few topics, e.g. (i) random noise, (ii) power spectrum, (iii) shot noise, and (iv) Gaussian noise. Optimal detection statistics for gravitational wave detection are also discussed, along with the need for the matched filter and deep learning.
Submitted 28 December, 2023;
originally announced December 2023.
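The matched filtering mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the paper: the toy chirp template, the function names (`make_template`, `matched_filter`, `best_offset`), and the assumption of white, unit-variance Gaussian noise (so that no spectral whitening is needed) are all illustrative choices.

```python
import math
import random


def make_template(n=64, f0=0.05, f1=0.25):
    """Toy chirp: a sinusoid whose frequency rises linearly (illustrative only)."""
    samples = []
    phase = 0.0
    for i in range(n):
        freq = f0 + (f1 - f0) * i / (n - 1)  # cycles per sample
        phase += 2.0 * math.pi * freq
        samples.append(math.sin(phase))
    return samples


def matched_filter(data, template):
    """Correlate the template with the data at every offset.

    Assumes white noise of unit variance, so the output is simply the
    sliding inner product divided by the template norm.
    """
    n = len(template)
    norm = math.sqrt(sum(h * h for h in template))
    return [
        sum(data[k + i] * template[i] for i in range(n)) / norm
        for k in range(len(data) - n + 1)
    ]


def best_offset(filter_output):
    """Index of the largest matched-filter output."""
    return max(range(len(filter_output)), key=lambda k: filter_output[k])


# Demo: bury a scaled copy of the template in Gaussian noise and search for it.
rng = random.Random(0)
template = make_template()
data = [rng.gauss(0.0, 1.0) for _ in range(512)]
true_offset = 200  # hypothetical arrival time, in samples
for i, h in enumerate(template):
    data[true_offset + i] += 3.0 * h

output = matched_filter(data, template)
print("true offset:", true_offset, "recovered offset:", best_offset(output))
```

With coloured detector noise, the inner product would instead be weighted by the inverse noise power spectrum; the sketch skips that step.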
-
Deep Reinforcement Learning for Cybersecurity Threat Detection and Protection: A Review
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
The cybersecurity threat landscape has lately become overly complex. Threat actors leverage weaknesses in network and endpoint security in a highly coordinated manner to perpetrate sophisticated attacks that could bring down the entire network and many critical hosts in it. Increasingly advanced deep and machine learning-based solutions have been used in threat detection and protection. The application of these techniques has been reviewed well in the scientific literature. Deep Reinforcement Learning has shown great promise in developing AI-based solutions for areas that had earlier required advanced human cognizance. Different techniques and algorithms under deep reinforcement learning have shown great promise in applications ranging from games to industrial processes, where it is claimed to augment systems with general AI capabilities. These algorithms have recently also been used in cybersecurity, especially in threat detection and endpoint protection, where they are showing state-of-the-art results. Unlike supervised machine and deep learning, deep reinforcement learning is used in more diverse ways and is empowering many innovative applications in the threat defense landscape. However, there does not exist any comprehensive review of these unique applications and accomplishments. Therefore, in this paper, we intend to fill this gap and provide a comprehensive review of the different applications of deep reinforcement learning in cybersecurity threat detection and protection.
Submitted 6 June, 2022;
originally announced June 2022.
-
Privacy-Preserving Mutual Authentication and Key Agreement Scheme for Multi-Server Healthcare System
Authors:
Trupil Limbasiya,
Sanjay K. Sahay,
Bharath Sridharan
Abstract:
The use of different technologies and smart devices helps people to get medical services remotely, with multiple benefits. As a result, critical and sensitive data is exchanged between a user and a doctor. When health data is transmitted over a common channel, it becomes essential to preserve various privacy and security properties in the system. Further, the number of users of remote services is increasing exponentially day by day, and thus a single server is not adequate to handle all users because of verification overhead, server failure, and scalability issues. Researchers have therefore proposed various authentication protocols for multi-server architectures, but most of them are vulnerable to different security attacks and require high computational resources during implementation. To tackle privacy and security issues using fewer computational resources, we propose a privacy-preserving mutual authentication and key agreement protocol for a multi-server healthcare system. We discuss the proposed scheme's security analysis and performance results to understand its security strengths and computational resource requirements, respectively. Further, we compare its security and performance results with recent relevant authentication protocols.
Submitted 13 October, 2021;
originally announced October 2021.
-
ADVERSARIALuscator: An Adversarial-DRL Based Obfuscator and Metamorphic Malware SwarmGenerator
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
Advanced metamorphic malware and ransomware, by using obfuscation, could alter their internal structure with every attack. If such malware could intrude into any of the IoT networks, then even if the original malware instance gets detected, by that time it can still infect the entire network. It is challenging to obtain training data for such evasive malware. Therefore, in this paper, we present ADVERSARIALuscator, a novel system that uses specialized Adversarial-DRL to obfuscate malware at the opcode level and create multiple metamorphic instances of the same. To the best of our knowledge, ADVERSARIALuscator is the first-ever system that adopts the Markov Decision Process-based approach to convert and find a solution to the problem of creating individual obfuscations at the opcode level. This is important as the machine language level is the lowest level at which functionality could be preserved so as to mimic an actual attack effectively. ADVERSARIALuscator is also the first-ever system in the area of cyber security to use deep reinforcement learning agents capable of efficient continuous action control, such as Proximal Policy Optimization. Experimental results indicate that ADVERSARIALuscator could raise the metamorphic probability of a corpus of malware by >0.45. Additionally, more than 33% of metamorphic instances generated by ADVERSARIALuscator were able to evade the most potent IDS. Hence ADVERSARIALuscator could be used to generate data representative of a swarm of very potent and coordinated AI-based metamorphic malware attacks. The data and simulations so generated could be used to bolster the defenses of an IDS against an actual AI-based metamorphic attack from advanced malware and ransomware.
Submitted 23 September, 2021;
originally announced September 2021.
-
LSTM Hyper-Parameter Selection for Malware Detection: Interaction Effects and Hierarchical Selection Approach
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
Long-Short-Term-Memory (LSTM) networks have shown great promise in artificial intelligence (AI) based language modeling. Recently, LSTM networks have also become popular for designing AI-based Intrusion Detection Systems (IDS). However, their applicability in IDS has been studied largely in the default settings used in language models, whereas security applications offer distinct conditions and hence warrant careful consideration when applying such recurrent networks. Therefore, we conducted one of the most exhaustive works on LSTM hyper-parameters for IDS and experimented with approx. 150 LSTM configurations to determine the relative importance of its hyper-parameters, their interaction effects, and the optimal selection approach for designing an IDS. We conducted multiple analyses of the results of these experiments and empirically controlled for the interaction effects of the covariate levels of different hyper-parameters. We found that for security applications, especially for designing an IDS, neither is the relative importance applicable to language models valid, nor is the standard linear method for hyper-parameter selection ideal. We ascertained that the interaction effect plays a crucial role in determining the relative importance of hyper-parameters. We also discovered that after controlling for the interaction effect, the correct order of relative importance for LSTMs for an IDS is batch-size, followed by dropout ratio and padding. The findings are significant because when LSTM was first used for language models, the focus had mostly been on increasing the number of layers to enhance performance.
Submitted 23 September, 2021;
originally announced September 2021.
-
DRo: A data-scarce mechanism to revolutionize the performance of Deep Learning based Security Systems
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
Supervised Deep Learning requires plenty of labeled data to converge, and hence to perform optimally for task-specific learning. Therefore, we propose a novel mechanism named DRo (for Deep Routing) for data-scarce domains like security. The DRo approach builds upon some of the recent developments in Deep-Clustering. In particular, it exploits the self-augmented training mechanism using synthetically generated local perturbations. DRo not only allays the challenges with sparse-labeled data but also offers many unique advantages. We also developed a system named DRoID that uses the DRo mechanism to enhance the performance of an existing Malware Detection System that uses low-information features, namely Android implicit Intents, as its only features. We conducted experiments on DRoID using a popular and standardized Android malware dataset and found that the DRo mechanism could successfully reduce the false alarms generated by the downstream classifier by 67.9% while simultaneously boosting its accuracy by 11.3%. This is significant not only because the gains achieved are unparalleled but also because the features used were never considered rich enough to train a classifier on; hence no decent performance could be reported to date by any malware classification system using these features in isolation. Owing to the results achieved, the DRo mechanism claims a dominant position amongst all known systems that aim to enhance the classification performance of deep learning models with sparse-labeled data.
Submitted 12 September, 2021;
originally announced September 2021.
-
Identification of Significant Permissions for Efficient Android Malware Detection
Authors:
Hemant Rathore,
Sanjay K. Sahay,
Ritvik Rajvanshi,
Mohit Sewak
Abstract:
Since Google unveiled Android OS for smartphones, malware has been thriving along the 3Vs, i.e. volume, velocity, and variety. A recent report indicates that one out of every five business/industry mobile applications leaks sensitive personal data. Traditional signature/heuristic-based malware detection systems are unable to cope with current malware challenges, and this threatens the Android ecosystem. Therefore, researchers have recently started exploring machine learning and deep learning based malware detection systems. In this paper, we performed a comprehensive feature analysis to identify the significant Android permissions and propose an efficient Android malware detection system using machine learning and deep neural networks. We constructed a set of $16$ permissions ($8\%$ of the total set) derived from variance threshold, auto-encoders, and principal component analysis to build a malware detection engine that consumes less training and testing time without significant compromise on the model accuracy. Our experimental results show that the Android malware detection model based on the random forest classifier is the most balanced and achieves the highest area under curve score of $97.7\%$, which is better than the current state-of-the-art systems. We also observed that deep neural networks attain comparable accuracy to the baseline results but with a massive computational penalty.
Submitted 28 February, 2021;
originally announced March 2021.
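Of the three feature-reduction methods named in the abstract above, the variance threshold is the simplest to illustrate. The sketch below is not the paper's implementation: the function name `variance_threshold`, the tiny 0/1 permission matrix, and the threshold values are hypothetical, and the auto-encoder and principal component analysis steps are not shown.

```python
def variance_threshold(rows, threshold):
    """Return indices of columns whose population variance exceeds `threshold`.

    `rows` is a list of equal-length feature vectors, here 0/1 flags saying
    whether an app requests a given permission.
    """
    n = len(rows)
    kept = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        if variance > threshold:
            kept.append(j)
    return kept


# Hypothetical data: 4 apps (rows) x 4 permissions (columns).
apps = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

# Columns 0 and 3 are constant, so they carry no information and are dropped.
print(variance_threshold(apps, 0.0))
```

A permission requested by every app, or by none, cannot separate malware from benign apps, which is why near-constant columns are the first to go.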
-
Detection of Malicious Android Applications: Classical Machine Learning vs. Deep Neural Network Integrated with Clustering
Authors:
Hemant Rathore,
Sanjay K. Sahay,
Shivin Thukral,
Mohit Sewak
Abstract:
Today the anti-malware community is facing challenges due to the ever-increasing sophistication and volume of malware attacks developed by adversaries. Traditional malware detection mechanisms are not able to cope with next-generation malware attacks. Therefore, in this paper, we propose effective and efficient Android malware detection models based on machine learning and deep learning integrated with clustering. We performed a comprehensive study of different feature reduction, classification, and clustering algorithms over various performance metrics to construct the Android malware detection models. Our experimental results show that malware detection models developed using Random Forest eclipsed deep neural networks and other classifiers on the majority of performance metrics. The baseline Random Forest model without any feature reduction achieved the highest AUC of 99.4%. Also, segregating the vector space using clustering integrated with Random Forest further boosted the AUC to 99.6% in one cluster and enabled direct detection of Android malware in another cluster, thus reducing the curse of dimensionality. Additionally, we found that feature reduction in detection models improves the model efficiency (training and testing time) manyfold without much penalty on the effectiveness of the detection model.
Submitted 28 February, 2021;
originally announced March 2021.
-
DRLDO: A novel DRL based De-ObfuscationSystem for Defense against Metamorphic Malware
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
In this paper, we propose a novel mechanism to normalize metamorphic and obfuscated malware down at the opcode level and hence create an advanced metamorphic malware de-obfuscation and defense system. We name this system DRLDO, for Deep Reinforcement Learning based De-Obfuscator. With the inclusion of the DRLDO as a sub-component, an existing Intrusion Detection System could be augmented with defensive capabilities against 'zero-day' attacks from obfuscated and metamorphic variants of existing malware. This gains importance, not only because there exists no system to date that uses advanced DRL to intelligently and automatically normalize obfuscation down even to the opcode level, but also because the DRLDO system does not mandate any changes to the existing IDS. The DRLDO system does not even mandate the IDS' classifier to be retrained with any new dataset containing obfuscated samples. Hence DRLDO could be easily retrofitted into any existing IDS deployment. We designed, developed, and conducted experiments on the system to evaluate the same against multiple-simultaneous attacks from obfuscations generated from malware samples from a standardized dataset that contains multiple generations of malware. Experimental results prove that DRLDO was able to successfully make the otherwise un-detectable obfuscated variants of the malware detectable by an existing pre-trained malware classifier. The detection probability was raised well above the cut-off mark to 0.6 for the classifier to detect the obfuscated malware unambiguously. Further, the de-obfuscated variants generated by DRLDO achieved a very high correlation (of 0.99) with the base malware. This observation validates that the DRLDO system is actually learning to de-obfuscate and not exploiting a trivial trick.
Submitted 1 February, 2021;
originally announced February 2021.
-
Robust Android Malware Detection System against Adversarial Attacks using Q-Learning
Authors:
Hemant Rathore,
Sanjay K. Sahay,
Piyush Nikam,
Mohit Sewak
Abstract:
The current state-of-the-art Android malware detection systems are based on machine learning and deep learning models. Despite having superior performance, these models are susceptible to adversarial attacks. Therefore in this paper, we developed eight Android malware detection models based on machine learning and deep neural network and investigated their robustness against adversarial attacks. For this purpose, we created new variants of malware using Reinforcement Learning, which will be misclassified as benign by the existing Android malware detection models. We propose two novel attack strategies, namely single policy attack and multiple policy attack using reinforcement learning for white-box and grey-box scenario respectively. Putting ourselves in the adversary's shoes, we designed adversarial attacks on the detection models with the goal of maximizing fooling rate, while making minimum modifications to the Android application and ensuring that the app's functionality and behavior do not change. We achieved an average fooling rate of 44.21% and 53.20% across all the eight detection models with a maximum of five modifications using a single policy attack and multiple policy attack, respectively. The highest fooling rate of 86.09% with five changes was attained against the decision tree-based model using the multiple policy approach. Finally, we propose an adversarial defense strategy that reduces the average fooling rate by threefold to 15.22% against a single policy attack, thereby increasing the robustness of the detection models i.e. the proposed model can effectively detect variants (metamorphic) of malware. The experimental analysis shows that our proposed Android malware detection system using reinforcement learning is more robust against adversarial attacks.
Submitted 27 January, 2021;
originally announced January 2021.
-
Assessment of the Relative Importance of different hyper-parameters of LSTM for an IDS
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
Recurrent deep learning language models like the LSTM are often used to provide advanced cyber-defense for high-value assets. The underlying assumption for using LSTM networks for malware-detection is that the op-code sequence of malware could be treated as a (spoken) language representation. There are differences between any spoken-language (sequence of words/sentences) and the machine-language (sequence of op-codes). In this paper, we demonstrate that due to these inherent differences, an LSTM model with its default configuration as tuned for a spoken-language, may not work well to detect malware (using its op-code sequence) unless the network's essential hyper-parameters are tuned appropriately. In the process, we also determine the relative importance of all the different hyper-parameters of an LSTM network as applied to malware detection using their op-code sequence representations. We experimented with different configurations of LSTM networks, and altered hyper-parameters like the embedding-size, number of hidden layers, number of LSTM-units in a hidden layer, pruning/padding-length of the input-vector, activation-function, and batch-size. We discovered that owing to the enhanced complexity of the malware/machine-language, the performance of an LSTM network configured for an Intrusion Detection System, is very sensitive towards the number-of-hidden-layers, input sequence-length, and the choice of the activation-function. Also, for (spoken) language-modeling, the recurrent architectures by-far outperform their non-recurrent counterparts. Therefore, we also assess how sequential DL architectures like the LSTM compare against their non-sequential counterparts like the MLP-DNN for the purpose of malware-detection.
Submitted 26 December, 2020;
originally announced December 2020.
-
DOOM: A Novel Adversarial-DRL-Based Op-Code Level Metamorphic Malware Obfuscator for the Enhancement of IDS
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
We designed and developed DOOM (Adversarial-DRL based Opcode level Obfuscator to generate Metamorphic malware), a novel system that uses adversarial deep reinforcement learning to obfuscate malware at the op-code level for the enhancement of IDS. The ultimate goal of DOOM is not to put a potent weapon in the hands of cyber-attackers, but to create defensive mechanisms against advanced zero-day attacks. Experimental results indicate that the obfuscated malware created by DOOM could effectively mimic multiple-simultaneous zero-day attacks. To the best of our knowledge, DOOM is the first system that could generate obfuscated malware detailed down to the individual op-code level. DOOM is also the first-ever system to use efficient continuous action control based deep reinforcement learning in the area of malware generation and defense. Experimental results indicate that over 67% of the metamorphic malware generated by DOOM could easily evade detection by even the most potent IDS. This achievement is significant because it means that even an IDS augmented with an advanced routing sub-system can be easily evaded by the malware generated by DOOM.
Submitted 16 October, 2020;
originally announced October 2020.
-
DeepIntent: ImplicitIntent based Android IDS with E2E Deep Learning architecture
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
The Intent in Android plays an important role in inter-process and intra-process communications. The implicit Intents that an application could accept are declared in its manifest and are amongst the easiest features to extract from an apk. Implicit Intents could even be extracted online and in real-time. So far, neither has the feasibility of developing an Intrusion Detection System solely on implicit Intents been explored, nor are any benchmarks available for a malware classifier that is based on implicit Intents alone. We demonstrate that although the Intent is implicit and well declared, it can provide very intuitive insights to distinguish malicious from non-malicious applications. We conducted exhaustive experiments with over 40 different end-to-end Deep Learning configurations of Auto-Encoders and Multi-Layer-Perceptron to create a benchmark for a malware classifier that works exclusively on implicit Intents. Using the results from the experiments, we create an intrusion detection system using only the implicit Intents and an end-to-end Deep Learning architecture. We obtained an area-under-curve statistic of 0.81 and an accuracy of 77.2%, along with a false-positive rate of 0.11 on the Drebin dataset.
Submitted 16 October, 2020;
originally announced October 2020.
-
A Novel Spatial-Spectral Framework for the Classification of Hyperspectral Satellite Imagery
Authors:
Shriya TP Gupta,
Sanjay K Sahay
Abstract:
Hyper-spectral satellite imagery is now widely being used for accurate disaster prediction and terrain feature classification. However, in such classification tasks, most of the present approaches use only the spectral information contained in the images. Therefore, in this paper, we present a novel framework that takes into account both the spectral and spatial information contained in the data for land cover classification. For this purpose, we use the Gaussian Maximum Likelihood (GML) and Convolutional Neural Network methods for the pixel-wise spectral classification and then, using segmentation maps generated by the Watershed algorithm, we incorporate the spatial contextual information into our model with a modified majority vote technique. The experimental analyses on two benchmark datasets demonstrate that our proposed methodology performs better than the earlier approaches by achieving an accuracy of 99.52% and 98.31% on the Pavia University and the Indian Pines datasets respectively. Additionally, our GML based approach, a non-deep learning algorithm, shows comparable performance to the state-of-the-art deep learning techniques, which indicates the importance of the proposed approach for performing a computationally efficient classification of hyper-spectral imagery.
Submitted 22 July, 2020;
originally announced August 2020.
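The abstract above combines pixel-wise spectral labels with a segmentation map through a modified majority vote. The abstract does not specify the modification, so the sketch below shows only a plain per-segment majority vote. The function name `majority_vote`, the flattened-list data layout, and the sample labels are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(pixel_labels, segment_ids):
    """Relabel every pixel with the most common label inside its segment.

    `pixel_labels[i]` is the spectral classifier's label for pixel i and
    `segment_ids[i]` is the segment (e.g. from a watershed map) containing it.
    """
    votes = defaultdict(Counter)
    for label, segment in zip(pixel_labels, segment_ids):
        votes[segment][label] += 1
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [winner[segment] for segment in segment_ids]


# Hypothetical flattened image: 7 pixels in 2 segments, one spectral outlier.
labels = ["grass", "grass", "road", "grass", "road", "road", "road"]
segments = [0, 0, 0, 0, 1, 1, 1]
print(majority_vote(labels, segments))
```

The isolated "road" pixel inside segment 0 is outvoted by its neighbours, which is how spatial context smooths pixel-wise spectral errors.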
-
Secure and Energy-Efficient Key-Agreement Protocol for Multi-Server Architecture
Authors:
Trupil Limbasiya,
Sanjay K. Sahay
Abstract:
Authentication schemes are practised globally to verify the legitimacy of users and servers for the exchange of data in different facilities. Generally, the server verifies a user before providing resources for different purposes. But because of the large network system, the authentication process has become complex, and therefore different authentication protocols have been proposed from time to time for the multi-server architecture. However, most of the protocols are vulnerable to various security attacks and their performance is not efficient. In this paper, we propose a secure and energy-efficient remote user authentication protocol for multi-server systems. The results show that the proposed protocol is comparatively ~44% more efficient and needs ~38% less communication cost. We also demonstrate that with only two-factor authentication, the proposed protocol is more secure than earlier related authentication schemes.
Submitted 19 April, 2020;
originally announced April 2020.
-
Secure Communication Protocol for Smart Transportation Based on Vehicular Cloud
Authors:
Trupil Limbasiya,
Debasis Das,
Sanjay K. Sahay
Abstract:
The pioneering concept of connected vehicles has transformed the thinking of researchers and entrepreneurs by collecting relevant data from nearby objects. However, this data is useful for a specific vehicle only. Moreover, vehicles receive a large amount of data (e.g., traffic, safety, and multimedia infotainment) on the road. Vehicles therefore need adequate storage for this data, but it is infeasible to have a large memory in each vehicle. Hence, the vehicular cloud computing (VCC) framework came into the picture to provide a storage facility by connecting a road-side-unit (RSU) with the vehicular cloud (VC). Here, data should be saved in an encrypted form to preserve security, but searching for information over encrypted data is a challenge. Further, many vehicular communication schemes are inefficient for data transmission because of their poor performance, and they are vulnerable to different fundamental security attacks. Accordingly, on-device performance is critical, but data damage and secure on-time connectivity are also significant challenges in a public environment. Therefore, we propose reliable data transmission protocols for this cutting-edge architecture to search data in the storage, resist various security attacks, and provide better performance results. The proposed data transmission protocol is thus useful in diverse smart city applications (business, safety, and entertainment) for the benefit of society.
Submitted 4 January, 2020; v1 submitted 30 December, 2019;
originally announced December 2019.
-
A Survey on the Detection of Android Malicious Apps
Authors:
Sanjay K. Sahay,
Ashu Sharma
Abstract:
Android-based smart devices are growing exponentially, and due to the ubiquity of the Internet, these devices are globally connected to different devices/networks. Their popularity, attractive features, and mobility lead malware creators to put a number of malicious apps in the market to disrupt and annoy the victims. Various techniques are proposed from time to time to identify malicious apps. However, it appears that malware developers are always ahead of the anti-malware group, and the techniques proposed by the anti-malware groups are not sufficient to counter the advanced malicious apps. Therefore, to understand the various techniques proposed/used for the identification of Android malicious apps, in this paper, we present a survey of the work done by researchers in this field.
Submitted 29 May, 2019;
originally announced May 2019.
-
An Efficient Detection of Malware by Naive Bayes Classifier Using GPGPU
Authors:
Sanjay K. Sahay,
Mayank Chaudhari
Abstract:
Due to the continuous increase in the number of malware (according to the AV-Test institute, a total of ~8 x 10^8 malware are already known, and every day they register ~2.5 x 10^4 malware) and files in computational devices, it is very important to design a system that can detect new or previously unseen malware not only effectively but also efficiently to prevent/minimize the damages. Therefore, this paper presents a novel group-wise approach for the efficient detection of malware by parallelizing the classification using the power of GPGPU and shows that by using the Naive Bayes classifier the detection speed-up can be boosted up to 200x. The investigation also shows that the classification time increases significantly with the number of features.
Submitted 29 May, 2019;
originally announced May 2019.
-
Android Malicious Application Classification Using Clustering
Authors:
Hemant Rathore,
Sanjay K. Sahay,
Palash Chaturvedi,
Mohit Sewak
Abstract:
Android malware has been growing at an exponential pace and has become a serious threat to mobile users. It appears that most anti-malware still relies on signature-based detection systems, which are generally slow and often unable to detect advanced obfuscated malware. Hence, from time to time various authors have proposed different machine learning solutions to identify sophisticated malware. However, it appears that detection accuracy can be improved by using the clustering method. Therefore, in this paper, we propose a novel scalable and effective clustering method to improve the detection accuracy of malicious Android applications and obtain a better overall accuracy (98.34%) with the random forest classifier compared to the regular method, i.e., taking the data altogether to detect the malware. However, as far as true positives and true negatives are concerned, with the clustering method the best true positive rate is obtained by the decision tree (97.59%) and the best true negative rate by the support vector machine (99.96%), which are almost the same as the random forest true positive (97.30%) and true negative (99.38%) rates respectively. The overall accuracy of the random forest is high because the true positive rate of the support vector machine and the true negative rate of the decision tree are significantly lower than those of the random forest.
Submitted 21 April, 2019;
originally announced April 2019.
-
Malware Detection using Machine Learning and Deep Learning
Authors:
Hemant Rathore,
Swati Agarwal,
Sanjay K. Sahay,
Mohit Sewak
Abstract:
Research shows that over the last decade, malware has been growing exponentially, causing substantial financial losses to various organizations. Different anti-malware companies have been proposing solutions to defend against attacks from this malware. The velocity, volume, and complexity of malware are posing new challenges to the anti-malware community. Current state-of-the-art research shows that recently, researchers and anti-virus organizations started applying machine learning and deep learning methods for malware analysis and detection. We have used opcode frequency as a feature vector and applied unsupervised learning in addition to supervised learning for malware classification. The focus of this tutorial is to present our work on detecting malware with 1) various machine learning algorithms and 2) deep learning models. Our results show that the Random Forest outperforms the Deep Neural Network with opcode frequency as a feature. Also, in feature reduction, Deep Auto-Encoders are overkill for the dataset, and elementary functions like Variance Threshold perform better than others. In addition to the proposed methodologies, we will also discuss the additional issues and the unique challenges in the domain, open research problems, limitations, and future directions.
Submitted 4 April, 2019;
originally announced April 2019.
-
Group-wise classification approach to improve Android malicious apps detection accuracy
Authors:
Ashu Sharma,
Sanjay K. Sahay
Abstract:
Among fast-growing smart devices, Android is the most popular OS, and due to its attractive features, mobility, and ease of use, these devices hold sensitive information such as personal data, browsing history, shopping history, financial details, etc. Therefore, any security gap in these devices means that the information stored in or accessed through the smart devices is at high risk of being breached by malware. These malware are continuously growing and are also used for military espionage, disrupting the industry, power grids, etc. To detect these malware, traditional signature matching techniques are widely used. However, such strategies are not capable of detecting the advanced Android malicious apps because malware developers use several obfuscation techniques. Hence, researchers are continuously addressing the security issues in Android-based smart devices. Therefore, in this paper, using the Drebin benchmark malware dataset, we experimentally demonstrate how to improve the detection accuracy by analyzing the apps after grouping the collected data based on the permissions, and we achieve 97.15% overall average accuracy. Our results outperform the accuracy obtained without grouping data (79.27%, 2017), Arp et al. (94%, 2014), Annamalai et al. (84.29%, 2016), Bahman Rashidi et al. (82%, 2017) and Ali Feizollah et al. (95.5%, 2017). The analysis also shows that among the groups, the Microphone group has the lowest detection accuracy while Calendar group apps are detected with the highest accuracy, and for the best performance, one shall take 80-100 features.
Submitted 3 April, 2019;
originally announced April 2019.
-
Detection of Advanced Malware by Machine Learning Techniques
Authors:
Sanjay Sharma,
C. Rama Krishna,
Sanjay K. Sahay
Abstract:
In today's digital world most anti-malware tools are signature based, which makes them ineffective at detecting advanced unknown malware, viz. metamorphic malware. In this paper, we study the frequency of opcode occurrence to detect unknown malware by using machine learning techniques. For this purpose, we have used the Kaggle Microsoft malware classification challenge dataset. The top 20 features obtained from the Fisher score, information gain, gain ratio, chi-square and symmetric uncertainty feature selection methods are compared. We also studied multiple classifiers available in WEKA, a GUI-based machine learning tool, and found that five of them (Random Forest, LMT, NBT, J48 Graft and REPTree) detect malware with almost 100% accuracy.
Submitted 7 March, 2019;
originally announced March 2019.
-
Comparison of Deep Learning and the Classical Machine Learning Algorithm for the Malware Detection
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
Recently, Deep Learning has been showing promising results in various Artificial Intelligence applications like image recognition, natural language processing, language modeling, neural machine translation, etc. Although, in general, it is computationally more expensive than classical machine learning techniques, its results are found to be more effective in some cases. Therefore, in this paper, we investigated and compared a Deep Learning architecture called Deep Neural Network (DNN) with the classical Random Forest (RF) machine learning algorithm for malware classification. We studied the performance of the classical RF and of DNNs with 2-, 4- and 7-layer architectures on four different feature sets, and found that irrespective of the feature inputs, the classical RF outperforms the DNN in accuracy.
Submitted 16 September, 2018;
originally announced September 2018.
-
An investigation of a deep learning based malware detection system
Authors:
Mohit Sewak,
Sanjay K. Sahay,
Hemant Rathore
Abstract:
We investigate a Deep Learning based system for malware detection. In the investigation, we experiment with different combinations of Deep Learning architectures, including Auto-Encoders and Deep Neural Networks with varying layers, over the Malicia malware dataset, on which earlier studies have obtained an accuracy of 98% with an acceptable False Positive Rate of 1.07%. But these results were obtained using extensive man-made custom domain features and investing corresponding feature engineering and design efforts. Our proposed approach, besides improving on the previous best results (99.21% accuracy and a False Positive Rate of 0.19%), indicates that Deep Learning based systems could deliver an effective defense against malware. Since Deep Learning is good at automatically extracting higher conceptual features from the data, Deep Learning based systems could provide an effective, general and scalable mechanism for detection of existing and unknown malware.
Submitted 16 September, 2018;
originally announced September 2018.
-
An investigation of the classifiers to detect android malicious apps
Authors:
Ashu Sharma,
Sanjay K. Sahay
Abstract:
Android devices are growing exponentially and are connected through the internet, accessing billions of online websites. The popularity of these devices encourages malware developers to penetrate the market with malicious apps to annoy and disrupt the victim. Although different approaches have been discussed for the detection of malicious apps, the proposed approaches do not suffice to detect advanced malware and limit/prevent the damages. Among these, very few approaches are based on opcode occurrence to classify the malicious apps. Therefore, this paper investigates five classifiers using opcode occurrence as the prominent feature for the detection of malicious apps. For the analysis, we use the WEKA tool and find that the FT detection accuracy (79.27%) is the best among the investigated classifiers. However, the true positive rate, i.e., the malware detection rate, is highest (99.91%) for RF and fluctuates least with the number of prominent features compared to the other studied classifiers. The analysis shows that overall accuracy is majorly affected by the false positives of the classifier.
Submitted 23 February, 2018;
originally announced February 2018.
-
A Communication Efficient and Scalable Distributed Data Mining for the Astronomical Data
Authors:
Aruna Govada,
Sanjay K. Sahay
Abstract:
In 2020, ~60PB of archived data will be accessible to astronomers. But analyzing such a huge amount of data will be a challenging task. This is basically due to the computational model used, in which the data is downloaded from complex geographically distributed archives to a central site and then analyzed on local systems. Because the data has to be downloaded to the central site, the network bandwidth (BW) limitation will be a hindrance to scientific discoveries. Also, analyzing this PB-scale data on local machines in a centralized manner is challenging. The virtual observatory (VO) is a step towards solving this problem; however, it does not provide a data mining model. Adding a distributed data mining layer to the VO can be the solution, in which astronomers download the knowledge instead of the raw data and thereafter can either reconstruct the data from the downloaded knowledge or use the knowledge directly for further analysis. Therefore, in this paper, we present Distributed Load Balancing Principal Component Analysis for optimally distributing the computation among the available nodes to minimize the transmission cost and downloading cost for the end user. The experimental analysis is done with Fundamental Plane (FP) data, Gadotti data and complex Mfeat data. In terms of transmission cost, our approach performs better than Qi et al. and Yue et al. The analysis shows that with the complex Mfeat data, ~90% of the downloading cost can be reduced for the end user with negligible loss in accuracy.
Submitted 23 June, 2016;
originally announced June 2016.
-
Covariance estimation for vertically partitioned data in a distributed environment
Authors:
Aruna Govada,
Sanjay K. Sahay
Abstract:
The major sources of abundant data are constantly expanding with the available data collection methodologies in various applications: medical, insurance, scientific, bio-informatics and business. These data sets may be distributed geographically and rich in size as well as dimensions. To analyze these data sets and find the hidden patterns, it is required to download the data to a centralized site, which is a challenging task in terms of the limited bandwidth available and is also computationally expensive. The covariance matrix is one of the methods to estimate the relation between any two dimensions. In this paper, we propose a communication efficient algorithm to estimate the covariance matrix in a distributed manner. The global covariance matrix is computed by merging the local covariance matrices using a distributed approach. The results show that it is exactly the same as the centralized method, with good speed-up in terms of computation. The speed-up is due to the parallel construction of local covariances and the distribution of the cross-covariances among the nodes so that the load is balanced. The results are analyzed by considering the Mfeat data set on various partitions, which also addresses scalability.
Submitted 23 June, 2016;
originally announced June 2016.
-
Improving the detection accuracy of unknown malware by partitioning the executables in groups
Authors:
Ashu Sharma,
Sanjay K. Sahay,
Abhishek Kumar
Abstract:
Detection of unknown malware with high accuracy is always a challenging task. Therefore, in this paper, we study the classification of unknown malware by two methods. In the first/regular method, similar to the approaches of other authors [17][16][20], we select the features by taking the whole dataset as one group, and in the second method, we select the features by partitioning the dataset into file size ranges of 5 KB. We find that the second method detects malware ~8.7% more accurately than the first/regular method.
Submitted 22 June, 2016;
originally announced June 2016.
-
Grouping the executables to detect malware with high accuracy
Authors:
Sanjay K. Sahay,
Ashu Sharma
Abstract:
Metamorphic malware variants with the same malicious behavior (family) can obfuscate themselves to look different from each other. This variation in structure leads to a huge signature database for traditional signature matching techniques to detect them. For effective and efficient detection of malware in large amounts of executables, we need to partition these files into groups which can identify their respective families. In addition, the grouping criteria should be chosen in such a way that they can also be applied to unknown files encountered on computers for classification. This paper discusses the study of malware and benign executables in groups to detect unknown malware with high accuracy. We studied the sizes of malware generated by three popular second generation malware (metamorphic malware) creator kits, viz. G2, PS-MPC and NGVCK, and observed that the size variation between any two malware generated by the same kit is small. Hence, we grouped the executables on the basis of malware sizes by using the Optimal k-Means Clustering algorithm and used the obtained groups to select promising features for training classifiers (Random forest, J48, LMT, FT and NBT) to detect variants of malware or unknown malware. We find that detection of malware on the basis of their respective file sizes gives accuracy up to 99.11% from the classifiers.
Submitted 22 June, 2016;
originally announced June 2016.
-
An effective approach for classification of advanced malware with high accuracy
Authors:
Ashu Sharma,
Sanjay K. Sahay
Abstract:
Combating malware is very important for software/systems security, but protecting software/systems from advanced malware, viz. metamorphic malware, is a challenging task, as it changes its structure/code after each infection. Therefore, in this paper, we present a novel approach to detect advanced malware with high accuracy by analyzing the occurrence of opcodes (features) by grouping the executables. These groups are made on the basis of our earlier studies [1], which found that the difference between the sizes of any two malware generated by the popular advanced malware kits, viz. PS-MPC, G2 and NGVCK, is within 5 KB. On the basis of the obtained promising features, we studied the performance of thirteen classifiers using N-fold cross-validation available in the machine learning tool WEKA. Among these thirteen classifiers, we studied in depth the top five classifiers (Random forest, LMT, NBT, J48 and FT) and obtained more than 96.28% accuracy for the detection of unknown malware, which is better than the maximum detection accuracy (95.9%) reported by Santos et al. (2013). Among these top five classifiers, our approach obtained a detection accuracy of 97.95% with the Random forest.
Submitted 22 June, 2016;
originally announced June 2016.
-
A Novel Approach to Distributed Multi-Class SVM
Authors:
Aruna Govada,
Shree Ranjani,
Aditi Viswanathan,
S. K. Sahay
Abstract:
With data sizes constantly expanding, and with classical machine learning algorithms that analyze such data requiring larger and larger amounts of computation time and storage space, the need to distribute computation and memory requirements among several computers has become apparent. Although substantial work has been done in developing distributed binary SVM algorithms and multi-class SVM algorithms individually, the field of multi-class distributed SVMs remains largely unexplored. This research proposes a novel algorithm that implements the Support Vector Machine over a multi-class dataset and is efficient in a distributed environment (here, Hadoop). The idea is to divide the dataset into half recursively and thus compute the optimal Support Vector Machine for this half during the training phase, much like a divide and conquer approach. While testing, this structure has been effectively exploited to significantly reduce the prediction time. Our algorithm has shown better computation time during the prediction phase than the traditional sequential SVM methods (One vs. One, One vs. Rest) and out-performs them as the size of the dataset grows. This approach also classifies the data with higher accuracy than the traditional multi-class algorithms.
Submitted 7 December, 2015;
originally announced December 2015.
-
Hybrid Approach for Inductive Semi Supervised Learning using Label Propagation and Support Vector Machine
Authors:
Aruna Govada,
Pravin Joshi,
Sahil Mittal,
Sanjay K Sahay
Abstract:
Semi supervised learning methods have gained importance in today's world because of the large expense and time involved in having human experts label the unlabeled data. The proposed hybrid approach uses SVM and Label Propagation to label the unlabeled data. In the process, at each step SVM is trained to minimize the error and thus improve the prediction quality. Experiments are conducted by using SVM and logistic regression (Logreg). Results prove that SVM performs tremendously better than Logreg. The approach is tested using 12 datasets of different sizes ranging from the order of 1000s to the order of 10000s. Results show that the proposed approach outperforms Label Propagation by a large margin, with an F-measure almost twice as high on average. A parallel version of the proposed approach is also designed and implemented; the analysis shows that the training time decreases significantly when the parallel version is used.
Submitted 2 December, 2015;
originally announced December 2015.
-
Distributed Multi Class SVM for Large Data Sets
Authors:
Aruna Govada,
Bhavul Gauri,
S. K. Sahay
Abstract:
Data mining algorithms were originally designed assuming the data is available at one centralized site. These algorithms also assume that the whole data fits into main memory while running the algorithm. But in today's scenario the data to be handled is distributed, even geographically. Bringing the data into a centralized site is a bottleneck in terms of bandwidth when compared with the size of the data. In this paper, for multiclass SVM, we propose an algorithm which builds a global SVM model by merging the local SVMs using a distributed approach (DSVM). The global SVM is then communicated to each site and made available for further classification. The experimental analysis has shown promising results with better accuracy when compared with both the centralized and ensemble methods. The time complexity is also reduced drastically because of the parallel construction of local SVMs. The experiments are conducted by considering the data sets of size 100s to hundred of 100s which also addresses the issue of scalability.
Submitted 2 December, 2015;
originally announced December 2015.
-
Centroid Based Binary Tree Structured SVM for Multi Classification
Authors:
Aruna Govada,
Bhavul Gauri,
S. K. Sahay
Abstract:
Support Vector Machines (SVMs) were primarily designed for 2-class classification. But they have also been extended to N-class classification, based on the requirement of multiple classes in practical applications. Although N-class classification using SVM has received considerable research attention, getting the minimum number of classifiers at the time of training and testing is still an open research problem. We propose a new algorithm, CBTS-SVM (Centroid based Binary Tree Structured SVM), which addresses this issue. In this we build a binary tree of SVM models based on the similarity of the class labels by finding their distance from the corresponding centroids at the root level. The experimental results demonstrate comparable accuracy for CBTS and OVO with reasonable gamma and cost values. On the other hand, when CBTS is compared with OVA, it gives better accuracy with reduced training time and testing time. Furthermore, CBTS is also scalable, as it is able to handle large data sets.
Submitted 2 December, 2015;
originally announced December 2015.
-
Automated Document Indexing via Intelligent Hierarchical Clustering: A Novel Approach
Authors:
Rajendra Kumar Roul,
Shubham Rohan Asthana,
Sanjay Kumar Sahay
Abstract:
With the rising quantity of textual data available in electronic format, the need to organize it has become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical clustering algorithm to generate an index over a set of documents. The framework has been designed to be scalable and accurate even with large corpora. The advantage of the proposed algorithm lies in the need for minimal inputs, with much of the hierarchy attributes being decided in an automated manner using statistical methods. The use of topic modeling in a pre-processing stage ensures robustness to a range of variations in the input data. For experimental work, the 20-Newsgroups dataset has been used. The F-measure of the proposed approach has been compared with the traditional K-Means and K-Medoids clustering algorithms. Test results demonstrate the applicability, efficiency and effectiveness of our proposed approach. After extensive experimentation, we conclude that the framework shows promise for further research and specialized commercial applications.
Submitted 1 April, 2015;
originally announced April 2015.
-
A Novel Modified Apriori Approach for Web Document Clustering
Authors:
Rajendra Kumar Roul,
Saransh Varshneya,
Ashu Kalra,
Sanjay Kumar Sahay
Abstract:
The traditional apriori algorithm can be used for clustering web documents based on the association technique of data mining. But this algorithm has several limitations due to repeated database scans and its weak association rule analysis. In the modern world of large databases, the efficiency of the traditional apriori algorithm would reduce manifold. In this paper, we propose a new modified apriori approach that cuts down the repeated database scans and improves the association analysis of the traditional apriori algorithm to cluster web documents. Further, we improve those clusters by applying Fuzzy C-Means (FCM), K-Means and Vector Space Model (VSM) techniques separately. For experimental purposes, we use the Classic3 and Classic4 datasets of Cornell University, having more than 10,000 documents, and run both the traditional apriori and our modified apriori approach on them. Experimental results show that our approach outperforms the traditional apriori algorithm in terms of database scans and association analysis. We found that FCM is better than K-Means and VSM in terms of the F-measure of clusters of different sizes.
Submitted 29 March, 2015;
originally announced March 2015.
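The repeated database scans mentioned in the abstract above can be seen in a minimal sketch of the traditional apriori frequent-itemset step. This is the textbook algorithm, not the authors' modified approach; the function name `apriori`, the use of an absolute support count, and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    Textbook apriori: min_support is an absolute count (an assumption here).
    """
    transactions = [frozenset(t) for t in transactions]
    # Level-1 candidates: every distinct item as a 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full pass over the database per level: this is the
        # "repeated database scan" cost of the traditional algorithm.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy data: each transaction is the set of terms in one document.
docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(docs, min_support=3)
```

In this toy run every single item and every pair is frequent, while the triple appears in only two transactions and is dropped. Each level of the loop rescans all transactions, which is the cost a modified approach would aim to reduce.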
-
Evolution and Detection of Polymorphic and Metamorphic Malwares: A Survey
Authors:
Ashu Sharma,
S. K. Sahay
Abstract:
Malwares are a big threat to the digital world and are evolving with high complexity. They can penetrate networks, steal confidential information from computers, bring down servers and cripple infrastructure. To combat the threats and attacks from malwares, anti-malwares have been developed. The existing anti-malwares are mostly based on the assumption that the malware structure does not change appreciably. But recent advances in second generation malwares allow them to create variants, which poses a challenge to anti-malware developers. To combat the threats and attacks from second generation malwares with a low false alarm rate, we present our survey on malwares and their detection techniques.
Submitted 2 December, 2015; v1 submitted 27 June, 2014;
originally announced June 2014.
-
Web Document Clustering and Ranking using Tf-Idf based Apriori Approach
Authors:
R. K. Roul,
O. R. Devanand,
S. K. Sahay
Abstract:
The dynamic web has grown exponentially over the past few years, with thousands of documents related to a subject now available to the user. Most web documents are unstructured and not organized, and hence users find it more difficult to locate relevant documents. A more useful and efficient mechanism is to combine clustering with ranking, where clustering groups similar documents in one place and ranking is applied to each cluster so that the top documents are viewed first. Besides the particular clustering algorithm, the term weighting function applied to the selected features to represent a web document is a main aspect of the clustering task. Keeping this approach in mind, we propose a new mechanism called Tf-Idf based Apriori for clustering web documents. We then rank the documents in each cluster using Tf-Idf and the similarity factor of documents based on the user query. This approach helps the user to get all relevant documents in one place and to restrict the search to some top documents of choice. For experimental purposes, we have taken the Classic3 and Classic4 datasets of Cornell University, having more than 10,000 documents, and used the gensim toolkit to carry out our work. We have compared our approach with the traditional apriori algorithm and found that our approach gives better results for higher minimum support. Our ranking mechanism also gives a good F-measure of 78%.
Submitted 21 June, 2014;
originally announced June 2014.
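The ranking step described in the abstract above (Tf-Idf weights plus a similarity factor against the user query) can be illustrated with a minimal sketch. It is not the authors' implementation and does not use gensim: the helper names `tfidf_vectors`, `cosine` and `rank_cluster`, the smoothed idf formula, and the choice of cosine similarity as the similarity factor are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption): terms shared by all documents keep weight 1.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def rank_cluster(cluster_docs, query):
    """Return document indices of one cluster, most similar to the query first."""
    vectors, idf = tfidf_vectors(cluster_docs)
    # Query terms unseen in the cluster carry no weight.
    q_tf = Counter(t for t in query if t in idf)
    q_vec = {t: (c / len(query)) * idf[t] for t, c in q_tf.items()}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical cluster of three tokenized documents and a two-term query.
cluster = [
    ["gravitational", "wave", "detector"],
    ["web", "document", "clustering"],
    ["web", "search", "ranking", "document"],
]
order = rank_cluster(cluster, ["web", "ranking"])
```

Here the third document matches both query terms and is ranked first, the second matches only one term, and the first matches none. Ties are broken by original position so the ordering is deterministic.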
-
An Effective Approach for Web Document Classification using the Concept of Association Analysis of Data Mining
Authors:
R. K. Roul,
S. K. Sahay
Abstract:
The exponential growth of the web has increased the importance of web document classification and data mining. Getting exact information, in the form of knowing what classes a web document belongs to, is expensive. Automatic classification of web documents is of great use to search engines, which provide this information at a low cost. In this paper, we propose an approach for classifying web documents using the frequent item word sets generated by Frequent Pattern (FP) Growth, which is an association analysis technique of data mining. This set of associated words acts as the feature set. The final classification is obtained by applying a Naïve Bayes classifier to the feature set. For the experimental work, we use the Gensim package, as it is simple and robust. Results show that our approach can effectively classify web documents.
Submitted 21 June, 2014;
originally announced June 2014.
-
An effective web document clustering for information retrieval
Authors:
R. K. Roul,
S. K. Sahay
Abstract:
The size of the web has increased exponentially over the past few years, with thousands of documents related to a subject available to the user. With this amount of information available, it is not possible to take full advantage of the World Wide Web without a proper framework to search through the available data. This requisite organization can be done in many ways. In this paper we introduce a combined approach to cluster web pages, which first finds the frequent sets and then clusters the documents. These frequent sets are generated using the Frequent Pattern growth technique. Then, by applying the Fuzzy C-Means algorithm to them, we found clusters having documents which are highly related and have similar features. We used the Gensim package to implement our approach because of its simplicity and robust nature. We have compared our results with the combined approaches of (Frequent Pattern growth, K-means) and (Frequent Pattern growth, Cosine_Similarity). Experimental results show that our approach is more efficient than these two combined approaches and handles more efficiently the serious limitation of the traditional Fuzzy C-Means algorithm, which is sensitive to the initial centroids and the number of clusters to be formed.
Submitted 5 November, 2012;
originally announced November 2012.
-
An Effective Information Retrieval for Ambiguous Query
Authors:
R. K. Roul,
S. K. Sahay
Abstract:
A search engine returns thousands of web pages for a single user query, most of which are not relevant. In this context, effective information retrieval from the expanding web is a challenging task, in particular if the query is ambiguous. The major question that arises here is how to get the relevant pages for an ambiguous query. We propose an approach for the effective result of an ambiguous query by forming community vectors based on the association concept of data mining, using the vector space model and the freedictionary. We develop clusters by computing the similarity between community vectors and document vectors formed from the web pages extracted by the search engine. We use the Gensim package to implement the algorithm because of its simplicity and robust nature. Analysis shows that our approach is an effective way to form clusters for an ambiguous query.
Submitted 6 April, 2012;
originally announced April 2012.
-
On the independent points in the sky for the search of periodic gravitational wave
Authors:
S. K. Sahay
Abstract:
We investigate the independent points in the sky required to search for periodic gravitational waves, assuming the noise power spectral density to be flat. We have made an analysis with different initial azimuths of the Earth for a one-week data set. The analysis shows a significant difference in the independent points in the sky for the search. We numerically obtain an approximate relation to make a trade-off between computational cost and sensitivity. We also discuss the feasibility of a coherent search in a small frequency band in reference to advanced LIGO.
Submitted 5 March, 2008;
originally announced March 2008.
-
Earth azimuth effect in the bank of search templates for an all sky search of the continuous gravitational wave
Authors:
S. K. Sahay
Abstract:
We study the problem of all sky search in reference to continuous gravitational waves (CGW) whose waveforms are known in advance. We employ the concept of Fitting Factor and study the variation in the bank of search templates with different Earth azimuths at t=0. We found that the number of search templates varies significantly. Hence, accordingly, the computational demand for the search may be reduced by up to two orders of magnitude by time shifting the data.
Submitted 28 March, 2006;
originally announced March 2006.
-
Matching of the continuous gravitational wave in an all sky search
Authors:
S. K. Sahay
Abstract:
We investigate the matching of continuous gravitational wave (CGW) signals in an all sky search with reference to Earth based laser interferometric detectors. We consider the source location as the parameters of the signal manifold and templates corresponding to different source locations. It has been found that the matching of signals from locations in the sky that differ in their co-latitude and longitude by $π$ radians decreases with source frequency. We have also made an analysis with the other parameters affecting the symmetries. We observe that it may not be relevant to take care of the symmetries in the sky locations for the search of CGW from the output of LIGO-I, GEO600 and TAMA detectors.
Submitted 24 March, 2003; v1 submitted 31 December, 2002;
originally announced December 2002.
-
Studies in Gravitational Wave Data Analysis
Authors:
S. K. Sahay
Abstract:
This thesis is devoted to the investigation of gravitational wave (GW) data analysis for a continuous source, e.g. a pulsar or a binary star system. The first chapter is an introduction to gravitational waves and the second chapter is on the data analysis concepts for the detection of GW. In the third chapter we developed the Fourier Transform (FT) of a continuous gravitational wave (CGW) for ground based laser interferometric detectors for a data set of one day observation time, incorporating the effects arising due to the rotational as well as orbital motion of the Earth. The transform is applicable for arbitrary location of detector and source. In chapter four we have generalized the FT to data sets of (i) one year observation time and (ii) arbitrary observation time. As an application of the transform we considered spin down and N-component signal analysis. In the fifth chapter we have made an analysis of the number of templates required for matched filter analysis as applicable to these sources. We have employed the concept of Fitting Factor (FF), treating the source location as the parameters of the signal manifold, and have studied the matching of the signal with templates corresponding to different source locations. We have investigated the variation of FF with source location and have noticed a symmetry in the template parameters, $θ_T$ and $φ_T$. It has been found that two different template values in source location, each in $θ_T$ and $φ_T$, have the same FF. We have also computed the number of templates required assuming the noise power spectral density $S_n(f)$ to be flat. It is observed that a higher FF requires an exponentially increasing number of templates. The appendix contains the source codes developed for the computation.
Submitted 6 September, 2002;
originally announced September 2002.
-
Data analysis of continuous gravitational wave: All sky search and study of templates
Authors:
D. C. Srivastava,
S. K. Sahay
Abstract:
We have studied the problem of all sky search in reference to continuous gravitational waves, particularly for sources whose waveforms are known in advance. We have made an analysis of the number of templates required for matched filter analysis as applicable to these sources. We have employed the concept of fitting factor (FF), treating the source location as the parameters of the signal manifold, and have studied the matching of the signal with templates corresponding to different source locations. We have investigated the variation of FF with source location and have noticed a symmetry in the template parameters, $θ_T$ and $φ_T$. It has been found that two different template values in source location, each in $θ_T$ and $φ_T$, have the same FF. We have also computed the number of templates required assuming the noise power spectral density $S_n(f)$ to be flat. It is observed that a higher FF requires an exponentially increasing number of templates.
Submitted 3 September, 2002;
originally announced September 2002.
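The fitting factor used in the abstract above can be illustrated with a minimal sketch under a flat noise power spectral density, where the noise-weighted inner product reduces (up to a constant that cancels in the normalization) to a plain time-domain sum. The toy signal is a bare sinusoid with frequency as the only template parameter, not the Doppler-modulated CGW signal with source-location parameters studied by the authors; the function names and the sampling values are assumptions.

```python
import math


def inner(a, b):
    """Inner product for flat noise: a plain sum over time samples."""
    return sum(x * y for x, y in zip(a, b))


def match(signal, template):
    """Normalized overlap between signal and template, in [-1, 1]."""
    norm = math.sqrt(inner(signal, signal) * inner(template, template))
    return inner(signal, template) / norm if norm > 0 else 0.0


def sinusoid(freq, phase, n=512, dt=1.0 / 512):
    """Toy waveform: n samples of cos(2*pi*freq*t + phase), 1 s in total."""
    return [math.cos(2.0 * math.pi * freq * k * dt + phase) for k in range(n)]


def fitting_factor(signal, templates):
    """Best normalized overlap achieved by any template in the bank."""
    return max(match(signal, t) for t in templates)


signal = sinusoid(20.0, 0.3)
# A bank that contains the true frequency recovers the signal fully.
fine_bank = [sinusoid(f, 0.3) for f in (18.0, 19.0, 20.0, 21.0)]
# A coarser bank that skips it loses almost all of the overlap.
coarse_bank = [sinusoid(f, 0.3) for f in (18.0, 22.0)]
ff_fine = fitting_factor(signal, fine_bank)
ff_coarse = fitting_factor(signal, coarse_bank)
```

The gap between `ff_fine` and `ff_coarse` shows why demanding a higher fitting factor requires a denser template bank.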
-
Data analysis of continuous gravitational wave: Fourier transform-II
Authors:
D. C. Srivastava,
S. K. Sahay
Abstract:
In this paper we obtain the Fourier Transform of a continuous gravitational wave. We have analysed the data set for (i) one year observation time and (ii) arbitrary observation time, for arbitrary location of detector and source taking into account the effects arising due to rotational as well as orbital motion of the earth. As an application of the transform we considered spin down and N-component signal analysis.
Submitted 30 July, 2002;
originally announced July 2002.
-
Data analysis of continuous gravitational wave: Fourier transform-I
Authors:
D. C. Srivastava,
S. K. Sahay
Abstract:
We present the Fourier Transform of a continuous gravitational wave. We have analysed the data set for one day observation time and our analysis is applicable for arbitrary location of detector and source. We have taken into account the effects arising due to rotational as well as orbital motions of the earth.
Submitted 3 September, 2002; v1 submitted 16 November, 2001;
originally announced November 2001.
-
Data Analysis of Continuous Gravitational Wave Signal: Fourier Transform
Authors:
D. C. Srivastava,
S. K. Sahay
Abstract:
We present the Fourier Transform of a continuous gravitational wave for arbitrary location of detector and source and for any duration of observation time, in which both the rotational motion of the Earth about its spin axis and its orbital motion around the Sun have been taken into account. We also give a method to account for the spin down of a continuous gravitational wave.
Submitted 27 September, 2000;
originally announced September 2000.