-
Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions
Authors:
Rudolf Ferenc,
Péter Hegedűs,
Péter Gyimesi,
Gábor Antal,
Dénes Bán,
Tibor Gyimóthy
Abstract:
The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-th…
▽ More
The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
Static JavaScript Call Graphs: A Comparative Study
Authors:
Gábor Antal,
Péter Hegedűs,
Zoltán Tóth,
Rudolf Ferenc,
Tibor Gyimóthy
Abstract:
The popularity and wide adoption of JavaScript both at the client and server side makes its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for ca…
▽ More
The popularity and wide adoption of JavaScript both at the client and server side makes its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for call graph construction as they do not require extensive test beds for programs and their costly execution and tracing. In this paper, we systematically compare five widely adopted static algorithms - implemented by the npm call graph, IBM WALA, Google Closure Compiler, Approximate Call Graph, and Type Analyzer for JavaScript tools - for building JavaScript call graphs on 26 WebKit SunSpider benchmark programs and 6 real-world Node.js modules. We provide a performance analysis as well as a quantitative and qualitative evaluation of the results. We found that there was a relatively large intersection of the found call edges among the algorithms, which proved to be 100 precise. However, most of the tools found edges that were missed by all others. ACG had the highest precision followed immediately by TAJS, but ACG found significantly more call edges. As for the combination of tools, ACG and TAJS together covered 99% of the found true edges by all algorithms, while maintaining a precision as high as 98%. Only two of the tools were able to analyze up-to-date multi-file Node.js modules due to incomplete language features support. They agreed on almost 60% of the call edges, but each of them found valid edges that the other missed.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Authors:
Rudolf Ferenc,
Péter Gyimesi,
Gábor Gyimesi,
Zoltán Tóth,
Tibor Gyimóthy
Abstract:
Bugs are inescapable during software development due to frequent code changes, tight deadlines, etc.; therefore, it is important to have tools to find these errors. One way of performing bug identification is to analyze the characteristics of buggy source code elements from the past and predict the present ones based on the same characteristics, using e.g. machine learning models. To support model…
▽ More
Bugs are inescapable during software development due to frequent code changes, tight deadlines, etc.; therefore, it is important to have tools to find these errors. One way of performing bug identification is to analyze the characteristics of buggy source code elements from the past and predict the present ones based on the same characteristics, using e.g. machine learning models. To support model building tasks, code elements and their characteristics are collected in so-called bug datasets which serve as the input for learning.
We present the \emph{BugHunter Dataset}: a novel kind of automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information. Other available bug datasets follow the traditional approach of gathering the characteristics of all source code elements (buggy and non-buggy) at only one or more pre-selected release versions of the code. Our approach, on the other hand, captures the buggy and the fixed states of the same source code elements from the narrowest timeframe we can identify for a bug's presence, regardless of release versions. To show the usefulness of the new dataset, we built and evaluated bug prediction models and achieved F-measure values over 0.74.
△ Less
Submitted 17 June, 2020;
originally announced June 2020.
-
Slicing of Constraint Logic Programs
Authors:
Gyongyi Szilagyi,
Tibor Gyimothy,
Jan Maluszynski
Abstract:
Slicing is a program analysis technique originally developed for imperative languages. It facilitates understanding of data flow and debugging.
This paper discusses slicing of Constraint Logic Programs. Constraint Logic Programming (CLP) is an emerging software technology with a growing number of applications. Data flow in constraint programs is not explicit, and for this reason the concepts o…
▽ More
Slicing is a program analysis technique originally developed for imperative languages. It facilitates understanding of data flow and debugging.
This paper discusses slicing of Constraint Logic Programs. Constraint Logic Programming (CLP) is an emerging software technology with a growing number of applications. Data flow in constraint programs is not explicit, and for this reason the concepts of slice and the slicing techniques of imperative languages are not directly applicable.
This paper formulates declarative notions of slice suitable for CLP. They provide a basis for defining slicing techniques (both dynamic and static) based on variable sharing. The techniques are further extended by using groundness information.
A prototype dynamic slicer of CLP programs implementing the presented ideas is briefly described together with the results of some slicing experiments.
△ Less
Submitted 18 December, 2000;
originally announced December 2000.