Search | arXiv e-print repository

A Data-Mining Based Study of Security Vulnerability Types and Their Mitigation in Different Languages

Authors: Gábor Antal, Balázs Mosolygó, Norbert Vándor, Péter Hegedüs

Abstract: The number of people accessing online services is increasing day by day, and with new users, comes a greater need for effective and responsive cyber-security. Our goal in this study was to find out if there are common patterns within the most widely used programming languages in terms of security issues and fixes. In this paper, we showcase some statistics based on the data we extracted for these… ▽ More The number of people accessing online services is increasing day by day, and with new users, comes a greater need for effective and responsive cyber-security. Our goal in this study was to find out if there are common patterns within the most widely used programming languages in terms of security issues and fixes. In this paper, we showcase some statistics based on the data we extracted for these languages. Analyzing the more popular ones, we found that the same security issues might appear differently in different languages, and as such the provided solutions may vary just as much. We also found that projects with similar sizes can produce extremely different results, and have different common weaknesses, even if they provide a solution to the same task. These statistics may not be entirely indicative of the projects' standards when it comes to security, but they provide a good reference point of what one should expect. Given a larger sample size they could be made even more precise, and as such a better understanding of the security relevant activities within the projects written in given languages could be achieved. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07244 [pdf, other]

Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics

Authors: Gábor Antal, Zoltán Tóth, Péter Hegedűs, Rudolf Ferenc

Abstract: Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function level JavaScript b… ▽ More Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function level JavaScript bug prediction model based on static source code metrics with the addition of a hybrid (static and dynamic) code analysis based metric of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation for this is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using a purely static source code features for bug prediction might not be enough. Based on a study where we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2-10% increase in model performances (i.e., precision, recall, F-measure). Interestingly, replacing static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performances; however, using them all together yields the best results. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07213 [pdf, other]

Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions

Authors: Rudolf Ferenc, Péter Hegedűs, Péter Gyimesi, Gábor Antal, Dénes Bán, Tibor Gyimóthy

Abstract: The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-th… ▽ More The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07206 [pdf, other]

Static JavaScript Call Graphs: A Comparative Study

Authors: Gábor Antal, Péter Hegedűs, Zoltán Tóth, Rudolf Ferenc, Tibor Gyimóthy

Abstract: The popularity and wide adoption of JavaScript both at the client and server side makes its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for ca… ▽ More The popularity and wide adoption of JavaScript both at the client and server side makes its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for call graph construction as they do not require extensive test beds for programs and their costly execution and tracing. In this paper, we systematically compare five widely adopted static algorithms - implemented by the npm call graph, IBM WALA, Google Closure Compiler, Approximate Call Graph, and Type Analyzer for JavaScript tools - for building JavaScript call graphs on 26 WebKit SunSpider benchmark programs and 6 real-world Node.js modules. We provide a performance analysis as well as a quantitative and qualitative evaluation of the results. We found that there was a relatively large intersection of the found call edges among the algorithms, which proved to be 100 precise. However, most of the tools found edges that were missed by all others. ACG had the highest precision followed immediately by TAJS, but ACG found significantly more call edges. As for the combination of tools, ACG and TAJS together covered 99% of the found true edges by all algorithms, while maintaining a precision as high as 98%. Only two of the tools were able to analyze up-to-date multi-file Node.js modules due to incomplete language features support. They agreed on almost 60% of the call edges, but each of them found valid edges that the other missed. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2309.00687 [pdf, ps, other]

On Linear Codes with Random Multiplier Vectors and the Maximum Trace Dimension Property

Authors: Márton Erdélyi, Pál Hegedüs, Sándor Z. Kiss, Gábor P. Nagy

Abstract: Let $C$ be a linear code of length $n$ and dimension $k$ over the finite field $\mathbb{F}_{q^m}$. The trace code $\mathrm{Tr}(C)$ is a linear code of the same length $n$ over the subfield $\mathbb{F}_q$. The obvious upper bound for the dimension of the trace code over $\mathbb{F}_q$ is $mk$. If equality holds, then we say that $C$ has maximum trace dimension. The problem of finding the true dimen… ▽ More Let $C$ be a linear code of length $n$ and dimension $k$ over the finite field $\mathbb{F}_{q^m}$. The trace code $\mathrm{Tr}(C)$ is a linear code of the same length $n$ over the subfield $\mathbb{F}_q$. The obvious upper bound for the dimension of the trace code over $\mathbb{F}_q$ is $mk$. If equality holds, then we say that $C$ has maximum trace dimension. The problem of finding the true dimension of trace codes and their duals is relevant for the size of the public key of various code-based cryptographic protocols. Let $C_{\mathbf{a}}$ denote the code obtained from $C$ and a multiplier vector $\mathbf{a}\in (\mathbb{F}_{q^m})^n$. In this paper, we give a lower bound for the probability that a random multiplier vector produces a code $C_{\mathbf{a}}$ of maximum trace dimension. We give an interpretation of the bound for the class of algebraic geometry codes in terms of the degree of the defining divisor. The bound explains the experimental fact that random alternant codes have minimal dimension. Our bound holds whenever $n\geq m(k+h)$, where $h\geq 0$ is the Singleton defect of $C$. For the extremal case $n=m(h+k)$, numerical experiments reveal a closed connection between the probability of having maximum trace dimension and the probability that a random matrix has full rank. △ Less

Submitted 1 September, 2023; originally announced September 2023.

MSC Class: 11T71; 15A03

arXiv:2303.16591 [pdf, other]

An AST-based Code Change Representation and its Performance in Just-in-time Vulnerability Prediction

Authors: Tamás Aladics, Péter Hegedűs, Rudolf Ferenc

Abstract: The presence of software vulnerabilities is an ever-growing issue in software development. In most cases, it is desirable to detect vulnerabilities as early as possible, preferably in a just-in-time manner, when the vulnerable piece is added to the code base. The industry has a hard time combating this problem as manual inspection is costly and traditional means, such as rule-based bug detection,… ▽ More The presence of software vulnerabilities is an ever-growing issue in software development. In most cases, it is desirable to detect vulnerabilities as early as possible, preferably in a just-in-time manner, when the vulnerable piece is added to the code base. The industry has a hard time combating this problem as manual inspection is costly and traditional means, such as rule-based bug detection, are not robust enough to follow the pace of the emergence of new vulnerabilities. The actively researched field of machine learning could help in such situations as models can be trained to detect vulnerable patterns. However, machine learning models work well only if the data is appropriately represented. In our work, we propose a novel way of representing changes in source code (i.e. code commits), the Code Change Tree, a form that is designed to keep only the differences between two abstract syntax trees of Java source code. We compared its effectiveness in predicting if a code change introduces a vulnerability against multiple representation types and evaluated them by a number of machine learning models as a baseline. The evaluation is done on a novel dataset that we published as part of our contributions using a 2-phase dataset generator method. Based on our evaluation we concluded that using Code Change Tree is a valid and effective choice to represent source code changes as it improves performance. △ Less

Submitted 29 March, 2023; originally announced March 2023.

arXiv:2108.02044 [pdf, ps, other]

A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

Authors: Amirreza Bagheri, Péter Hegedűs

Abstract: In the age of big data and machine learning, at a time when the techniques and methods of software development are evolving rapidly, a problem has arisen: programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can now rely on automatic techniques, like machine learning based prediction models, to detect such issues.… ▽ More In the age of big data and machine learning, at a time when the techniques and methods of software development are evolving rapidly, a problem has arisen: programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can now rely on automatic techniques, like machine learning based prediction models, to detect such issues. An inherent property of such approaches is that they work with numeric vectors (i.e., feature vectors) as inputs. Therefore, one needs to transform the source code into such feature vectors, often referred to as code embedding. A popular approach for code embedding is to adapt natural language processing techniques, like text representation, to automatically derive the necessary features from the source code. However, the suitability and comparison of different text representation techniques for solving Software Engineering (SE) problems is rarely studied systematically. In this paper, we present a comparative study on three popular text representation methods, word2vec, fastText, and BERT applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms that we embedded with word2vec, fastText, and BERT to vectors and used a Long Short-Term Memory network to train on them. Using the same LSTM architecture, we could compare the efficiency of the different embeddings in deriving meaningful feature vectors. Our findings show that all the text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising as it is the least time consuming and the LSTM model based on it achieved the best overall accuracy(93.8%) in predicting Python source code vulnerabilities. △ Less

Submitted 4 August, 2021; originally announced August 2021.

arXiv:2105.07527 [pdf, other]

Improving Vulnerability Prediction of JavaScript Functions Using Process Metrics

Authors: Tamás Viszkok, Péter Hegedűs, Rudolf Ferenc

Abstract: Due to the growing number of cyber attacks against computer systems, we need to pay special attention to the security of our software systems. In order to maximize the effectiveness, excluding the human component from this process would be a huge breakthrough. The first step towards this is to automatically recognize the vulnerable parts in our code. Researchers put a lot of effort into creating m… ▽ More Due to the growing number of cyber attacks against computer systems, we need to pay special attention to the security of our software systems. In order to maximize the effectiveness, excluding the human component from this process would be a huge breakthrough. The first step towards this is to automatically recognize the vulnerable parts in our code. Researchers put a lot of effort into creating machine learning models that could determine if a given piece of code, or to be more precise, a selected function, contains any vulnerabilities or not. We aim at improving the existing models, building on previous results in predicting vulnerabilities at the level of functions in JavaScript code using the well-known static source code metrics. In this work, we propose to include several so-called process metrics (e.g., code churn, number of developers modifying a file, or the age of the changed source code) into the set of features, and examine how they affect the performance of the function-level JavaScript vulnerability prediction models. We can confirm that process metrics significantly improve the prediction power of such models. On average, we observed a 8.4% improvement in terms of F-measure (from 0.764 to 0.848), 3.5% improvement in terms of precision (from 0.953 to 0.988) and a 6.3% improvement in terms of recall (from 0.697 to 0.760). △ Less

Submitted 16 May, 2021; originally announced May 2021.

arXiv:2103.09604 [pdf, other]

On the Rise and Fall of Simple Stupid Bugs: a Life-Cycle Analysis of SStuBs

Authors: Balázs Mosolygó, Norbert Vándor, Gábor Antal, Péter Hegedűs

Abstract: Bug detection and prevention is one of the most important goals of software quality assurance. Nowadays, many of the major problems faced by developers can be detected or even fixed fully or partially with automatic tools. However, recent works explored that there exists a substantial amount of simple yet very annoying errors in code-bases, which are easy to fix, but hard to detect as they do not… ▽ More Bug detection and prevention is one of the most important goals of software quality assurance. Nowadays, many of the major problems faced by developers can be detected or even fixed fully or partially with automatic tools. However, recent works explored that there exists a substantial amount of simple yet very annoying errors in code-bases, which are easy to fix, but hard to detect as they do not hinder the functionality of the given product in a major way. Programmers introduce such errors accidentally, mostly due to inattention. Using the ManySStuBs4J dataset, which contains many simple, stupid bugs, found in GitHub repositories written in the Java programming language, we investigated the history of such bugs. We were interested in properties such as: How long do such bugs stay unnoticed in code-bases? Whether they are typically fixed by the same developer who introduced them? Are they introduced with the addition of new code or caused more by careless modification of existing code? We found that most of such stupid bugs lurk in the code for a long time before they get removed. We noticed that the developer who made the mistake seems to find a solution faster, however less then half of SStuBs are fixed by the same person. We also examined PMD's performance when to came to flagging lines containing SStuBs, and found that similarly to SpotBugs, it is insufficient when it comes to finding these types of errors. Examining the life-cycle of such bugs allows us to better understand their nature and adjust our development processes and quality assurance methods to better support avoiding them. △ Less

Submitted 17 March, 2021; originally announced March 2021.

arXiv:2011.01214 [pdf, other]

Employing Partial Least Squares Regression with Discriminant Analysis for Bug Prediction

Authors: Rudolf Ferenc, István Siket, Péter Hegedűs, Róbert Rajkó

Abstract: Forecasting defect proneness of source code has long been a major research concern. Having an estimation of those parts of a software system that most likely contain bugs may help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches have been introduced during the past decades that try to forecast bugged code elements based on static source code m… ▽ More Forecasting defect proneness of source code has long been a major research concern. Having an estimation of those parts of a software system that most likely contain bugs may help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches have been introduced during the past decades that try to forecast bugged code elements based on static source code metrics, change and history metrics, or both. However, there is still no universal best solution to this problem, as most suitable features and models vary from dataset to dataset and depend on the context in which we use them. Therefore, novel approaches and further studies on this topic are highly necessary. In this paper, we employ a chemometric approach - Partial Least Squares with Discriminant Analysis (PLS-DA) - for predicting bug prone Classes in Java programs using static source code metrics. To our best knowledge, PLS-DA has never been used before as a statistical approach in the software maintenance domain for predicting software errors. In addition, we have used rigorous statistical treatments including bootstrap resampling and randomization (permutation) test, and evaluation for representing the software engineering results. We show that our PLS-DA based prediction model achieves superior performances compared to the state-of-the-art approaches (i.e. F-measure of 0.44-0.47 at 90% confidence level) when no data re-sampling applied and comparable to others when applying up-sampling on the largest open bug dataset, while training the model is significantly faster, thus finding optimal parameters is much easier. In terms of completeness, which measures the amount of bugs contained in the Java Classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and up-sampling, respectively. △ Less

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2006.13652 [pdf, other]

Exploring the Security Awareness of the Python and JavaScript Open Source Communities

Authors: Gábor Antal, Márton Keleti, Péter Hegedűs

Abstract: Software security is undoubtedly a major concern in today's software engineering. Although the level of awareness of security issues is often high, practical experiences show that neither preventive actions nor reactions to possible issues are always addressed properly in reality. By analyzing large quantities of commits in the open-source communities, we can categorize the vulnerabilities mitigat… ▽ More Software security is undoubtedly a major concern in today's software engineering. Although the level of awareness of security issues is often high, practical experiences show that neither preventive actions nor reactions to possible issues are always addressed properly in reality. By analyzing large quantities of commits in the open-source communities, we can categorize the vulnerabilities mitigated by the developers and study their distribution, resolution time, etc. to learn and improve security management processes and practices. With the help of the Software Heritage Graph Dataset, we investigated the commits of two of the most popular script languages -- Python and JavaScript -- projects collected from public repositories and identified those that mitigate a certain vulnerability in the code (i.e. vulnerability resolution commits). On the one hand, we identified the types of vulnerabilities (in terms of CWE groups) referred to in commit messages and compared their numbers within the two communities. On the other hand, we examined the average time elapsing between the publish date of a vulnerability and the first reference to it in a commit. We found that there is a large intersection in the vulnerability types mitigated by the two communities, but most prevalent vulnerabilities are specific to language. Moreover, neither the JavaScript nor the Python community reacts very fast to appearing security vulnerabilities in general with only a couple of exceptions for certain CWE groups. △ Less

Submitted 24 June, 2020; originally announced June 2020.

Comments: 17th International Conference on Mining Software Repositories

Showing 1–11 of 11 results for author: Hegedűs, P