A Data-Mining Based Study of Security Vulnerability Types and Their Mitigation in Different Languages

Authors: Gábor Antal, Balázs Mosolygó, Norbert Vándor, Péter Hegedűs

Abstract: The number of people accessing online services is increasing day by day, and with new users comes a greater need for effective and responsive cyber-security. Our goal in this study was to find out if there are common patterns within the most widely used programming languages in terms of security issues and fixes. In this paper, we showcase some statistics based on the data we extracted for these languages. Analyzing the more popular ones, we found that the same security issues might appear differently in different languages, and as such the provided solutions may vary just as much. We also found that projects with similar sizes can produce extremely different results, and have different common weaknesses, even if they provide a solution to the same task. These statistics may not be entirely indicative of the projects' standards when it comes to security, but they provide a good reference point of what one should expect. Given a larger sample size, they could be made even more precise, and as such a better understanding of the security-relevant activities within the projects written in given languages could be achieved.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07244

Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics

Authors: Gábor Antal, Zoltán Tóth, Péter Hegedűs, Rudolf Ferenc

Abstract: Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function-level JavaScript bug prediction model based on static source code metrics with the addition of hybrid (static and dynamic) code analysis based metrics of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation for this is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using purely static source code features for bug prediction might not be enough. Based on a study where we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2-10% increase in model performances (i.e., precision, recall, F-measure). Interestingly, replacing static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performances; however, using them all together yields the best results.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07213

Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions

Authors: Rudolf Ferenc, Péter Hegedűs, Péter Gyimesi, Gábor Antal, Dénes Bán, Tibor Gyimóthy

Abstract: The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07206

Static JavaScript Call Graphs: A Comparative Study

Authors: Gábor Antal, Péter Hegedűs, Zoltán Tóth, Rudolf Ferenc, Tibor Gyimóthy

Abstract: The popularity and wide adoption of JavaScript both at the client and server side make its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for call graph construction as they do not require extensive test beds for programs and their costly execution and tracing. In this paper, we systematically compare five widely adopted static algorithms - implemented by the npm call graph, IBM WALA, Google Closure Compiler, Approximate Call Graph, and Type Analyzer for JavaScript tools - for building JavaScript call graphs on 26 WebKit SunSpider benchmark programs and 6 real-world Node.js modules. We provide a performance analysis as well as a quantitative and qualitative evaluation of the results. We found that there was a relatively large intersection of the found call edges among the algorithms, which proved to be 100% precise. However, most of the tools found edges that were missed by all others. ACG had the highest precision followed immediately by TAJS, but ACG found significantly more call edges. As for the combination of tools, ACG and TAJS together covered 99% of the true edges found by all algorithms, while maintaining a precision as high as 98%. Only two of the tools were able to analyze up-to-date multi-file Node.js modules due to incomplete language features support. They agreed on almost 60% of the call edges, but each of them found valid edges that the other missed.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.07204

Transforming C++11 Code to C++03 to Support Legacy Compilation Environments

Authors: Gábor Antal, Dávid Havas, István Siket, Árpád Beszédes, Rudolf Ferenc, József Mihalicza

Abstract: Newer technologies - programming languages, environments, libraries - change very rapidly. However, various internal and external constraints often prevent projects from quickly adapting to these changes. Customers may require specific platform compatibility from a software vendor, for example. In this work, we deal with such an issue in the context of the C++ programming language. Our industrial partner is required to use SDKs that support only older C++ language editions. They, however, would like to allow their developers to use the newest language constructs in their code. To address this problem, we created a source code transformation framework to automatically backport source code written according to the C++11 standard to its functionally equivalent C++03 variant. With our framework, developers are free to exploit the latest language features, while production code is still built by using a restricted set of available language constructs. This paper reports on the technical details of the transformation engine, and our experiences in applying it on two large industrial code bases and four open-source systems. Our solution is freely available and open-source.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2404.14370

Assessing GPT-4-Vision's Capabilities in UML-Based Code Generation

Authors: Gábor Antal, Richárd Vozár, Rudolf Ferenc

Abstract: The emergence of advanced neural networks has opened up new ways in automated code generation from conceptual models, promising to enhance software development processes. This paper presents a preliminary evaluation of GPT-4-Vision, a state-of-the-art deep learning model, and its capabilities in transforming Unified Modeling Language (UML) class diagrams into fully operating Java class files. In our study, we used exported images of 18 class diagrams comprising 10 single-class and 8 multi-class diagrams. We used 3 different prompts for each input, and we manually evaluated the results. We created a scoring system in which we scored the occurrence of elements found in the diagram within the source code. On average, the model was able to generate source code for 88% of the elements shown in the diagrams. Our results indicate that GPT-4-Vision exhibits proficiency in handling single-class UML diagrams, successfully transforming them into syntactically correct class files. However, for multi-class UML diagrams, the model's performance is weaker compared to single-class diagrams. In summary, further investigations are necessary to exploit the model's potential completely.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2103.09604

On the Rise and Fall of Simple Stupid Bugs: a Life-Cycle Analysis of SStuBs

Authors: Balázs Mosolygó, Norbert Vándor, Gábor Antal, Péter Hegedűs

Abstract: Bug detection and prevention is one of the most important goals of software quality assurance. Nowadays, many of the major problems faced by developers can be detected or even fixed fully or partially with automatic tools. However, recent works have shown that there exists a substantial amount of simple yet very annoying errors in code-bases, which are easy to fix, but hard to detect as they do not hinder the functionality of the given product in a major way. Programmers introduce such errors accidentally, mostly due to inattention. Using the ManySStuBs4J dataset, which contains many simple, stupid bugs found in GitHub repositories written in the Java programming language, we investigated the history of such bugs. We were interested in properties such as: How long do such bugs stay unnoticed in code-bases? Are they typically fixed by the same developer who introduced them? Are they introduced with the addition of new code or caused more by careless modification of existing code? We found that most of such stupid bugs lurk in the code for a long time before they get removed. We noticed that the developer who made the mistake seems to find a solution faster; however, less than half of SStuBs are fixed by the same person. We also examined PMD's performance when it came to flagging lines containing SStuBs, and found that, similarly to SpotBugs, it is insufficient when it comes to finding these types of errors. Examining the life-cycle of such bugs allows us to better understand their nature and adjust our development processes and quality assurance methods to better support avoiding them.

Submitted 17 March, 2021; originally announced March 2021.

arXiv:2006.13652

Exploring the Security Awareness of the Python and JavaScript Open Source Communities

Authors: Gábor Antal, Márton Keleti, Péter Hegedűs

Abstract: Software security is undoubtedly a major concern in today's software engineering. Although the level of awareness of security issues is often high, practical experiences show that neither preventive actions nor reactions to possible issues are always addressed properly in reality. By analyzing large quantities of commits in the open-source communities, we can categorize the vulnerabilities mitigated by the developers and study their distribution, resolution time, etc. to learn and improve security management processes and practices. With the help of the Software Heritage Graph Dataset, we investigated the commits of projects written in two of the most popular scripting languages, Python and JavaScript, collected from public repositories, and identified those that mitigate a certain vulnerability in the code (i.e., vulnerability resolution commits). On the one hand, we identified the types of vulnerabilities (in terms of CWE groups) referred to in commit messages and compared their numbers within the two communities. On the other hand, we examined the average time elapsing between the publication date of a vulnerability and the first reference to it in a commit. We found that there is a large intersection in the vulnerability types mitigated by the two communities, but the most prevalent vulnerabilities are language-specific. Moreover, neither the JavaScript nor the Python community reacts very fast to newly appearing security vulnerabilities in general, with only a couple of exceptions for certain CWE groups.

Submitted 24 June, 2020; originally announced June 2020.

Comments: 17th International Conference on Mining Software Repositories

Showing 1–8 of 8 results for author: Antal, G