
An Empirical Study on Bugs Inside PyTorch: A Replication Study

Authors: Sharon Chee Yin Ho, Vahid Majdinasab, Mohayeminul Islam, Diego Elias Costa, Emad Shihab, Foutse Khomh, Sarah Nadi, Muhammad Raza

Abstract: Software systems are increasingly relying on deep learning components, due to their remarkable capability of identifying complex data patterns and powering intelligent behaviour. A core enabler of this change in software development is the availability of easy-to-use deep learning libraries. Libraries like PyTorch and TensorFlow empower a large variety of intelligent systems, offering a multitude of algorithms and configuration options, applicable to numerous domains of systems. However, bugs in those popular deep learning libraries also may have dire consequences for the quality of systems they enable; thus, it is important to understand how bugs are identified and fixed in those libraries. Inspired by a study of Jia et al., which investigates the bug identification and fixing process at TensorFlow, we characterize bugs in the PyTorch library, a very popular deep learning framework. We investigate the causes and symptoms of bugs identified during PyTorch's development, and assess their locality within the project, and extract patterns of bug fixes. Our results highlight that PyTorch bugs are more like traditional software projects bugs, than related to deep learning characteristics. Finally, we also compare our results with the study on TensorFlow, highlighting similarities and differences across the bug identification and fixing process.

Submitted 1 August, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2304.04983

A Data Set of Generalizable Python Code Change Patterns

Authors: Akalanka Galappaththi, Sarah Nadi

Abstract: Mining repetitive code changes from version control history is a common way of discovering unknown change patterns. Such change patterns can be used in code recommender systems or automated program repair techniques. While such tools and datasets exist for Java, there is little work on finding and recommending such changes in Python. In this paper, we present a data set of manually vetted generalizable Python repetitive code change patterns. We create a coding guideline to identify generalizable change patterns that can be used in automated tooling. We leverage the mined change patterns from recent work that mines repetitive changes in Python projects and use our coding guideline to manually review the patterns. For each change, we also record a description of the change and why it is applied along with other characteristics such as the number of projects it occurs in. This review process allows us to identify and share 72 Python change patterns that can be used to build and advance Python developer support tools.

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2302.06527

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Authors: Max Schäfer, Sarah Nadi, Aryaz Eghbali, Frank Tip

Abstract: Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.

Submitted 11 December, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

arXiv:2207.01124

doi 10.1145/3643731

Characterizing Python Library Migrations

Authors: Mohayeminul Islam, Ajay Kumar Jha, Ildar Akhmetov, Sarah Nadi

Abstract: Developers heavily rely on Application Programming Interfaces (APIs) from libraries to build their software. As software evolves, developers may need to replace the used libraries with alternate libraries, a process known as library migration. Doing this manually can be tedious, time-consuming, and prone to errors. Automated migration techniques can help alleviate some of this burden. However, designing effective automated migration techniques requires understanding the types of code changes required to transform client code that used the old library to the new library. This paper contributes an empirical study that provides a holistic view of Python library migrations, both in terms of the code changes required in a migration and the typical development effort involved. We manually label 3,096 migration-related code changes in 335 Python library migrations from 311 client repositories spanning 141 library pairs from 35 domains. Based on our labeled data, we derive a taxonomy for describing migration-related code changes, PyMigTax. Leveraging PyMigTax and our labeled data, we investigate various characteristics of Python library migrations, such as the types of program elements and properties of API mappings, the combinations of types of migration-related code changes in a migration, and the typical development effort required for a migration. Our findings highlight various potential shortcomings of current library migration tools. For example, we find that 40% of library pairs have API mappings that involve non-function program elements, while most library migration techniques typically assume that function calls from the source library will map into (one or more) function calls from the target library. As an approximation for the development effort involved, we find that, on average, a developer needs to learn about 4 APIs and 2 API mappings to perform a migration, and ... (truncated)

Submitted 29 January, 2024; v1 submitted 3 July, 2022; originally announced July 2022.

arXiv:2204.00110

doi 10.1145/3524842.3528435

Does This Apply to Me? An Empirical Study of Technical Context in Stack Overflow

Authors: Akalanka Galappaththi, Sarah Nadi, Christoph Treude

Abstract: Stack Overflow has become an essential technical resource for developers. However, given the vast amount of knowledge available on Stack Overflow, finding the right information that is relevant for a given task is still challenging, especially when a developer is looking for a solution that applies to their specific requirements or technology stack. Clearly marking answers with their technical context, i.e., the information that characterizes the technologies and assumptions needed for this answer, is potentially one way to improve navigation. However, there is no information about how often such context is mentioned, and what kind of information it might offer. In this paper, we conduct an empirical study to understand the occurrence of technical context in Stack Overflow answers and comments, using tags as a proxy for technical context. We specifically focus on additional context, where answers/comments mention information that is not already discussed in the question. Our results show that nearly half of our studied threads contain at least one additional context. We find that almost 50% of the additional context are either a library/framework, a programming language, a tool/application, an API, or a database. Overall, our findings show the promise of using additional context as navigational cues.

Submitted 31 March, 2022; originally announced April 2022.

Journal ref: 19th International Conference on Mining Software Repositories (MSR '22), May 23–24, 2022, Pittsburgh, PA, USA

arXiv:2112.10370

Operation-based Refactoring-aware Merging: An Empirical Evaluation

Authors: Max Ellis, Sarah Nadi, Danny Dig

Abstract: Dealing with merge conflicts in version control systems is a challenging task for software developers. Resolving merge conflicts is a time-consuming and error-prone process, which distracts developers from important tasks. Recent work shows that refactorings are often involved in merge conflicts and that refactoring-related conflicts tend to be larger, making them harder to resolve. In the literature, there are two refactoring-aware merging techniques that claim to automatically resolve refactoring-related conflicts; however, these two techniques have never been empirically compared. In this paper, we present RefMerge, a rejuvenated Java-based design and implementation of the first technique, which is an operation-based refactoring-aware merging algorithm. We compare RefMerge to Git and the state-of-the-art graph-based refactoring-aware merging tool, IntelliMerge, on 2,001 merge scenarios with refactoring-related conflicts from 20 open-source projects. We find that RefMerge resolves or reduces conflicts in 497 (25%) merge scenarios while increasing conflicting LOC in only 214 (11%) scenarios. On the other hand, we find that IntelliMerge resolves or reduces conflicts in 478 (24%) merge scenarios but increases conflicting LOC in 597 (30%) merge scenarios. We additionally conduct a qualitative analysis of the differences between the three merging algorithms and provide insights of the strengths and weaknesses of each tool. We find that while IntelliMerge does well with ordering and formatting conflicts, it struggles with class-level refactorings and scenarios with several refactorings. On the other hand, RefMerge is resilient to the number of refactorings in a merge scenario, but we find that RefMerge introduces conflicts when inverting move-related refactorings.

Submitted 9 August, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

Comments: 23 pages, 17 figures, 7 tables

arXiv:2012.15342

ConfigFix: Interactive Configuration Conflict Resolution for the Linux Kernel

Authors: Patrick Franz, Thorsten Berger, Ibrahim Fayaz, Sarah Nadi, Evgeny Groshev

Abstract: Highly configurable systems are highly complex systems, with the Linux kernel arguably being one of the most well-known ones. Since 2007, it has been a frequent target of the research community, conducting empirical studies and building dedicated methods and tools for analyzing, configuring, testing, optimizing, and maintaining the kernel in the light of its vast configuration space. However, despite a large body of work, mainly bug fixes that were the result of such research made it back into the kernel's source tree. Unfortunately, Linux users still struggle with kernel configuration and resolving configuration conflicts, since the kernel largely lacks automated support. Additionally, there are technical and community requirements for supporting automated conflict resolution in the kernel, such as, for example, using a pure C-based solution that uses only compatible third-party libraries (if any). With the aim of contributing back to the Linux community, we present CONFIGFIX, a tooling that we integrated with the kernel configurator, that is purely implemented in C, and that is finally a working solution able to produce fixes for configuration conflicts. In this experience report, we describe our experiences ranging over a decade of building upon the large body of work from research on the Linux kernel configuration mechanisms as well as how we designed and realized CONFIGFIX while adhering to the Linux kernel's community requirements and standards. While CONFIGFIX helps Linux kernel users obtain their desired configuration, the sound semantic abstraction we implement provides the basis for many of the above techniques supporting kernel configuration, helping researchers and kernel developers.

Submitted 30 December, 2020; originally announced December 2020.

arXiv:2004.08378

On Using Stack Overflow Comment-Edit Pairs to recommend code maintenance changes

Authors: Henry Tang, Sarah Nadi

Abstract: Code maintenance data sets typically consist of a before and after version of the code that contains the improvement or fix. Such data sets are important for software engineering support tools related to code maintenance, such as program repair, code recommender systems, or Application Programming Interface (API) misuse detection. Most of the current data sets are constructed from mining commit history in version-control systems or issues in issue-tracking systems. In this paper, we investigate whether Stack Overflow can be used as an additional data source. Comments on Stack Overflow provide an effective way for developers to point out problems with existing answers, alternative solutions, or pitfalls. In this paper, we mine comment-edit pairs from Stack Overflow and investigate their potential usefulness. These pairs have the added benefit of having concrete descriptions of why the change is needed as well as potentially having less tangled changes to deal with. We first design a technique to extract related comment-edit pairs and then investigate the nature of these pairs. We find that the majority of comment-edit pairs are not tangled, but only 27% of the studied pairs are potentially useful for the above applications. We categorize the types of mined pairs and find that the highest ratio of useful pairs come from categories Correction, Obsolete, Flaw, and Extension. To demonstrate the effectiveness of our extracted pairs, we submitted 15 pull requests on GitHub, 10 of which have been accepted to widely used repositories such as Apache Beam and nltk. Our work is the first to investigate Stack Overflow comment-edit pairs and opens the door for future work in this direction. Based on our findings and observations, we provide concrete suggestions on how to potentially identify a larger set of useful comment-edit pairs, which can also be facilitated by our shared data.

Submitted 25 January, 2021; v1 submitted 17 April, 2020; originally announced April 2020.

Comments: 34 pages, 9 tables, 2 figures, submitted to EMSE

arXiv:1912.13455

Essential Sentences for Navigating Stack Overflow Answers

Authors: Sarah Nadi, Christoph Treude

Abstract: Stack Overflow (SO) has become an essential resource for software development. Despite its success and prevalence, navigating SO remains a challenge. Ideally, SO users could benefit from highlighted navigational cues that help them decide if an answer is relevant to their task and context. Such navigational cues could be in the form of essential sentences that help the searcher decide whether they want to read the answer or skip over it. In this paper, we compare four potential approaches for identifying essential sentences. We adopt two existing approaches and develop two new approaches based on the idea that contextual information in a sentence (e.g., "if using windows") could help identify essential sentences. We compare the four techniques using a survey of 43 participants. Our participants indicate that it is not always easy to figure out what the best solution for their specific problem is, given the options, and that they would indeed like to easily spot contextual information that may narrow down the search. Our quantitative comparison of the techniques shows that there is no single technique sufficient for identifying essential sentences that can serve as navigational cues, while our qualitative analysis shows that participants valued explanations and specific conditions, and did not value filler sentences or speculations. Our work sheds light on the importance of navigational cues, and our findings can be used to guide future research to find the best combination of techniques to identify such cues.

Submitted 31 December, 2019; originally announced December 2019.

Comments: to appear as full paper at SANER 2020, the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering

arXiv:1907.06274

Predicting Merge Conflicts in Collaborative Software Development

Authors: Moein Owhadi-Kareshk, Sarah Nadi, Julia Rubin

Abstract: Background. During collaborative software development, developers often use branches to add features or fix bugs. When merging changes from two branches, conflicts may occur if the changes are inconsistent. Developers need to resolve these conflicts before completing the merge, which is an error-prone and time-consuming process. Early detection of merge conflicts, which warns developers about resolving conflicts before they become large and complicated, is among the ways of dealing with this problem. Existing techniques do this by continuously pulling and merging all combinations of branches in the background to notify developers as soon as a conflict occurs, which is a computationally expensive process. One potential way for reducing this cost is to use a machine-learning based conflict predictor that filters out the merge scenarios that are not likely to have conflicts, i.e., safe merge scenarios. Aims. In this paper, we assess if conflict prediction is feasible. Method. We design a classifier for predicting merge conflicts, based on 9 light-weight Git feature sets. To evaluate our predictor, we perform a large-scale study on 267,657 merge scenarios from 744 GitHub repositories in seven programming languages. Results. Our results show that we achieve high f1-scores, varying from 0.95 to 0.97 for different programming languages, when predicting safe merge scenarios. The f1-score is between 0.57 and 0.68 for the conflicting merge scenarios. Conclusions. Predicting merge conflicts is feasible in practice, especially in the context of predicting safe merge scenarios as a pre-filtering step for speculative merging.

Submitted 14 July, 2019; originally announced July 2019.

arXiv:1801.02716

doi 10.1145/3196398.3196434

The Android Update Problem: An Empirical Study

Authors: Mehran Mahmoudi, Sarah Nadi

Abstract: Many phone vendors use Android as their underlying OS, but often extend it to add new functionality and to make it compatible with their specific phones. When a new version of Android is released, phone vendors need to merge or re-apply their customizations and changes to the new release. This is a difficult and time-consuming process, which often leads to late adoption of new versions. In this paper, we perform an empirical study to understand the nature of changes that phone vendors make, versus changes made in the original development of Android. By investigating the overlap of different changes, we also determine the possibility of having automated support for merging them. We develop a publicly available tool chain, based on a combination of existing tools, to study such changes and their overlap. As a proxy case study, we analyze the changes in the popular community-based variant of Android, LineageOS, and its corresponding Android versions. We investigate and report the common types of changes that occur in practice. Our findings show that 83% of subsystems modified by LineageOS are also modified in the next release of Android. By taking the nature of overlapping changes into account, we assess the feasibility of having automated tool support to help phone vendors with the Android update problem. Our results show that 56% of the changes in LineageOS have the potential to be safely automated.

Submitted 20 March, 2018; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: 11 pages, 4 figures, 4 tables

arXiv:1712.00242

A Systematic Evaluation of Static API-Misuse Detectors

Authors: Sven Amann, Hoan Anh Nguyen, Sarah Nadi, Tien N. Nguyen, Mira Mezini

Abstract: Application Programming Interfaces (APIs) often have usage constraints, such as restrictions on call order or call conditions. API misuses, i.e., violations of these constraints, may lead to software crashes, bugs, and vulnerabilities. Though researchers developed many API-misuse detectors over the last two decades, recent studies show that API misuses are still prevalent. Therefore, we need to understand the capabilities and limitations of existing detectors in order to advance the state of the art. In this paper, we present the first-ever qualitative and quantitative evaluation that compares static API-misuse detectors along the same dimensions, and with original author validation. To accomplish this, we develop MUC, a classification of API misuses, and MUBenchPipe, an automated benchmark for detector comparison, on top of our misuse dataset, MUBench. Our results show that the capabilities of existing detectors vary greatly and that existing detectors, though capable of detecting misuses, suffer from extremely low precision and recall. A systematic root-cause analysis reveals that, most importantly, detectors need to go beyond the naive assumption that a deviation from the most-frequent usage corresponds to a misuse and need to obtain additional usage examples to train their models. We present possible directions towards more-powerful API-misuse detectors.

Submitted 13 March, 2018; v1 submitted 1 December, 2017; originally announced December 2017.

Comments: Accepted for publication in IEEE Transactions on Software Engineering, March 12, 2018. Artifact page: http://www.st.informatik.tu-darmstadt.de/artifacts/mustudy/ (19 pages; 1 figure; 9 tables; 6 listings)

Showing 1–12 of 12 results for author: Nadi, S