-
The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks
Authors:
Ashwin Prasad Shivarpatna Venkatesh,
Samkutty Sabu,
Amir M. Mir,
Sofia Reis,
Eric Bodden
Abstract:
The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT seri…
▽ More
The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA. Our study reveals that LLMs show promising results in type inference, demonstrating higher accuracy than traditional methods, yet they exhibit limitations in callgraph analysis. This contrast emphasizes the need for specialized fine-tuning of LLMs to better suit specific static analysis tasks. Our findings provide a foundation for further research towards integrating LLMs for static analysis tasks.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools
Authors:
Ashwin Prasad Shivarpatna Venkatesh,
Samkutty Sabu,
Jiawei Wang,
Amir M. Mir,
Li Li,
Eric Bodden
Abstract:
In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a comprehensive micro-benchmarking framework for evaluating type inference tools. TypeEvalPy contains 154 code snippets with 845 type annotations across 18 categori…
▽ More
In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a comprehensive micro-benchmarking framework for evaluating type inference tools. TypeEvalPy contains 154 code snippets with 845 type annotations across 18 categories that target various Python features. The framework manages the execution of containerized tools, transforms inferred types into a standardized format, and produces meaningful metrics for assessment. Through our analysis, we compare the performance of six type inference tools, highlighting their strengths and limitations. Our findings provide a foundation for further research and optimization in the domain of Python type inference.
△ Less
Submitted 2 January, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Static Analysis Driven Enhancements for Comprehension in Machine Learning Notebooks
Authors:
Ashwin Prasad Shivarpatna Venkatesh,
Samkutty Sabu,
Mouli Chekkapalli,
Jiawei Wang,
Li Li,
Eric Bodden
Abstract:
Jupyter notebooks enable developers to interleave code snippets with rich-text and in-line visualizations. Data scientists use Jupyter notebook as the de-facto standard for creating and sharing machine-learning based solutions, primarily written in Python. Recent studies have demonstrated, however, that a large portion of Jupyter notebooks available on public platforms are undocumented and lacks a…
▽ More
Jupyter notebooks enable developers to interleave code snippets with rich-text and in-line visualizations. Data scientists use Jupyter notebook as the de-facto standard for creating and sharing machine-learning based solutions, primarily written in Python. Recent studies have demonstrated, however, that a large portion of Jupyter notebooks available on public platforms are undocumented and lacks a narrative structure. This reduces the readability of these notebooks. To address this shortcoming, this paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers based on a taxonomy of ML operations, and classifies and displays function calls according to this taxonomy. For this functionality to be realized, HeaderGen enhances an existing call graph analysis in PyCG. To improve precision, HeaderGen extends PyCG's analysis with support for handling external library code and flow-sensitivity. The former is realized by facilitating the resolution of function return-types. The evaluation on 15 real-world Jupyter notebooks from Kaggle shows that HeaderGen's underlying call graph analysis yields high accuracy (95.6% precision and 95.3% recall). This is because HeaderGen can resolve return-types of external libraries where existing type inference tools such as pytype (by Google), pyright (by Microsoft), and Jedi fall short. The header generation has a precision of 85.7% and a recall rate of 92.8%. In a user study, HeaderGen helps participants finish comprehension and navigation tasks faster. To further evaluate the type inference capability of tools, we introduce TypeEvalPy, a framework for evaluating type inference tools with a micro-benchmark containing 154 code snippets and 845 type annotations. Our comparative analysis on four tools revealed that HeaderGen outperforms other tools in exact matches with the ground truth.
△ Less
Submitted 11 June, 2024; v1 submitted 11 January, 2023;
originally announced January 2023.