-
Semantically Aligned Question and Code Generation for Automated Insight Generation
Authors:
Ananya Singha,
Bhavya Chopra,
Anirudh Khatry,
Sumit Gulwani,
Austin Z. Henley,
Vu Le,
Chris Parnin,
Mukul Singh,
Gust Verbruggen
Abstract:
Automated insight generation is a common tactic for hel** knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large-language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to…
▽ More
Automated insight generation is a common tactic for hel** knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large-language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data and the corresponding code to answer those questions. Then through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. Additionally, we found that generating questions and code together yields more diverse questions.
△ Less
Submitted 21 March, 2024;
originally announced May 2024.
-
Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants
Authors:
Bhavya Chopra,
Yasharth Bajpai,
Param Biyani,
Gustavo Soares,
Arjun Radhakrishna,
Chris Parnin,
Sumit Gulwani
Abstract:
The widespread availability of Large Language Models (LLMs) within Integrated Development Environments (IDEs) has led to their speedy adoption. Conversational interactions with LLMs enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate respo…
▽ More
The widespread availability of Large Language Models (LLMs) within Integrated Development Environments (IDEs) has led to their speedy adoption. Conversational interactions with LLMs enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the the right questions and sustaining conversations across multiple turns. In this paper, we draw inspiration from interaction patterns and conversation analysis -- to design Robin, an enhanced conversational AI-assistant for debugging. Through a within-subjects user study with 12 industry professionals, we find that equip** the LLM to -- (1) leverage the insert expansion interaction pattern, (2) facilitate turn-taking, and (3) utilize debugging workflows -- leads to lowered conversation barriers, effective fault localization, and 5x improvement in bug resolution rates.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
Building Your Own Product Copilot: Challenges, Opportunities, and Needs
Authors:
Chris Parnin,
Gustavo Soares,
Rahul Pandita,
Sumit Gulwani,
Jessica Rich,
Austin Z. Henley
Abstract:
A race is underway to embed advanced AI capabilities into products. These product copilots enable users to ask questions in natural language and receive relevant responses that are specific to the user's context. In fact, virtually every large technology company is looking to add these capabilities to their software products. However, for most software engineers, this is often their first encounte…
▽ More
A race is underway to embed advanced AI capabilities into products. These product copilots enable users to ask questions in natural language and receive relevant responses that are specific to the user's context. In fact, virtually every large technology company is looking to add these capabilities to their software products. However, for most software engineers, this is often their first encounter with integrating AI-powered technology. Furthermore, software engineering processes and tools have not caught up with the challenges and scale involved with building AI-powered applications. In this work, we present the findings of an interview study with 26 professional software engineers responsible for building product copilots at various companies. From our interviews, we found pain points at every step of the engineering process and the challenges that strained existing development practices. We then conducted group brainstorming sessions to collaborative on opportunities and tool designs for the broader software engineering community.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
Conversational Challenges in AI-Powered Data Science: Obstacles, Needs, and Design Opportunities
Authors:
Bhavya Chopra,
Ananya Singha,
Anna Fariha,
Sumit Gulwani,
Chris Parnin,
Ashish Tiwari,
Austin Z. Henley
Abstract:
Large Language Models (LLMs) are being increasingly employed in data science for tasks like data preprocessing and analytics. However, data scientists encounter substantial obstacles when conversing with LLM-powered chatbots and acting on their suggestions and answers. We conducted a mixed-methods study, including contextual observations, semi-structured interviews (n=14), and a survey (n=114), to…
▽ More
Large Language Models (LLMs) are being increasingly employed in data science for tasks like data preprocessing and analytics. However, data scientists encounter substantial obstacles when conversing with LLM-powered chatbots and acting on their suggestions and answers. We conducted a mixed-methods study, including contextual observations, semi-structured interviews (n=14), and a survey (n=114), to identify these challenges. Our findings highlight key issues faced by data scientists, including contextual data retrieval, formulating prompts for complex tasks, adapting generated code to local environments, and refining prompts iteratively. Based on these insights, we propose actionable design recommendations, such as data brushing to support context selection, and inquisitive feedback loops to improve communications with AI-based assistants in data-science tools.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
Authors:
Ananya Singha,
José Cambronero,
Sumit Gulwani,
Vu Le,
Chris Parnin
Abstract:
Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLMs ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In…
▽ More
Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLMs ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Correlates of Programmer Efficacy and Their Link to Experience: A Combined EEG and Eye-Tracking Study
Authors:
Norman Peitek,
Annabelle Bergum,
Maurice Rekrut,
Jonas Mucke,
Matthias Nadig,
Chris Parnin,
Janet Siegmund,
Sven Apel
Abstract:
Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood.
Aims: We aim at unraveling the relationship between efficacy (speed and correctness) and measures of progr…
▽ More
Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood.
Aims: We aim at unraveling the relationship between efficacy (speed and correctness) and measures of programming experience. We further investigate the correlates of programmer efficacy in terms of reading behavior and cognitive load.
Method: For this purpose, we conducted a controlled experiment with 37~participants using electroencephalography (EEG) and eye tracking. We asked participants to comprehend up to 32 Java source-code snippets and observed their eye gaze and neural correlates of cognitive load. We analyzed the correlation of participants' efficacy with popular programming experience measures.
Results: We found that programmers with high efficacy read source code more targeted and with lower cognitive load. Commonly used experience levels do not predict programmer efficacy well, but self-estimation and indicators of learning eagerness are fairly accurate.
Implications: The identified correlates of programmer efficacy can be used for future research and practice (e.g., hiring). Future research should also consider efficacy as a group sampling method, rather than using simple experience measures.
△ Less
Submitted 13 March, 2023;
originally announced March 2023.
-
Detecting and Characterizing Propagation of Security Weaknesses in Puppet-based Infrastructure Management
Authors:
Akond Rahman,
Chris Parnin
Abstract:
Despite being beneficial for managing computing infrastructure automatically, Puppet manifests are susceptible to security weaknesses, e.g., hard-coded secrets and use of weak cryptography algorithms. Adequate mitigation of security weaknesses in Puppet manifests is thus necessary to secure computing infrastructure that are managed with Puppet manifests. A characterization of how security weakness…
▽ More
Despite being beneficial for managing computing infrastructure automatically, Puppet manifests are susceptible to security weaknesses, e.g., hard-coded secrets and use of weak cryptography algorithms. Adequate mitigation of security weaknesses in Puppet manifests is thus necessary to secure computing infrastructure that are managed with Puppet manifests. A characterization of how security weaknesses propagate and affect Puppet-based infrastructure management, can inform practitioners on the relevance of the detected security weaknesses, as well as help them take necessary actions for mitigation. To that end, we conduct an empirical study with 17,629 Puppet manifests mined from 336 open source repositories. We construct Taint Tracker for Puppet Manifests (TaintPup), for which we observe 2.4 times more precision compared to that of a state-of-the-art security static analysis tool. TaintPup leverages Puppet-specific information flow analysis using which we characterize propagation of security weaknesses. From our empirical study, we observe security weaknesses to propagate into 4,457 resources, i.e, Puppet-specific code elements used to manage infrastructure. A single instance of a security weakness can propagate into as many as 35 distinct resources. We observe security weaknesses to propagate into 7 categories of resources, which include resources used to manage continuous integration servers and network controllers. According to our survey with 24 practitioners, propagation of security weaknesses into data storage-related resources is rated to have the most severe impact for Puppet-based infrastructure management.
△ Less
Submitted 2 August, 2022;
originally announced August 2022.
-
Dozer: Migrating Shell Commands to Ansible Modules via Execution Profiling and Synthesis
Authors:
Eric Horton,
Chris Parnin
Abstract:
Software developers frequently use the system shell to perform configuration management tasks. Unfortunately, the shell does not scale well to large systems, and configuration management tools like Ansible are more difficult to learn. We address this problem with Dozer, a technique to help developers push their shell commands into Ansible task definitions. It operates by tracing and comparing syst…
▽ More
Software developers frequently use the system shell to perform configuration management tasks. Unfortunately, the shell does not scale well to large systems, and configuration management tools like Ansible are more difficult to learn. We address this problem with Dozer, a technique to help developers push their shell commands into Ansible task definitions. It operates by tracing and comparing system calls to find Ansible modules with similar behaviors to shell commands, then generating and validating migrations to find the task which produces the most similar changes to the system. Dozer is syntax agnostic, which should allow it to generalize to other configuration management platforms. We evaluate Dozer using datasets from open source configuration scripts.
△ Less
Submitted 22 March, 2022;
originally announced March 2022.
-
Nudging Students Toward Better Software Engineering Behaviors
Authors:
Chris Brown,
Chris Parnin
Abstract:
Student experiences in large undergraduate Computer Science courses are increasingly impacted by automated systems. Bots, or agents of software automation, are useful for efficiently grading and generating feedback. Current efforts at automation in CS education focus on supporting instructional tasks, but do not address student struggles due to poor behaviors, such as procrastination. In this pape…
▽ More
Student experiences in large undergraduate Computer Science courses are increasingly impacted by automated systems. Bots, or agents of software automation, are useful for efficiently grading and generating feedback. Current efforts at automation in CS education focus on supporting instructional tasks, but do not address student struggles due to poor behaviors, such as procrastination. In this paper, we explore using bots to improve the software engineering behaviors of students using developer recommendation choice architectures, a framework incorporating behavioral science concepts in recommendations to improve the actions of programmers. We implemented this framework in class-bot, a novel system designed to nudge students to make better choices while working on programming assignments. This work presents a preliminary evaluation integrating this tool in an introductory programming course. Our results show that class-bot is beneficial for improving student development behaviors increasing code quality and productivity.
△ Less
Submitted 17 March, 2021;
originally announced March 2021.
-
SLACC: Simion-based Language Agnostic Code Clones
Authors:
George Mathew,
Chris Parnin,
Kathryn T Stolee
Abstract:
Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying represent…
▽ More
Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior.
In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior attempts. Since clusters are generated based on input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%).
This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.
△ Less
Submitted 7 February, 2020;
originally announced February 2020.
-
V2: Fast Detection of Configuration Drift in Python
Authors:
Eric Horton,
Chris Parnin
Abstract:
Code snippets are prevalent, but are hard to reuse because they often lack an accompanying environment configuration. Most are not actively maintained, allowing for drift between the most recent possible configuration and the code snippet as the snippet becomes out-of-date over time. Recent work has identified the problem of validating and detecting out-of-date code snippets as the most important…
▽ More
Code snippets are prevalent, but are hard to reuse because they often lack an accompanying environment configuration. Most are not actively maintained, allowing for drift between the most recent possible configuration and the code snippet as the snippet becomes out-of-date over time. Recent work has identified the problem of validating and detecting out-of-date code snippets as the most important consideration for code reuse. However, determining if a snippet is correct, but simply out-of-date, is a non-trivial task. In the best case, breaking changes are well documented, allowing developers to manually determine when a code snippet contains an out-of-date API usage. In the worst case, determining if and when a breaking change was made requires an exhaustive search through previous dependency versions.
We present V2, a strategy for determining if a code snippet is out-of-date by detecting discrete instances of configuration drift, where the snippet uses an API which has since undergone a breaking change. Each instance of configuration drift is classified by a failure encountered during validation and a configuration patch, consisting of dependency version changes, which fixes the underlying fault. V2 uses feedback-directed search to explore the possible configuration space for a code snippet, reducing the number of potential environment configurations that need to be validated. When run on a corpus of public Python snippets from prior research, V2 identifies 248 instances of configuration drift.
△ Less
Submitted 13 September, 2019;
originally announced September 2019.
-
Security Smells in Ansible and Chef Scripts: A Replication Study
Authors:
Akond Rahman,
Md. Rayhanur Rahman,
Chris Parnin,
Laurie Williams
Abstract:
Context: Security smells are recurring coding patterns that are indicative of security weakness, and require further inspection. As infrastructure as code (IaC) scripts, such as Ansible and Chef scripts, are used to provision cloud-based servers and systems at scale, security smells in IaC scripts could be used to enable malicious users to exploit vulnerabilities in the provisioned systems. Goal:…
▽ More
Context: Security smells are recurring coding patterns that are indicative of security weakness, and require further inspection. As infrastructure as code (IaC) scripts, such as Ansible and Chef scripts, are used to provision cloud-based servers and systems at scale, security smells in IaC scripts could be used to enable malicious users to exploit vulnerabilities in the provisioned systems. Goal: The goal of this paper is to help practitioners avoid insecure coding practices while develo** infrastructure as code scripts through an empirical study of security smells in Ansible and Chef scripts. Methodology: We conduct a replication study where we apply qualitative analysis with 1,956 IaC scripts to identify security smells for IaC scripts written in two languages: Ansible and Chef. We construct a static analysis tool called Security Linter for Ansible and Chef scripts (SLAC) to automatically identify security smells in 50,323 scripts collected from 813 open source software repositories. We also submit bug reports for 1,000 randomly-selected smell occurrences. Results: We identify two security smells not reported in prior work: missing default in case statement and no integrity check. By applying SLAC we identify 46,600 occurrences of security smells that include 7,849 hard-coded passwords. We observe agreement for 65 of the responded 94 bug reports, which suggests the relevance of security smells for Ansible and Chef scripts amongst practitioners. Conclusion: We observe security smells to be prevalent in Ansible and Chef scripts, similar to that of the Puppet scripts. We recommend practitioners to rigorously inspect the presence of the identified security smells in Ansible and Chef scripts using (i) code review, and (ii) static analysis tools.
△ Less
Submitted 20 June, 2020; v1 submitted 16 July, 2019;
originally announced July 2019.
-
DockerizeMe: Automatic Inference of Environment Dependencies for Python Code Snippets
Authors:
Eric Horton,
Chris Parnin
Abstract:
Platforms like Stack Overflow and GitHub's gist system promote the sharing of ideas and programming techniques via the distribution of code snippets designed to illustrate particular tasks. Python, a popular and fast-growing programming language, sees heavy use on both sites, with nearly one million questions asked on Stack Overflow and 400 thousand public gists on GitHub. Unfortunately, around 75…
▽ More
Platforms like Stack Overflow and GitHub's gist system promote the sharing of ideas and programming techniques via the distribution of code snippets designed to illustrate particular tasks. Python, a popular and fast-growing programming language, sees heavy use on both sites, with nearly one million questions asked on Stack Overflow and 400 thousand public gists on GitHub. Unfortunately, around 75% of the Python example code shared through these sites cannot be directly executed. When run in a clean environment, over 50% of public Python gists fail due to an import error for a missing library.
We present DockerizeMe, a technique for inferring the dependencies needed to execute a Python code snippet without import error. DockerizeMe starts with offline knowledge acquisition of the resources and dependencies for popular Python packages from the Python Package Index (PyPI). It then builds Docker specifications using a graph-based inference procedure. Our inference procedure resolves import errors in 892 out of nearly 3,000 gists from the Gistable dataset for which Gistable's baseline approach could not find and install all dependencies.
△ Less
Submitted 27 May, 2019;
originally announced May 2019.
-
It's Like Python But: Towards Supporting Transfer of Programming Language Knowledge
Authors:
Nischal Shrestha,
Titus Barik,
Chris Parnin
Abstract:
Expertise in programming traditionally assumes a binary novice-expert divide. Learning resources typically target programmers who are learning programming for the first time, or expert programmers for that language. An underrepresented, yet important group of programmers are those that are experienced in one programming language, but desire to author code in a different language. For this scenario…
▽ More
Expertise in programming traditionally assumes a binary novice-expert divide. Learning resources typically target programmers who are learning programming for the first time, or expert programmers for that language. An underrepresented, yet important group of programmers are those that are experienced in one programming language, but desire to author code in a different language. For this scenario, we postulate that an effective form of feedback is presented as a transfer from concepts in the first language to the second. Current programming environments do not support this form of feedback.
In this study, we apply the theory of learning transfer to teach a language that programmers are less familiar with--such as R--in terms of a programming language they already know--such as Python. We investigate learning transfer using a new tool called Transfer Tutor that presents explanations for R code in terms of the equivalent Python code. Our study found that participants leveraged learning transfer as a cognitive strategy, even when unprompted. Participants found Transfer Tutor to be useful across a number of affordances like step** through and highlighting facts that may have been missed or misunderstood. However, participants were reluctant to accept facts without code execution or sometimes had difficulty reading explanations that are verbose or complex. These results provide guidance for future designs and research directions that can support learning transfer when learning new programming languages.
△ Less
Submitted 27 August, 2018;
originally announced August 2018.
-
Gistable: Evaluating the Executability of Python Code Snippets on GitHub
Authors:
Eric Horton,
Chris Parnin
Abstract:
Software developers create and share code online to demonstrate programming language concepts and programming tasks. Code snippets can be a useful way to explain and demonstrate a programming concept, but may not always be directly executable. A code snippet can contain parse errors, or fail to execute if the environment contains unmet dependencies.
This paper presents an empirical analysis of t…
▽ More
Software developers create and share code online to demonstrate programming language concepts and programming tasks. Code snippets can be a useful way to explain and demonstrate a programming concept, but may not always be directly executable. A code snippet can contain parse errors, or fail to execute if the environment contains unmet dependencies.
This paper presents an empirical analysis of the executable status of Python code snippets shared through the GitHub gist system, and the ability of developers familiar with software configuration to correctly configure and run them. We find that 75.6% of gists require non-trivial configuration to overcome missing dependencies, configuration files, reliance on a specific operating system, or some other environment configuration. Our study also suggests the natural assumption developers make about resource names when resolving configuration errors is correct less than half the time.
We also present Gistable, a database and extensible framework built on GitHub's gist system, which provides executable code snippets to enable reproducible studies in software engineering. Gistable contains 10,259 code snippets, approximately 5,000 with a Dockerfile to configure and execute them without import error. Gistable is publicly available at https://github.com/gistable/gistable.
△ Less
Submitted 14 August, 2018;
originally announced August 2018.
-
Evaluating How Developers Use General-Purpose Web-Search for Code Retrieval
Authors:
Md Masudur Rahman,
Jed Barson,
Sydney Paul,
Joshua Kayan,
Federico Andres Lois,
Sebastian Fernandez Quezada,
Christopher Parnin,
Kathryn T. Stolee,
Baishakhi Ray
Abstract:
Search is an integral part of a software development process. Developers often use search engines to look for information during development, including reusable code snippets, API understanding, and reference examples. Developers tend to prefer general-purpose search engines like Google, which are often not optimized for code related documents and use search strategies and ranking techniques that…
▽ More
Search is an integral part of a software development process. Developers often use search engines to look for information during development, including reusable code snippets, API understanding, and reference examples. Developers tend to prefer general-purpose search engines like Google, which are often not optimized for code related documents and use search strategies and ranking techniques that are more optimized for generic, non-code related information. In this paper, we explore whether a general purpose search engine like Google is an optimal choice for code-related searches. In particular, we investigate whether the performance of searching with Google varies for code vs. non-code related searches. To analyze this, we collect search logs from 310 developers that contains nearly 150,000 search queries from Google and the associated result clicks. To differentiate between code-related searches and non-code related searches, we build a model which identifies the code intent of queries. Leveraging this model, we build an automatic classifier that detects a code and non-code related query. We confirm the effectiveness of the classifier on manually annotated queries where the classifier achieves a precision of 87%, a recall of 86%, and an F1-score of 87%. We apply this classifier to automatically annotate all the queries in the dataset. Analyzing this dataset, we observe that code related searching often requires more effort (e.g., time, result clicks, and query modifications) than general non-code search, which indicates code search performance with a general search engine is less effective.
△ Less
Submitted 22 March, 2018;
originally announced March 2018.
-
Code Drones
Authors:
Mithun P. Acharya,
Chris Parnin,
Nicholas A. Kraft,
Aldo Dagnino,
Xiao Qu
Abstract:
We propose and explore a new paradigm called Code Drones in which every software artifact such as a class is an intelligent and socially active entity. In this paradigm, humanized artifacts take the lead and choreograph (socially, in collaboration with other intelligent software artifacts and humans) automated software engineering solutions to a myriad of development and maintenance challenges, in…
▽ More
We propose and explore a new paradigm called Code Drones in which every software artifact such as a class is an intelligent and socially active entity. In this paradigm, humanized artifacts take the lead and choreograph (socially, in collaboration with other intelligent software artifacts and humans) automated software engineering solutions to a myriad of development and maintenance challenges, including API migration, reuse, documentation, testing, patching, and refactoring. We discuss the implications of having social and intelligent/cognitive software artifacts that guide their own self-improvement.
△ Less
Submitted 16 February, 2016; v1 submitted 22 November, 2014;
originally announced November 2014.