Search | arXiv e-print repository

DocChecker: Bootstrap** Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies

Authors: Anh T. V. Dau, ** L. C. Guo, Nghi D. Q. Bui

Abstract: Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current… ▽ More Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current methods rely primarily on heuristic rules. In contrast, this paper presents DocChecker, a tool powered by deep learning. DocChecker is adept at identifying inconsistencies between code and comments, and it can also generate synthetic comments. This capability enables the tool to detect and correct instances where comments do not accurately reflect their corresponding code segments. We demonstrate the effectiveness of DocChecker using the Just-In-Time and CodeXGlue datasets in different settings. Particularly, DocChecker achieves a new State-of-the-art result of 72.3% accuracy on the Inconsistency Code-Comment Detection (ICCD) task and 33.64 BLEU-4 on the code summarization task against other Large Language Models (LLMs), even surpassing GPT 3.5 and CodeLlama. DocChecker is accessible for use and evaluation. It can be found on our GitHub https://github.com/FSoft-AI4Code/DocChecker and as an Online Tool http://4.193.50.237:5000/. For a more comprehensive understanding of its functionality, a demonstration video is available on YouTube https://youtu.be/FqnPmd531xw. △ Less

Submitted 2 February, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

Journal ref: EACL 2024 - Demonstration track

arXiv:2305.06156 [pdf, other]

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, ** Guo, Nghi D. Q. Bui

Abstract: We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text… ▽ More We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models. △ Less

Submitted 30 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at EMNLP 2023, Long Findings

arXiv:2305.01384 [pdf, other]

Class based Influence Functions for Error Detection

Authors: Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran, Dang Huu-Tien, Hieu Ngoc Nguyen, Anh T. V. Dau, Nghi D. Q. Bui

Abstract: Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information… ▽ More Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost. △ Less

Submitted 2 May, 2023; originally announced May 2023.

Comments: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 2023

arXiv:2205.13022 [pdf, ps, other]

doi 10.1145/3551349.3561168

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Authors: Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

Abstract: Despite the recent trend of develo** and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the… ▽ More Despite the recent trend of develo** and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of develo** better neural source code models from a data-centric perspective, which is a key driver for develo** useful source code models in practice. △ Less

Submitted 2 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: The 37th IEEE/ACM International Conference on Automated Software Engineering

Showing 1–4 of 4 results for author: Dau, A T V