Showing 1–4 of 4 results for author: Dau, A T V

  1. arXiv:2306.06347  [pdf, other]

    cs.SE

    DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies

    Authors: Anh T. V. Dau, Jin L. C. Guo, Nghi D. Q. Bui

    Abstract: Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current…

    Submitted 2 February, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

    Journal ref: EACL 2024 - Demonstration track

  2. arXiv:2305.06156  [pdf, other]

    cs.CL cs.AI cs.PL cs.SE

    The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

    Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

    Abstract: We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text…

    Submitted 30 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023, Long Findings

  3. arXiv:2305.01384  [pdf, other]

    cs.CL cs.LG

    Class based Influence Functions for Error Detection

    Authors: Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran, Dang Huu-Tien, Hieu Ngoc Nguyen, Anh T. V. Dau, Nghi D. Q. Bui

    Abstract: Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 2023

  4. arXiv:2205.13022  [pdf, ps, other]

    cs.SE cs.AI cs.PL

    Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

    Authors: Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

    Abstract: Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the…

    Submitted 2 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: The 37th IEEE/ACM International Conference on Automated Software Engineering (ASE 2022)