Showing 1–23 of 23 results for author: Bui, N D Q

  1. arXiv:2406.11927  [pdf, other]

    cs.SE cs.AI

    REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

    Authors: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

    Abstract: The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafte…

    Submitted 19 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.11912  [pdf, other]

    cs.SE cs.AI

    AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

    Authors: Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, Nghi D. Q. Bui

    Abstract: Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to di…

    Submitted 16 June, 2024; originally announced June 2024.

  3. arXiv:2403.14592  [pdf, other]

    cs.SE cs.AI cs.HC

    Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals

    Authors: Khanh Nghiem, Anh Minh Nguyen, Nghi D. Q. Bui

    Abstract: As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open q…

    Submitted 21 March, 2024; originally announced March 2024.

  4. arXiv:2403.06095  [pdf, other]

    cs.SE cs.AI

    RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion

    Authors: Huy N. Phan, Hoang N. Phan, Tien N. Nguyen, Nghi D. Q. Bui

    Abstract: Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed…

    Submitted 16 March, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

    Comments: Under Review

  5. arXiv:2311.03366  [pdf, other]

    cs.SE cs.AI cs.LG

    Functional Overlap Reranking for Neural Code Generation

    Authors: Hung Quoc To, Minh Huynh Nguyen, Nghi D. Q. Bui

    Abstract: Code Large Language Models (CodeLLMs) have ushered in a new era in code generation advancements. However, selecting the best code solutions from all possible CodeLLM outputs remains a challenge. Previous methods often overlooked the intricate functional similarities and interactions between solution clusters. We introduce SRank, a novel reranking strategy for selecting the best solutions from code…

    Submitted 22 June, 2024; v1 submitted 16 October, 2023; originally announced November 2023.

    Comments: EMNLP 2024, Long Findings

  6. arXiv:2306.06347  [pdf, other]

    cs.SE

    DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies

    Authors: Anh T. V. Dau, Jin L. C. Guo, Nghi D. Q. Bui

    Abstract: Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current…

    Submitted 2 February, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

    Journal ref: EACL 2024 - Demonstration track

  7. arXiv:2306.00029  [pdf, other]

    cs.SE cs.AI

    CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

    Authors: Nghi D. Q. Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, Steven C. H. Hoi

    Abstract: Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Ongoing work - Draft Preview

  8. arXiv:2305.07922  [pdf, other]

    cs.CL cs.LG cs.PL

    CodeT5+: Open Code Large Language Models for Code Understanding and Generation

    Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi

    Abstract: Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limi…

    Submitted 20 May, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: 26 pages, preprint

  9. arXiv:2305.06156  [pdf, other]

    cs.CL cs.AI cs.PL cs.SE

    The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

    Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

    Abstract: We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text…

    Submitted 30 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023, Long Findings

  10. arXiv:2305.01384  [pdf, other]

    cs.CL cs.LG

    Class based Influence Functions for Error Detection

    Authors: Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran, Dang Huu-Tien, Hieu Ngoc Nguyen, Anh T. V. Dau, Nghi D. Q. Bui

    Abstract: Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 2023

  11. arXiv:2304.01228  [pdf, other]

    cs.CL cs.AI

    Better Language Models of Code through Self-Improvement

    Authors: Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen

    Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to improve this issue by proposing a simple data augmentation framework. Our framework utilizes knowledge gained d…

    Submitted 9 May, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: Accepted to Findings, ACL 2023

  12. arXiv:2211.14875  [pdf, other]

    cs.SE cs.CL

    Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5

    Authors: Nghi D. Q. Bui, Yue Wang, Steven Hoi

    Abstract: Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have been proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus only on either one of them or approach them in a stage-wise manner, ignoring the mutual benefits between them. In t…

    Submitted 22 December, 2022; v1 submitted 27 November, 2022; originally announced November 2022.

    Comments: Accepted to EMNLP 2022 Findings Track

  13. arXiv:2205.15479  [pdf, other]

    cs.SE cs.AI cs.PL

    HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations

    Authors: Minh Huynh Nguyen, Nghi D. Q. Bui, Truong Son Hy, Long Tran-Thanh, Tien N. Nguyen

    Abstract: We propose a novel method for code summarization utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs effectively capture essential code features at lexical, syntactic, and semantic levels by abstracting coarse-grained code elements and incorporating fine-grained program elements in a hierarchical structure. Our HierarchyNet method processes each layer…

    Submitted 9 May, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

  14. arXiv:2205.13022  [pdf, ps, other]

    cs.SE cs.AI cs.PL

    Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

    Authors: Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

    Abstract: Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the…

    Submitted 2 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: The 37th IEEE/ACM International Conference on Automated Software Engineering

  15. arXiv:2112.11226   

    cs.LG cs.AI cs.PL cs.SE

    Energy-bounded Learning for Robust Models of Code

    Authors: Nghi D. Q. Bui, Yijun Yu

    Abstract: In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed, however, existing vanilla learning techniques have a major limitation…

    Submitted 9 May, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: There are some flaws in our experiments, we would like to fix it and publish a fixed version again in the very near future

  16. arXiv:2012.07023  [pdf, other]

    cs.SE cs.AI cs.LG cs.PL

    InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

    Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

    Abstract: Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other t…

    Submitted 15 December, 2020; v1 submitted 13 December, 2020; originally announced December 2020.

    Comments: Accepted at ICSE 2021

  17. arXiv:2009.09777  [pdf, other]

    cs.SE cs.AI cs.PL

    TreeCaps: Tree-Based Capsule Networks for Source Code Processing

    Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

    Abstract: Recently program learning techniques have been proposed to process source code based on syntactical structures (e.g., Abstract Syntax Trees) and/or semantic information (e.g., Dependency Graphs). Although graphs may be better at capturing various viewpoints of code semantics than trees, constructing graph inputs from code needs static code semantic analysis that may not be accurate and introduces…

    Submitted 14 December, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

    Comments: Accepted at AAAI 2021

  18. arXiv:2009.02731  [pdf, other]

    cs.SE cs.AI cs.LG cs.PL

    Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

    Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

    Abstract: We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representation of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in…

    Submitted 23 May, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

    Comments: Accepted at SIGIR 2021

  19. On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

    Authors: Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour

    Abstract: With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they gen…

    Submitted 18 March, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

    Comments: Information and Software Technology, IST Journal 2021, Elsevier. Related to arXiv:2004.07313

  20. arXiv:1910.12306  [pdf, ps, other]

    cs.LG cs.SE stat.ML

    TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing

    Authors: Vinoj Jayasundara, Nghi Duy Quoc Bui, Lingxiao Jiang, David Lo

    Abstract: Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to red…

    Submitted 27 October, 2019; originally announced October 2019.

    Comments: in NeurIPS Workshop on ML for Systems, 2019

  21. arXiv:1906.03835  [pdf, other]

    cs.LG cs.SE stat.ML

    SAR: Learning Cross-Language API Mappings with Little Knowledge

    Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

    Abstract: To save manual effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying th…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: Accepted at the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)

  22. arXiv:1803.04715  [pdf, other]

    cs.LG cs.CL cs.SE

    Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code

    Authors: Nghi D. Q. Bui, Lingxiao Jiang

    Abstract: Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach…

    Submitted 13 March, 2018; originally announced March 2018.

    Comments: Accepted at ICSE'18

  23. arXiv:1710.06159  [pdf, other]

    cs.LG

    Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks

    Authors: Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu

    Abstract: Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual…

    Submitted 29 November, 2017; v1 submitted 17 October, 2017; originally announced October 2017.

    Comments: Accepted at NL4SE Workshop, AAAI'18