Showing 1–16 of 16 results for author: Zan, D

Searching in archive cs.
  1. arXiv:2406.10130

    cs.CL

    The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

    Authors: Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, Tsung-Yi Ho

    Abstract: Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing me…

    Submitted 14 June, 2024; originally announced June 2024.

  2. arXiv:2406.07003

    cs.SE

    GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

    Authors: Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, Qianxiang Wang

    Abstract: The performance of repository-level code completion depends upon the effective leverage of both general and repository-specific knowledge. Despite the impressive capability of code LLMs in general code completion tasks, they often exhibit less satisfactory performance on repository-level completion due to the lack of repository-specific knowledge in these LLMs. To address this problem, we propose…

    Submitted 11 June, 2024; originally announced June 2024.

  3. arXiv:2406.01304

    cs.CL cs.AI cs.SE

    CodeR: Issue Resolving with Multi-Agent and Task Graphs

    Authors: Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, Qianxiang Wang

    Abstract: GitHub issue resolving recently has attracted significant attention from academia and industry. SWE-bench is proposed to measure the performance in resolving issues. In this paper, we propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within code Repository. On SWE-bench lite, CodeR is able to solve 28.33% of issue…

    Submitted 10 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: https://github.com/NL2Code/CodeR

  4. arXiv:2403.16443

    cs.CL cs.AI cs.SE

    CodeS: Natural Language to Code Repository via Multi-Layer Sketch

    Authors: Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan, Zhiguang Yang, Yongji Wang, Qianxiang Wang, Lizhen Cui

    Abstract: The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple y…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: https://github.com/NL2Code/CodeS

  5. arXiv:2401.14242

    cs.CL

    Improving Natural Language Capability of Code Large Language Model

    Authors: Wei Li, Daoguang Zan, Bei Guan, Ailun Yu, Xiaolin Chen, Yongji Wang

    Abstract: Code large language models (Code LLMs) have demonstrated remarkable performance in code generation. Nonetheless, most existing works focus on boosting code LLMs from the perspective of programming capabilities, while their natural language capabilities receive less attention. To fill this gap, we thus propose a novel framework, comprising two modules: AttentionExtractor, which is responsible for e…

    Submitted 25 January, 2024; originally announced January 2024.

  6. arXiv:2401.08984

    cs.LG cs.AI cs.CR

    A GAN-based data poisoning framework against anomaly detection in vertical federated learning

    Authors: Xiaolin Chen, Daoguang Zan, Wei Li, Bei Guan, Yongji Wang

    Abstract: In vertical federated learning (VFL), commercial entities collaboratively train a model while preserving data privacy. However, a malicious participant's poisoning attack may degrade the performance of this collaborative model. The main challenge in achieving the poisoning attack is the absence of access to the server-side top model, leaving the malicious participant without a clear target model.…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: 6 pages, 7 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  7. arXiv:2308.16824

    cs.CL cs.AI cs.PL cs.SE

    Can Programming Languages Boost Each Other via Instruction Tuning?

    Authors: Daoguang Zan, Ailun Yu, Bo Shen, Jiaxin Zhang, Taihong Chen, Bing Geng, Bei Chen, Jichuan Ji, Yafen Yao, Yongji Wang, Qianxiang Wang

    Abstract: When human programmers have mastered a programming language, it would be easier when they learn a new programming language. In this report, we focus on exploring whether programming languages can boost each other during the instruction fine-tuning phase of code large language models. We conduct extensive experiments of 8 popular programming languages (Python, JavaScript, TypeScript, C, C++, Java,…

    Submitted 3 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: Work in progress

  8. arXiv:2307.15370

    cs.SE

    Private-Library-Oriented Code Generation with Large Language Models

    Authors: Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, Yongji Wang

    Abstract: Large language models (LLMs), such as Codex and GPT-4, have recently showcased their remarkable code generation abilities, facilitating a significant boost in coding efficiency. This paper will delve into utilizing LLMs for code generation in private libraries, as they are widely employed in everyday programming. Despite their remarkable capabilities, generating such private APIs poses a formidabl…

    Submitted 28 July, 2023; originally announced July 2023.

  9. arXiv:2307.14936

    cs.CL cs.AI cs.LG cs.PL cs.SE

    PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback

    Authors: Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang

    Abstract: Large Language Models for Code (Code LLM) are flourishing. New and powerful models are released on a weekly basis, demonstrating remarkable performance on the code generation task. Various approaches have been proposed to boost the code generation performance of pre-trained Code LLMs, such as supervised fine-tuning, instruction tuning, reinforcement learning, etc. In this paper, we propose a novel…

    Submitted 27 July, 2023; originally announced July 2023.

    Comments: Preprint

  10. arXiv:2305.15377

    cs.CL

    Uncovering and Quantifying Social Biases in Code Generation

    Authors: Yan Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang Lou, Pin-Yu Chen, Tsung-Yi Ho

    Abstract: With the popularity of automatic code generation tools, such as Copilot, the study of the potential hazards of these tools is gaining importance. In this work, we explore the social bias problem in pre-trained code generation models. We propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. To quantify the severity of social biases in gen…

    Submitted 24 May, 2023; originally announced May 2023.

  11. arXiv:2304.07506

    cs.IR cs.AI

    Hierarchical and Contrastive Representation Learning for Knowledge-aware Recommendation

    Authors: Bingchao Wu, Yangyuxuan Kang, Daoguang Zan, Bei Guan, Yongji Wang

    Abstract: Incorporating knowledge graph into recommendation is an effective way to alleviate data sparsity. Most existing knowledge-aware methods usually perform recursive embedding propagation by enumerating graph neighbors. However, the number of nodes' neighbors grows exponentially as the hop number increases, forcing the nodes to be aware of vast neighbors under this recursive propagation for distilling…

    Submitted 15 April, 2023; originally announced April 2023.

    Comments: Accepted by ICME 2023

  12. arXiv:2303.12570

    cs.CL cs.AI cs.PL cs.SE

    RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

    Authors: Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen

    Abstract: The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion…

    Submitted 20 October, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: Accepted by EMNLP 2023 main conference

  13. arXiv:2212.09420

    cs.SE cs.AI cs.CL cs.PL

    Large Language Models Meet NL2Code: A Survey

    Authors: Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, Jian-Guang Lou

    Abstract: The task of generating code from a natural language description, or NL2Code, is considered a pressing and significant challenge in code intelligence. Thanks to the rapid development of pre-training techniques, surging large language models are being proposed for code, sparking the advances in NL2Code. To facilitate further research and applications in this field, in this paper, we present a compre…

    Submitted 8 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted to the main conference of ACL 2023 (long paper)

  14. arXiv:2210.17236

    cs.PL cs.CL cs.SE

    When Language Model Meets Private Library

    Authors: Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, Jian-Guang Lou

    Abstract: With the rapid development of pre-training techniques, a number of language models have been pre-trained on large-scale code corpora and perform well in code generation. In this paper, we investigate how to equip pre-trained language models with the ability of code generation for private libraries. In practice, it is common for programmers to write code using private libraries. However, this is a…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 Findings

  15. arXiv:2207.10397

    cs.CL cs.AI cs.PL cs.SE

    CodeT: Code Generation with Generated Tests

    Authors: Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen

    Abstract: The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select the most appropriate solution from the multiple samples generated by the pre-trained language models. A natural way to evaluate the quality and correctness of a…

    Submitted 23 November, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

  16. arXiv:2206.06888

    cs.SE cs.CL cs.PL

    CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation

    Authors: Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, Jian-Guang Lou

    Abstract: Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large-scale unlabelled code corpora and perform well in code generation. In this paper, we…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: Accepted for publication at IJCAI-ECAI 2022