CodeS: Towards Code Model Generalization Under Distribution Shift

Hu, Qiang; Guo, Yuejun; Xie, Xiaofei; Cordy, Maxime; Ma, Lei; Papadakis, Mike; Traon, Yves Le

arXiv:2206.05480 (cs)

[Submitted on 11 Jun 2022 (v1), last revised 4 Feb 2023 (this version, v2)]

Abstract: Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models because it causes unexpected accuracy degradation. Although DL has become a driving force for large-scale source code analysis in the big code era, limited progress has been made on distribution shift analysis and benchmarking for source code tasks. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports two programming languages (Java and Python) and five shift types (task, programmer, time-stamp, token, and concrete syntax tree). Extensive experiments based on CodeS reveal that 1) out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, 2) all code classification models suffer from distribution shifts, 3) representation-based shifts have a higher impact on the model than the other shift types, and 4) pre-trained bimodal models are relatively more resistant to distribution shifts.

Comments:	accepted by ICSE'23-NIER
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2206.05480 [cs.SE]
	(or arXiv:2206.05480v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2206.05480

Submission history

From: Qiang Hu
[v1] Sat, 11 Jun 2022 09:32:29 UTC (2,952 KB)
[v2] Sat, 4 Feb 2023 09:43:17 UTC (98 KB)
