Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
CoRR (2023)
Abstract
Source code clone detection is the task of finding code fragments that have
the same or similar functionality, but may differ in syntax or structure. This
task is important for software maintenance, reuse, and quality assurance (Roy
et al. 2009). However, code clone detection is challenging, as source code can
be written in different languages, domains, and styles. In this paper, we argue
that source code is inherently a graph, not a sequence, and that graph-based
methods are more suitable for code clone detection than sequence-based methods.
We compare the performance of two state-of-the-art models: CodeBERT (Feng et
al. 2020), a sequence-based model, and CodeGraph (Yu et al. 2023), a
graph-based model, on two benchmark datasets: BCB (Svajlenko et al. 2014) and
PoolC (PoolC n.d.). We show that CodeGraph outperforms CodeBERT on both
datasets, especially on cross-lingual code clones. To the best of our
knowledge, this is the first work to demonstrate the superiority of graph-based
methods over sequence-based methods on cross-lingual code clone detection.
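
To make the sequence-versus-graph distinction concrete, the following is a minimal illustrative sketch and not the pipeline of CodeBERT or CodeGraph. It assumes Python's standard `tokenize` and `ast` modules as stand-ins for a tokenizer and a graph extractor, and the helper names (`token_sequence`, `ast_edges`, `structural_signature`) are hypothetical. It uses only the abstract syntax tree, whereas graph-based models typically add further edges such as data flow or control flow. Two functionally identical snippets with different identifiers differ as token sequences but share the same structural edges:

```python
import ast
import io
import tokenize


def token_sequence(source: str) -> list:
    """Sequence view: the flat list of lexical tokens, layout tokens dropped."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in skip]


def ast_edges(source: str) -> list:
    """Graph view: (parent type, child type) edges of the abstract syntax tree."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


def structural_signature(source: str) -> list:
    """Sorted multiset of edge labels; identifier names play no role."""
    return sorted(f"{p}->{c}" for p, c in ast_edges(source))


# Hypothetical clone pair: same behavior, different identifiers.
snippet_a = "def add(a, b):\n    return a + b\n"
snippet_b = "def total(x, y):\n    return x + y\n"

same_tokens = token_sequence(snippet_a) == token_sequence(snippet_b)
same_structure = structural_signature(snippet_a) == structural_signature(snippet_b)
print("identical token sequences:", same_tokens)
print("identical structural signatures:", same_structure)
```

This toy comparison only covers renamed identifiers within one language. Cross-lingual clones would require a shared graph representation across languages, which this sketch does not attempt.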