Evaluating Code Summarization with Improved Correlation with Human Assessment

2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 2021

Abstract
Code summarization aims to automatically generate functionality descriptions of code snippets. Faithful metrics are needed to measure the degree to which machine-generated summaries capture the semantics of the code snippets. The metrics most commonly used in code summarization, such as BLEU-4, METEOR, and ROUGE-L, originate from machine translation and text summarization, and have repeatedly been found to be inconsistent with human assessment. In this paper, we propose a novel evaluation metric, Consensus-based Code Summarization Evaluation (CCSE), which assigns different semantic weights to the n-grams of the summary. We also provide an algorithm that matches n-gram pairs from the reference and the candidate based on their similarities. To validate the effectiveness of the proposed metric, we collect summary pairs from two public Java datasets and calculate the correlation coefficients between CCSE and the human evaluations. The experimental results show that, compared with BLEU-4, METEOR, and ROUGE-L, CCSE is more consistent with the scores assessed by human developers.
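The validation protocol described above, scoring candidate summaries against references with an automatic metric and correlating those scores with human ratings, can be sketched as follows. This is a minimal illustration only: it uses sentence-level BLEU-4 as a stand-in metric (CCSE itself is not reproduced here), and the example summary pairs and human ratings are hypothetical.

```python
"""Sketch: correlate automatic metric scores with human assessment.
BLEU-4 stands in for any metric; data below is hypothetical."""

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical (reference summary, candidate summary, human rating) triples.
pairs = [
    ("returns the maximum value in the list", "return max value of list", 4.0),
    ("opens a file and reads all lines", "writes data to the output stream", 1.5),
    ("checks whether the given string is empty", "check if string is empty", 4.5),
]

smooth = SmoothingFunction().method1
metric_scores, human_scores = [], []
for reference, candidate, rating in pairs:
    # Sentence-level BLEU-4 with smoothing, as a stand-in automatic metric.
    score = sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smooth,
    )
    metric_scores.append(score)
    human_scores.append(rating)

# Rank correlation between the automatic metric and human assessment.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A metric that better reflects summary semantics would be expected to yield a higher rank correlation with the human ratings than the baselines, which is the comparison the paper reports for CCSE against BLEU-4, METEOR, and ROUGE-L.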
Keywords
Source code summarization,Deep neural networks,Automatic evaluation