ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
arxiv(2024)
摘要
Retrieval-based code question answering seeks to match user queries in
natural language to relevant code snippets. Previous approaches typically rely
on pretraining models using crafted bi-modal and uni-modal datasets to align
text and code representations. In this paper, we introduce ProCQA, a
large-scale programming question answering dataset extracted from the
StackOverflow community, offering naturally structured mixed-modal QA pairs. To
validate its effectiveness, we propose a modality-agnostic contrastive
pre-training approach to improve the alignment of text and code representations
of current code language models. Compared to previous models that primarily
employ bimodal and unimodal pairs extracted from CodeSearchNet for
pre-training, our model exhibits significant performance improvements across a
wide range of code retrieval benchmarks.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要