"Repository Embedding via Heterogeneous Graph Adversarial Contrastive Learning"

Yiyue Qian, Yiming Zhang, Qianlong Wen, Yanfang Ye, Chuxu Zhang

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)

Abstract

Driven by the exponential increase of software and the advent of the pull-based development system Git, a large amount of open-source software has emerged on various social coding platforms. GitHub, as the largest platform, not only attracts developers and researchers to contribute legitimate software and research-related source code but has also become a popular platform for an increasing number of cybercriminals to perform continuous cyberattacks. Hence, some tools have recently been developed to learn representations of repositories on GitHub for various related applications (e.g., malicious repository detection). However, most of them focus only on code content and ignore the rich relational data among repositories. In addition, they usually require substantial resources to obtain sufficient labeled data for model training, while ignoring the readily available unlabeled data. To this end, we propose a novel model, Rep2Vec, which integrates the code content, the structural relations, and the unlabeled data to learn repository representations. First, to comprehensively model the repository data, we build a repository heterogeneous graph (Rep-HG), which is encoded by a graph neural network. Afterwards, to fully exploit the unlabeled data in Rep-HG, we introduce adversarial attacks to generate more challenging contrastive pairs for the contrastive learning module, which trains the encoder in the node view and the meta-path view simultaneously. To alleviate the workload of the encoder against attacks, we further design a dual-stream contrastive learning module that integrates contrastive learning on the adversarial graph and on the original graph. Finally, the pre-trained encoder is fine-tuned on the downstream task and further enhanced by a knowledge distillation module. Extensive experiments on a dataset collected from GitHub demonstrate the effectiveness of Rep2Vec in comparison with state-of-the-art methods on multiple repository tasks.
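To make the idea of adversarially generated contrastive pairs and a dual-stream objective more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the Rep2Vec implementation: the abstract does not specify the loss, the attack, or the weighting, so the InfoNCE loss, the one-step sign-gradient perturbation applied directly to embeddings (rather than to the graph), and the names `info_nce`, `adversarial_view`, `dual_stream_loss`, `epsilon`, and `weight` are all hypothetical choices made here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchor, positive, temperature=0.5):
    """InfoNCE loss (assumed here; the abstract does not name the loss).

    Row i of `anchor` is paired with row i of `positive`;
    every other row of `positive` serves as a negative.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of scalar f at x (keeps the sketch NumPy-only)."""
    x = x.astype(float).copy()
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + h
        up = f(x)
        x[idx] = orig - h
        down = f(x)
        x[idx] = orig
        grad[idx] = (up - down) / (2 * h)
    return grad


def adversarial_view(anchor, positive, epsilon=0.1, temperature=0.5):
    """Hypothetical attack: one sign-gradient step on the positive embeddings
    in the direction that increases the contrastive loss."""
    grad = numerical_grad(lambda p: info_nce(anchor, p, temperature), positive)
    return positive + epsilon * np.sign(grad)


def dual_stream_loss(anchor, positive, epsilon=0.1, weight=0.5, temperature=0.5):
    """Weighted sum of the loss on the original pairs and on the adversarial
    pairs. The convex weighting is an assumption made for illustration."""
    clean = info_nce(anchor, positive, temperature)
    adv = info_nce(anchor, adversarial_view(anchor, positive, epsilon, temperature),
                   temperature)
    return weight * clean + (1.0 - weight) * adv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_node = rng.normal(size=(6, 4))                 # stand-in for node-view embeddings
    z_meta = z_node + 0.1 * rng.normal(size=(6, 4))  # stand-in for meta-path-view embeddings
    print("clean loss:", info_nce(z_node, z_meta))
    print("dual-stream loss:", dual_stream_loss(z_node, z_meta))
```

In this sketch the two arrays stand in for the node-view and meta-path-view embeddings of the same repositories. A real encoder would compute gradients by automatic differentiation and would perturb the graph itself; the finite-difference gradient is used only to keep the example dependency-free.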

Keywords

rep2vec,repository,learning
