An improved Simhash algorithm based malicious mirror website detection method

Guangxuan Chen, Guangxiao Chen, Di Wu, Qiang Liu, Lei Zhang, Xiaoshi Fan

Journal of Physics: Conference Series (2021)

Abstract

The Internet contains a large number of similar or even identical webpages. These webpages waste network resources: they consume storage space, slow down web search, and degrade the user experience. In addition, some malicious mirror websites become tools for criminals to carry out illegal activities such as phishing attacks. In this paper, the authors analyze the mainstream text similarity detection algorithms and webpage deduplication algorithms, and propose an improved webpage deduplication algorithm based on Simhash. The algorithm maps each text collection to a Simhash fingerprint for storage and computes the similarity of two fingerprints from their Hamming distance, which gives the similarity of the corresponding webpages. Experiments show that the proposed algorithm achieves higher accuracy and recall, and is better suited to identifying and detecting malicious mirror websites.
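
The fingerprint-and-distance pipeline described in the abstract can be illustrated with a minimal sketch of the standard Simhash scheme. This is not the authors' improved algorithm, whose details are not given in the abstract. The tokenizer, the term-frequency weights, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the function names `simhash`, `hamming_distance`, and `similarity` are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint of `text`.

    Assumptions (not from the paper): word tokens, term-frequency weights,
    MD5 as the token hash. `bits` must be a multiple of 8 and at most 128.
    """
    tokens = re.findall(r"\w+", text.lower())
    weights = Counter(tokens)  # token -> term frequency, used as its weight
    totals = [0] * bits  # one signed accumulator per fingerprint bit
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the token hash has a 1 bit, subtract otherwise.
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:  # a positive total sets bit i of the fingerprint
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Map a Hamming distance to a similarity score in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / bits


if __name__ == "__main__":
    page_a = "Welcome to Example Bank. Please log in to view your account."
    page_b = "Welcome to Example Bank. Please sign in to view your account."
    fp_a, fp_b = simhash(page_a), simhash(page_b)
    print(hamming_distance(fp_a, fp_b), similarity(fp_a, fp_b))
```

In a deduplication setting, two pages would typically be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold value is a tuning decision that the abstract does not specify.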