Pornographic and Gambling Domain Recognition Method based on Long Distance Spare Multi-Head Self-Attention Vision-and-Language Model

Luheng Wang, Zhaoxin Zhang, Feng Ye

2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA) (2022)

Pornographic and gambling websites use a variety of camouflage techniques that make them increasingly difficult to identify and detect, which seriously endangers the healthy development of the Internet. Most traditional methods cannot handle complex situations such as the hijacking of normal domain names. Moreover, Transformer-based deep learning models, which have achieved good results on various tasks in recent years, have a very large number of parameters and are not well suited to long-text tasks. Pornographic and gambling pages are observed to share two features: text and images are their main content, and the text is long. To address the problem under these conditions, this paper proposes a method for identifying pornographic and gambling domain names using a vision-and-language multimodal model based on Long Distance Spare Multi-Head Self-Attention. The method consists of the following steps: introducing Long Distance Spare Multi-Head Self-Attention into the multimodal model; using the text characteristics of web pages to delete a certain percentage of the subtrees of text elements in the HTML DOM that do not contain a certain number of characters; filtering out the header, footer, copyright and form content and deleting all tags; and fusing the outputs from multiple points of the model in the downstream classification task. The accuracy, recall and F1 score for identifying gambling and pornographic domain names all reach 95%.
Transformers, Multimodal model, Long Distance Spare Multi-Head Self-Attention
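
The HTML preprocessing steps named in the abstract (filtering out header, footer, copyright and form content, deleting all tags, and discarding text elements that are too short) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `SKIPPED_TAGS`, `MIN_CHARS`, `PageTextExtractor` and `extract_text` are hypothetical, the threshold value is arbitrary, the copyright check is a simple keyword heuristic, and the abstract's "certain percentage" pruning rule is not reproduced because its details are not given.

```python
from html.parser import HTMLParser

# Hypothetical settings for illustration; the paper's abstract does not specify them.
SKIPPED_TAGS = {"header", "footer", "form", "script", "style"}
MIN_CHARS = 4  # assumed minimum length for a text fragment to be kept


class PageTextExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees."""

    def __init__(self, min_chars=MIN_CHARS):
        super().__init__(convert_charrefs=True)
        self.min_chars = min_chars
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())  # normalise whitespace
        if len(text) < self.min_chars:
            return  # drop fragments below the assumed size threshold
        if "copyright" in text.lower() or "©" in text:
            return  # crude copyright-notice heuristic (an assumption)
        self.chunks.append(text)


def extract_text(html, min_chars=MIN_CHARS):
    """Return tag-free page text with boilerplate regions removed."""
    parser = PageTextExtractor(min_chars)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_text(page_html)` on a fetched page would yield a single string of body text that could then be tokenised for the language branch of a multimodal model; in practice the tag set and length threshold would need tuning on real pages.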