Highly accurate phishing URL detection based on machine learning

Journal of Ambient Intelligence and Humanized Computing(2022)

引用 1|浏览0
暂无评分
摘要
Phishing is a persistent and major threat on the internet that is growing steadily and dangerously. It is a type of cyber-attack, in which phisher mimics a legitimate website page to harvest victim’s sensitive information, such as usernames, emails, passwords and bank or credit card details. To prevent such attacks, several phishing detection techniques have been proposed such as AI based, 3rd party, heuristic and content based. However, these approaches suffer from a number of limitations that needs to be addressed in order to detect phishing URLs. Firstly, features extracted in the past are extensive, with a limitation that it takes a considerable amount of time to extract such features. Secondly, several approaches selected important features using statistical methods, while some propose their own features. Although both methods have been implemented successfully in various approaches, however, these methods produce incorrect results without amplification of domain knowledge. Thirdly, most of the literature has used pre-classified and smaller datasets, which fail to produce exact efficiency and precision on large and real world datasets. Fourthly, the previous proposed approaches lack in advanced evaluation measures. Hence, in this paper, effective machine learning framework is proposed, which predicts phishing URLs without visiting the webpage nor utilizing any 3rd party services. The proposed technique is based on URL and uses full URL, protocol scheme, hostname, path area of the URL, entropy feature, suspicious words and brand name matching using TF-IDF technique for the classification of phishing URLs. The experiments are carried out on six different datasets using eight different machine learning classifiers, in which Random Forest achieved a significant higher accuracy than other classifiers on all the datasets. The proposed framework with only 30 features achieved a higher accuracy of 96.25% and 94.65% on the Kaggle datasets. The comparative results show that the proposed model achieved an accuracy of 92.2%, 91.63%, 94.80, 96.85% on benchmark datasets, which is higher than the existing approaches.
更多
查看译文
关键词
Phishing detection,URL detection,Cybercrime,Machine learning,Random forest (RF),Decision tree (J48)
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要