Unknown Malicious Code Detection - Practical Issues

PROCEEDINGS OF THE 7TH EUROPEAN CONFERENCE ON INFORMATION WARFARE AND SECURITY (2008)

Abstract
The recent growth in Internet usage has motivated the creation of new malicious code for various purposes, including information warfare. Today's signature-based antivirus tools can accurately detect known malicious code but are very limited in detecting new malicious code. New malicious code is being created every day, and its volume is expected to increase in the coming years. Recently, machine learning methods, such as classification algorithms, have been used successfully for the detection of unknown malicious code. These studies were based on test collections of limited size (fewer than 3,000 files), and the proportions of malicious and benign files in the training and test sets were identical. Such test collections do not correspond to real-life conditions, in which the percentage of malicious files is significantly lower than that of benign files. In this study we present a methodology for the detection of unknown malicious code. The executable binary code is represented by n-grams. We performed an extensive evaluation using a test collection of more than 30,000 files, in which we investigated the imbalance problem. Five levels of Malicious Files Percentage (MFP) in the training set (16.7, 33.4, 50, 66.7 and 83.4%) were used to train classifiers. Seventeen levels of MFP (5, 7.5, 10, 12.5, 15, 20, 30, 40, 50, 60, 70, 80, 85, 87.5, 90, 92.5 and 95%) were set in the test set to represent various benign/malicious file ratios during detection. Our evaluation results suggest that different classification algorithms react differently to the various benign/malicious file ratios. For 10% MFP in the test set, representing real-life conditions, the highest performance was generally achieved with less than 33.3% MFP in the training set, and specific classifiers achieved above 95% accuracy. Additionally, we present a chronological evaluation, in which the dataset from 2000 to 2007 was divided into training sets and test sets. Evaluation results show that the training set needs to be updated.
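The two mechanisms named in the abstract, n-gram representation of binaries and control of the MFP in a file set, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of byte-level n-grams with n = 4, and the random sampling scheme are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count overlapping byte n-grams in a binary blob.

    The value n = 4 is an illustrative assumption; the abstract
    does not state which n was used.
    """
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def sample_with_mfp(malicious, benign, mfp, size, seed=0):
    """Draw `size` files so that a fraction `mfp` of them is malicious.

    `malicious` and `benign` are lists of file identifiers (hypothetical);
    `mfp` is the Malicious Files Percentage expressed as a fraction,
    e.g. 0.10 for the 10% level described as realistic in the abstract.
    """
    n_mal = round(size * mfp)
    n_ben = size - n_mal
    if n_mal > len(malicious) or n_ben > len(benign):
        raise ValueError("not enough files for the requested MFP and size")
    rng = random.Random(seed)
    picked = rng.sample(malicious, n_mal) + rng.sample(benign, n_ben)
    rng.shuffle(picked)
    return picked
```

Under these assumptions, calling `sample_with_mfp` once per MFP level would yield the differently balanced training and test sets that the evaluation compares, and the n-gram counts from `byte_ngrams` would serve as the raw features passed to a classifier.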
Keywords
Malicious code detection, antivirus, machine learning, text categorization, imbalance problem