A Study of Detecting Computer Viruses in Real-Infected Files in the N-Gram Representation With Machine Learning Methods
Machine learning methods have been applied successfully in recent years to detect new and previously unseen computer viruses. The viruses were, however, detected in small virus loader files rather than in real infected executable files. The authors created data sets of benign files, virus loader files, and real infected executable files and represented the data as collections of n-grams. Histograms of the relative frequencies of the n-grams indicate that detecting viruses in real infected executable files with machine learning methods is nearly impossible in the n-gram representation. This claim is supported both by an information-theoretic analysis of the n-gram representation and empirically by classification experiments with machine learning methods.
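To make the n-gram representation concrete, the following is a minimal sketch of how relative n-gram frequencies and their Shannon entropy could be computed for a byte sequence. It assumes byte-level, overlapping n-grams and uses entropy as one possible information-theoretic summary; the function names `ngram_relative_frequencies` and `shannon_entropy` are hypothetical, and the sketch is not the authors' implementation.

```python
from collections import Counter
import math


def ngram_relative_frequencies(data: bytes, n: int = 2) -> dict:
    """Return the relative frequency of each overlapping byte n-gram in data.

    Assumption: n-grams are taken over raw bytes with a sliding window of
    step 1. The study itself may define n-grams differently.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    total = len(data) - n + 1  # number of overlapping windows
    if total <= 0:
        return {}  # input shorter than one n-gram
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def shannon_entropy(freqs: dict) -> float:
    """Shannon entropy (in bits) of a relative-frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    # Illustrative input only; real experiments would read executable files.
    sample = b"ABABAB"
    freqs = ngram_relative_frequencies(sample, n=2)
    for gram, p in sorted(freqs.items()):
        print(gram, round(p, 3))
    print("entropy (bits):", round(shannon_entropy(freqs), 3))
```

Applied to whole files, the resulting frequency dictionaries are the kind of data a histogram comparison between benign, loader, and infected files would be built from.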