Historically, people defending digital infrastructure have been at a significant disadvantage. Bad guys can morph their malware tools at will, while security professionals must always be ready to push out new versions of their products when previously undetected malware is discovered, often too late for the defenders who then have to clean up the mess.
And bad guys are opportunists, always casting their nets in waters teeming with unsuspecting victims. Nowhere is this more apparent than in the mobile industry, particularly on devices running the Android operating system. Security companies have been reporting massive increases in malware infections. Kaspersky’s 2015 Security Bulletin reports detecting four million malware infections in 2015, a 216% increase over 2014.
Two problems with machine learning malware detection based on batch methods
Security companies have been trying to introduce proactive security products, and there has been some success over the past 10 years, particularly when machine learning has been incorporated. Security products using machine learning employ algorithms designed to distinguish between malware files and clean files. To detect malicious behavior patterns, they use features such as system calls, Application Programming Interfaces (APIs) invoked, resources and privileges used, and the control and data flows observed during an app’s execution.
Machine learning is an improvement, but not to the point where bad guys need to start looking for a new line of work. Annamalai Narayanan, Liu Yang, Lihui Chen, and Liu Jinliang from Nanyang Technological University, Singapore, in their research paper “Adaptive and Scalable Android Malware Detection through Online Learning,” suggest there are two reasons why machine learning based on batch methodology is unsuitable for real-world, large-scale malware detection: population drift and volume.
Population drift: Machine learning malware detection based on batch methodology, according to the authors, assumes the malware population (training data) used to build the detection engine does not change over time. “Malware does not fit this profile,” suggest the paper’s authors. “The entire population of malware is constantly evolving due to various reasons such as exploiting new vulnerabilities, and evading novel detection techniques.” This makes the collection of malware identified today unrepresentative of malware generated in the future, which erodes the advantage machine learning had when the model was first trained.
Volume: The earlier mention of four million infections in 2015 attests to the volume being considered. The paper’s authors again suggest machine learning using batch methods will be severely handicapped. “Batch learners, to keep abreast with drifting populations, have to be frequently re-trained using huge volumes of data,” explains the paper. “Hence they pose severe scalability issues when used in the Android malware detection context where we have millions of samples already and thousands streaming in every day. Retraining frequently with such a volume renders them computationally impractical.”
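To make the scalability argument concrete, here is a rough back-of-the-envelope sketch. The figures are hypothetical, chosen only to match the quote's "millions of samples already and thousands streaming in every day," and the count of samples touched is an assumed stand-in for computational cost (it assumes one pass over the data per retraining run).

```python
def batch_samples_processed(initial, per_day, days):
    """Samples touched if the model is retrained from scratch once a day.

    Each daily retrain revisits the whole corpus collected so far.
    """
    return sum(initial + per_day * day for day in range(1, days + 1))


def online_samples_processed(per_day, days):
    """Samples touched if the model is updated once per newly arrived sample."""
    return per_day * days


# Hypothetical workload: 1,000,000 existing samples, 1,000 new ones per day,
# observed over 30 days.
batch_cost = batch_samples_processed(1_000_000, 1_000, 30)
online_cost = online_samples_processed(1_000, 30)
print(batch_cost, online_cost, batch_cost / online_cost)
```

Under these assumed numbers the daily batch retraining touches roughly a thousand times more samples than the per-sample updates, and the gap widens as the corpus grows.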
A solution: DroidOL
In their research paper, the four researchers then propose a solution called DroidOL, which they describe as:
“An accurate, adaptive, and scalable malware detection framework based on online learning, where we continuously retrain the model upon receiving each labeled sample and make predictions using the updated model.”
They then offer the following reasons as to why DroidOL is better suited than malware-detection platforms based on batch methodology:
- The detection model adapts to changes in malware features (population drift) over time, automatically.
- Large numbers of malware applications can be processed more efficiently online than using batch methods.
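The predict-then-retrain loop described in the quote above can be sketched as follows. This is a minimal illustration, not DroidOL's implementation: the article does not name the learning algorithm, so the passive-aggressive style update used here is an assumption, and the class, function, and feature names are all hypothetical.

```python
class OnlineDetector:
    """Linear classifier that is updated one labeled sample at a time."""

    def __init__(self):
        self.weights = {}  # feature name -> weight

    def score(self, features):
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def predict(self, features):
        # +1 means malware, -1 means benign.
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        # Passive-aggressive style step (assumed, not taken from the article):
        # change the weights only when the sample falls inside the margin.
        loss = max(0.0, 1.0 - label * self.score(features))
        norm_sq = sum(value * value for value in features.values())
        if loss > 0.0 and norm_sq > 0.0:
            step = loss / norm_sq
            for name, value in features.items():
                self.weights[name] = (self.weights.get(name, 0.0)
                                      + step * label * value)


def run_stream(detector, stream):
    """Predict each sample with the current model, then learn from its label."""
    correct = 0
    total = 0
    for features, label in stream:
        if detector.predict(features) == label:
            correct += 1
        total += 1
        detector.update(features, label)
    return correct / total if total else 0.0


# Hypothetical stream of (feature vector, label) pairs.
stream = [
    ({"sends_sms": 1.0}, 1),
    ({"draws_ui": 1.0}, -1),
    ({"sends_sms": 1.0}, 1),
    ({"draws_ui": 1.0}, -1),
]
print(run_stream(OnlineDetector(), stream))
```

Because each update touches only the features of the sample at hand, the cost per sample stays constant however large the stream becomes, and new malware behavior starts shifting the weights as soon as its first labeled sample arrives.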
The diagram in Figure A depicts how DroidOL extracts features from the Inter-Procedural Control-Flow Graphs (ICFGs) of applications. ICFGs are known to be robust against the evasion and obfuscation techniques adopted by malware.
To extract semantic features from ICFGs, DroidOL uses the Weisfeiler-Lehman graph kernel, which supports an explicit feature vector representation of graphs.
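As a rough illustration of how a Weisfeiler-Lehman style relabeling turns a graph into an explicit feature vector, consider the sketch below. It is a generic version of the technique, not the paper's exact procedure: the function name, the node labels, and the choice to use each node's successors as its neighborhood are assumptions made for this example.

```python
from collections import Counter


def wl_features(successors, labels, iterations=2):
    """Count Weisfeiler-Lehman style labels over a labeled directed graph.

    successors: dict mapping node -> list of successor nodes
    labels:     dict mapping node -> initial string label
    Returns a Counter that acts as a sparse, explicit feature vector.
    """
    features = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        updated = {}
        for node in current:
            # Sorting makes the new label independent of neighbor order.
            neighbor_labels = sorted(current[n]
                                     for n in successors.get(node, ()))
            updated[node] = current[node] + "(" + ",".join(neighbor_labels) + ")"
        current = updated
        features.update(current.values())
    return features


# Hypothetical three-node control-flow graph with made-up labels.
graph = {"n0": ["n1", "n2"], "n1": ["n2"], "n2": []}
node_labels = {"n0": "ui", "n1": "net", "n2": "sms"}
print(wl_features(graph, node_labels, iterations=1))
```

Each round folds a node's neighborhood into its label, so the resulting counts describe progressively larger pieces of program structure. The counts depend only on labels and connectivity, not on node names, so two structurally identical graphs produce the same feature vector.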
In their paper, the researchers note, “In a large-scale comparative analysis of more than 87,000 apps, DroidOL achieves 84.29 percent accuracy outperforming two state-of-the-art malware techniques by more than 20 percent in their typical batch learning setting and more than 3 percent when they are continuously retrained.”
The researchers attribute the gain in accuracy over batch-based machine learning to DroidOL’s continuous retraining, and they conclude, “This superior performance make DroidOL, in particular, and online learning based solutions, in general, better candidates for practical large-scale malware detection.”
If the amount of malware designed for Android products continues to increase at its current pace, hopefully DroidOL or similar online learning solutions will soon be available.