Audio and video contain a lot of redundancy. This matters because it is how we can easily detect files that have been re-encoded by some automated process, even when that process introduces changes such as slight shifts of frequency or playback speed, non-linear morphing (including minor image rotations), cropping, forced saturation of some colors, or injected random noise. We know a great deal about the human perceptual model today, and that is why audio and video compress so well.
The same is not true for software, which stops working entirely if it is randomized. To remain harmful, a piece of software has to preserve some structure. For this reason, most malware does not truly mutate; instead it inserts dummy code sequences into the work flow (such as adding two registers and never using the result, or inserting NOPs, which are in any case easy to detect). Fuzzy hashing for code (sequences of instructions) should therefore ignore sequences that are known or verified to do nothing, so that it concentrates on hashing only what is actually meaningful.
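A minimal sketch of that idea, using a toy instruction stream represented as strings (the no-op list and sample instructions here are hypothetical, not taken from any real scanner):

```python
import hashlib

# Instruction patterns known (or verified) to have no effect.
# A real scanner would use semantic analysis; this toy set is illustrative.
NO_OPS = {
    "nop",
    "xchg eax, eax",   # a classic multi-byte NOP encoding
    "mov eax, eax",    # register-to-self move, no effect
}

def fuzzy_code_hash(instructions):
    """Hash only the instructions that actually do something."""
    meaningful = [ins for ins in instructions if ins not in NO_OPS]
    return hashlib.sha256("\n".join(meaningful).encode()).hexdigest()

clean  = ["push ebp", "mov ebp, esp", "call GetProcAddress"]
padded = ["push ebp", "nop", "mov ebp, esp", "nop", "nop",
          "call GetProcAddress"]

# NOP padding no longer changes the fingerprint.
assert fuzzy_code_hash(clean) == fuzzy_code_hash(padded)
```

The point is that the hash is computed over the semantically meaningful subsequence, so padding tricks leave the fingerprint unchanged.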
Malware also tends to overcomplicate the code needed to call functions or APIs: it reads more data than needed from a source that by itself seems harmless, for instance fetching function parameter values from a data block that has been scrambled, with all addresses into that block recomputed. One way to defend against this is to monitor exactly where the system APIs are actually called, independently of how their parameters were computed.
If you look at a program, you will immediately notice that some instructions are fully independent of each other and can be executed in any order. Malware exploits this to scramble the data flow and code flow within sequences so short that it becomes difficult to predict the order in which they will run: the number of rearrangements grows combinatorially, faster than exponentially, so it is simply impossible to precompute all the hashes you would need to cover every variant.
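To see how quickly this blows up: with n mutually independent instructions a mutation engine can emit any of the n! orderings, so even short sequences defeat exhaustive hash enumeration.

```python
from math import factorial

# Number of distinct orderings of n mutually independent instructions.
for n in (5, 10, 20):
    print(f"{n} instructions -> {factorial(n):,} orderings")

# Just 20 independent instructions already allow
# 2,432,902,008,176,640,000 possible orderings.
```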
One solution is a code/data-flow analyzer that detects dependencies, strips every independent branch (those that can run in any order) down to its simplest form, sorts the branches with some simple sort algorithm, and then hashes the sorted, but still equivalent, code sequence.
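The canonicalization step could be sketched as follows. Here the dependency analysis itself is assumed to have already been done: the input is a list of groups, where each group holds instructions proven to be order-independent (the groups and instructions are hypothetical examples):

```python
import hashlib

def canonical_hash(independent_groups):
    """Sort each order-independent group into a canonical order,
    then hash the resulting (still equivalent) sequence."""
    canon = []
    for group in independent_groups:
        canon.extend(sorted(group))  # any deterministic sort works
    return hashlib.sha256("\n".join(canon).encode()).hexdigest()

# Two scrambled variants of the same code: the two 'mov' instructions
# are independent, so a mutation engine may emit them in either order.
variant_a = [["mov eax, 1", "mov ebx, 2"], ["add eax, ebx"]]
variant_b = [["mov ebx, 2", "mov eax, 1"], ["add eax, ebx"]]

assert canonical_hash(variant_a) == canonical_hash(variant_b)
```

Because sorting picks one representative out of the n! equivalent orderings, every scrambled variant collapses to a single hash.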
In fact, a clear sign of malware today is code that appears exaggeratedly under-optimized. Given the advances in compiler technology, secure and safe software should always be optimized as much as possible, leaving very little room for speed or code-size improvement: constants are pre-evaluated, common subexpressions are hoisted out of loops, register allocation is maximized to avoid unnecessary re-reads and re-writes, and the overall code structure should follow well-known "best practices" without too many jumps. (A malware sample is typically full of long chains of unconditional jumps, or multiple sequences of conditional jumps that all eventually reach the same code position or equivalent redundant code.)
In fact, you can even skip systematic analysis of the code/data flow: by using a sandbox to virtualize most of what the malware tries to do, you can let it run through those long, unoptimized sequences until it reaches a system API call. What you then need to hash is only the computed system API parameter values; you do not really need to hash the code itself. Today's antivirus products do this because it is very efficient: what is hashed is no longer the raw code but what it does to interface with the rest of the system. It does not matter what the malware does internally between these interfacing "hot spots", except that it may take a long time and cause performance problems if the code is not kept under CPU/memory quotas enforced by the sandbox, which will inspect what the suspected malware is doing if it consumes too many resources internally.
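The behavioral fingerprint described above could be sketched like this: only the API names and their computed parameter values, as recorded by a hypothetical sandbox trace, feed the hash, while the obfuscated code that produced them is ignored entirely (the trace entries below are invented for illustration):

```python
import hashlib

def behavior_hash(api_trace):
    """Hash a sandbox trace of (api_name, args) pairs, ignoring
    everything the sample did between those calls."""
    lines = [f"{name}({', '.join(map(repr, args))})"
             for name, args in api_trace]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

# Hypothetical trace: two differently obfuscated samples that end up
# making the same system calls collapse to the same fingerprint.
trace = [
    ("CreateFileW", (r"C:\Windows\System32\drivers\etc\hosts",
                     "GENERIC_WRITE")),
    ("WriteFile",   ("<handle>", b"127.0.0.1 update.example.com")),
]

fingerprint = behavior_hash(trace)
```

The internal scrambling tricks from the earlier paragraphs become irrelevant here, since none of that code ever reaches the hash input.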