Security optimize

Fuzzy hashing helps researchers spot morphing malware

Anti-malware applications are dependent on malicious software remaining the same, code-wise. So the bad guys just keep changing it. Michael Kassner looks at a possible defense: fuzzy hashing.

It's obvious, but I have to say it. There's something wrong with existing efforts to squash malware. What's wrong varies from expert to expert. But there's consensus. Anti-malware apps have reactive components. And, that's not a good thing, but required until something better surfaces.

Signature-based detection is one such component. Signatures, also known as fingerprints, are cryptographic hashes of individual pieces of malware code. The anti-malware developer compiles the fingerprints and appropriate heuristic parameters into a database that is used by the anti-malware software to detect malware.

There it is. The malware has to be in hand before a signature can be made.

Why use hashing?

I wasn't prepared for the results when I googled "hash". Smoke it? My son let go a now familiar sigh as he proofed this. Oh well. I did find what I was looking for.

Cryptographic hashes are used for many reasons. The one relevant to our discussion is faster processing. Remember how anti-malware apps work. They scan the entire file looking for malicious code. That takes time and the user will be sensitive to an excessive amount when downloading.

Instead, why not convert both the test snippet of malcode and the suspicious file into individual digital hashes (much faster than scanning).

Tough topic

In the beginning, I was concerned about taking on a subject that involves hashing. There are parts of the process I do not understand. Mistakenly, I mentioned this to my mentor. I should have expected his comeback. "You damn well better figure it out. The readers deserve that much."

Not wanting any more of his guff, I mumbled something about needing to go, hung up, and got to work.

What is hashing?

For those so inclined this link leads to Wolfram's mathematical explanation of hashing. Being a Wikipedia sort of guy, I prefer this definition:

"In computer science, a fingerprinting (hashing) algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes."

My take (simplistic indeed): Digital signatures/fingerprints are magically computed from a chunk of actual malcode using a hash algorithm.

Using the identical hash algorithm, the anti-malware app works the same magic on the file being tested. If the end products are not identical, the file is considered free of malware.

There lies the rub:"exactly identical" is required.

The problem

The process of testing by comparing hash results is kind of like how I view the world. Things are either black or white; gray does not compute. Understanding this, malware developers constantly morph their code.

Then, when a file infected with the newly-altered malware is scanned, the results are different from anything in the signature database. Home free and clear, until:

  • The new malware variant is noticed by the anti-malware developer.
  • The malware is reverse engineered to understand what it does.
  • A new signature/fingerprint is created.
  • The signature is pushed out to all existing client applications.

Man. That has to be frustrating for the good guys, always behind the curve. I wonder. What if things did not have to be either black or white?

Fuzzy hashing to the rescue

As the subtitle implies, shades of gray are now possible. My introduction to fuzzy hashing came from learning how it is used to combat spam. Dr. Andrew Tridgell developed Spamsum in an attempt to identify common indices in spam email:

"Fuzzy hashing allows the discovery of potentially incriminating documents that may not be located using traditional hashing methods. The use of the fuzzy hash is much like the fuzzy logic search; it is looking for documents that are similar but not exactly the same, called homologous files."

Building on Spamsum

Mr. Jesse Kornblum is credited for taking Spamsum and the concept of context triggered piecewise hash (fuzzy hash) a step further when he developed ssdeep. He describes the tool in this paper:

"Ssdeep is a new technique for constructing hash signatures by combining a number of traditional hashes whose boundaries are determined by the context of the input. These signatures can be used to identify modified versions of known files; even if data has been inserted, modified, or deleted in the new files."

Even more magic

If you remember, I deftly avoided the math surrounding cryptographic hashing. I once again will attempt that. ("Son, I fail to see the humor in penciling ‘lame' here.")

Anyway, I came upon some interesting research by Mr. David French, Senior Member of Carnegie Mellon's Software Engineering Institute. If you remember, I mentioned that fuzzy hashing was used to spot spam. Mr. French is investigating the use of fuzzy hashing as a tool to help spot malware (link).

Here comes that deftness. After taking a completely unbiased survey, it became apparent you'd prefer to have fuzzy hashing explained by Mr. French, an academic expert.

Kassner: My attempt at describing fuzzy hashing above is inadequate. Could you please give the readers a brief overview of what it is and how it works? French: There are many techniques which may be considered fuzzy hashing. Most involve two distinct and complementary algorithms. The hash algorithm itself, and the comparison algorithm for determining whether hashes "match." In the case of ssdeep, the algorithms used are open-source, and may be examined in both the ssdeep and the original spamsum source code.

At a very high level, fuzzy hashing is a way to determine whether two inputs are similar, rather than identical. Fuzzy hashes work by chopping up the input data into either fixed-size blocks, or blocks whose size depend on the input data. The blocks are further reduced into a smaller number of data values. For example, by hashing individual blocks into single byte values.

We then take that reduced set of data values, the "fuzzy hash," and compare it to other fuzzy hashes using some comparison function. Ideally, the comparison function will provide some usable metric or distance, by which we can decide whether the inputs are similar or not.

Kassner: When did you first come in contact with fuzzy hashing and what made you consider using it to study malware detection? French: Fuzzy hashing has gained enormous popularity in the last five years. There are many publicly-available presentations and publications regarding the use of fuzzy hashing for things like digital-media forensics. There are also public studies that describe how fuzzy hashing is used to detect similar malware.

My interest in fuzzy hashing as applied to malware is in attempting to quantify how useful fuzzy hashing can be against malware. Malicious code is very different from other domains in which fuzzy hashing may be applied (such as bioinformatics) because it involves human adversaries who benefit directly from preventing others from understanding their software.

Kassner: Normal hashing can ascertain files that are exact duplicates. But, fuzzy hashing detects "almost identical" files, why is that important? French: The importance of detecting nearly identical files, specifically nearly identical malicious code, is driven by the economics of malware analysis. Human analysts are expensive to train, and their time spent analyzing malicious code is very valuable.

No software or automated process can replace the ingenuity, intuition, and the tenacity of the human analyst. However, any method that is capable of detecting substantial similarity between a file under consideration and a previously-analyzed file, has the potential to save time and money.

Kassner: Is ssdeep the application you used for your research? Could you talk about its effectiveness? French: Ssdeep is one of the applications I will be using for my research. The goal of my research is to quantify their effectiveness specifically against malicious code. I am unable to discuss my results at this time. Kassner: Do you see any other potential uses for fuzzy hashing? French: Fuzzy hashing may generally be applied to any data domain in which observing substantial similarity amongst data sets is desirable. The only limitations are whether the data domain in question can be encoded such that the fuzzy hash makes sense. Ideally, data may be arranged linearly as some kind of ordered set, such that the order and the contents are important.

Real-world example

In my quest to come to grips with fuzzy hashing, I ran across an interesting example of how it could be used. Mr. Kornblum was exchanging ideas with a few people on this forum. One of the posters suggested:

"An odd application, perhaps, but related to what my university automatically runs against student project submissions for cheating in low-level CS classes where you have a thousand students and several different human graders."

The poster continues:

"You would be surprised (or maybe not) how many people we caught with a similar process because they thought changing things the compiler strips out anyway would make their copying undetectable."

I joked about it earlier, but being able to determine shades of gray digitally is nothing less than amazing.

Final thoughts

It might not seem like it, but fuzzy hashing has the potential to change the way we view our world. For example, geneticists are using fuzzy hashing to compare the gene sequence of an unknown microorganism to that with a known genome.

I'd like to extend my thanks to Mr. French for his helpful answers and to Mr. Richard Lynch, also of Carnegie Mellon, for making it possible.

About

Information is my field...Writing is my passion...Coupling the two is my mission.

32 comments
AnsuGisalas
AnsuGisalas

Fuzzy hashing could be a strong weapon against message board spammers too... they'd have to figure out new ways to generate their SEO garbage, if they'd be automatically blocked from posting too similar responses.

mattyhiway
mattyhiway

This article was written as if this is some future technology. McAfee's enterprise VirusScan, versions 8.7 and newer, can already do this. Code named Artemis, the technology takes a hash of a suspicious file. Suspicious files are ones whose hash closely, but not exactly, match a known malware pattern. This hash is sent to McAfee's cloud-based Global Threat Intelligence, and a good/bad determination is made in milliseconds. The local host then treats the file the hash was made from as either good or bad based on that determination.

stevew
stevew

Obviously, the best method to defeat malware is to always optimize code. Unfortunately, the method you described here, is probably the closest we will come until some genius thinks up another way. At least it keeps security wonks in the green!

AnsuGisalas
AnsuGisalas

Just curious, and off topic. I tend to see only the greyscales, being distrusting of boolean values in real life. For example if you navigate the streets of an unknown city, and you see a boolean opposition between left and right, you're bound to go astray. It follows from the boolean that three lefts should take you in a direction parallel to the one you get from taking a single right... but that's only true in the rarest of occasions, in cities built up around a certain time when square was the new black.

roobb
roobb

An alternative malware tool called Quarantine was developed in Australia and marketed in this country during the mid to late 80's. All program file lengths, dates were hashed and recorded. If anything changed in a software hash - this activated the software to take action. Unfortunately the industry went on to focus on threads of infection. Maybe the old way wasn't so bad.

clint762011
clint762011

Will Malwarebytes Anti-malware software helps? Thank you. Clinton Bradshaw

Wunderbarb
Wunderbarb

The approach of fuzzy hashing is used for many years in the field of multimedia content. The goal is to detect the same song, picture, video even if it has been compressed, slightly modified... Here also, a cryptographic hash such as MD5 or SHA-1 would not work. The same song with two different compression rates would not be detected as the same song with a cryptographic hash. Thus, we use other fingerprints or perceptual hashes. As mentioned in your blog, the difficult part is in the decision engine that has to find the nearest candidate(s). Although it is far away from malware detection, if you're interested in this topic, the following gives a good introduction and good pointers for deeper diving: [1] F. Lefebvre, B. Chupeau, A. Massoudi, and E. Diehl, Image and video fingerprinting: forensic applications, Proceedings of SPIE, San Jose, CA, USA: 2009, pp. 725405-725405-9 available at http://spie.org/x648.html?product_id=806580. In any case, thank you for the blog. I will now explore the work of Mr French

Michael Kassner
Michael Kassner

Fuzzy hashing is helping researchers spot malware that the bad guys have changed slightly in order to fool normal AV signatures or footprints.

Michael Kassner
Michael Kassner

Artemis does compare hashes, but my sources point to it as cryptographic hashing not fuzzy hashing. That still would work as you explained it. Fuzzy hashing is a step further requiring less human intervention. If you have different information, could you please supply the links.

nwallette
nwallette

I really, really enjoy MK's writing. The tone of the articles is that of a wise old grandfather telling stories. The content is like listening in on a pub discussion amongst philosophers, programmers, engineers, and professors. How he manages to weld the two contrasting styles together into a successful dissertation, I'll never completely understand. :-) But I come away from it all feeling puzzled and enlightened all at once.

Michael Kassner
Michael Kassner

You are correct about minimizing vulnerabilities. The image of the "Dutch boy and his plugging the hole in the dike." come into mind.

Michael Kassner
Michael Kassner

So, my answer will be non-gray, thus non-existent. Unless you are outside your tendencies.

Michael Kassner
Michael Kassner

That is one approach. But, it introduces the problem of false positives. Or what to do if an application is correctly updated. Thanks for mentioning Quarantine, I had forgotten about it.

Michael Kassner
Michael Kassner

I will have to check. They may be using some form of fuzzy hashing.

Michael Kassner
Michael Kassner

I first gained notice of fuzzy hashing due to that paper. I then tried to make a connection with my area of interest and found the information about the researchers I mentioned. All in all, it is quite amazing. Have you any experience using ssdeep?

PhilippeV
PhilippeV

Audio and video contains lots of redundancy. This is important because this is how we can easily detect files that have been reencoded using some automated process, even if it introduces some changes like slight changes of frequency or speed of playing, non fully linear morphing (including minor rotations of images), or cropping or forced saturations of some colors or introduction of some randomized noise. We know a lot about the perceptual model to day, and that's why we can compress Audio/Video so well. The same is not true for software, which will stop working completely if it is randomized. For a software to be harmful, it has to preserve soem structure. For this reason, many malwares are not really mutating, they are instead trying to insert dummy sequences of code in the work flow (such as adding two registers and not using their result, or inserting NOPs that are anyway very easy to detect). Fuzzy hashing for code (suites of instructions) should be able to ignore some sequences of code that are known or veried to do nothing, so that it can only concentrate on hashing what is really useful. Malwres also tend to overcomplexify the code needed to call some functions or APIs, they try to read more data than needed from a known source which, by itself seems harmless, such as getting functions parameter values from a data block that has been scrambled and all addresses into that block recomputed. One way to protect against them is to monitor exactly where the system APIs are effectively called, independantly of how parameters were computed. If you look at a program, you will immediately realize that some instructions are fully independant of each other and can be computed in any random order. malwares are using that to scramble the data flow and code flow in so short sequences that it becomes difficult to predict in which order they will execute (the number of rearrangements is combinatorial, faster than exponential, so it's simply impossible to predict how many hashes you'll need to compute all possible hashes). One solution is to use a code/data flow analyzer, in order to detect dependencies, and all independant branches (that can be executed in any order) being stripped down to their simplest form each, and then sort them using some simple sort algottithm, before hashing then the sorted (but still equivalent) code sequence. In fact a clear sign of malware is now some code that appears to be exagerately underoptimized. With today advances in compiler technologies, a secure and safe software should alsways be optimized as much as possible, leaving very little place for speed and code size improvement: constants are preevaluated, common subexpressions are moved out of loops, register allocation is maximized to avoid unnecessary rereads or rewrites, and the overall code structure should adopt a well known design of "best practices", without using too many jumps (a malware is typically full of long lists of unconditional jumps, or multiple sequences of conditional jumps that will finally reach the same code position, or an equivalent redundant code.) In fact, you can even avoid the steps of systematically analizing the code/data workflow: using a sandbox to virtualize most of what the malware is trying to do, you can let it run on theses long unoptimized sequences until they reach a system API call : what you need to hash will then only be the computed system API parameter values. You don't really need to hash the code itself. So today's antivirus are doing this because it is very efficient. What is hashed is no longer the brute code itself but what it does to interface with the rest of the system (it does not matter what the malware does inside itself, between these interfacing "hot spots", except that it may take too long time to do that and could cause performance problems if the code is not under some CPU/memory quota restriction enforced by the sandbox, that will analyze what happens in the suspected malware if it uses too much ressources internally).

JCitizen
JCitizen

for home computers. I reserve judgement on Enterprise. It is hard to get excited about it after so many disasters though!

Michael Kassner
Michael Kassner

A lot of people worked hard to get me here. They deserve the credit. Hope my son doesn't see this, he'll want a larger raise.

JCitizen
JCitizen

seem to be improving. I received a free license from CNET to a new Emisoft program called Namutu. This program only took SECONDS to find all the active DRM spies in my processes!!! I was simply amazed! As I consumed different types of protected content, a new spy activates, and Namutu nails it!! Of course these are false positives, but I am SO impressed with the rapid slap down they got, that I was extremely tickled! Of course I had to exclude these MPAA and IAA processes to get my blu-ray and HD cable working, but I'm still very happy about this new type of behavioral heuristics. From the makers of Online Armor, this utility needs little updating, but it does, anyway, at least once a day, probably for white-lists. Unfortunately the giveaway period has ended, and I will probably have to purchase it next year. I cannot use Online Armor on Vista x64, so I am very happy I have the next best thing!!! :)

AnsuGisalas
AnsuGisalas

See this gray here? I am imagining it to be black. Now I am imagining it to be white. You've been talking to your mentor a lot, have you not? That there was ...cryptic.

JCitizen
JCitizen

I do believe it was either MBAM or AdAware or both that claim one detection rule can cover some tens of malware "families". If I remember the way I read it, the rules look for similar snippets of code common to each family of malware to build a definition. This sounds similar to your fuzzy hashing explanation; but maybe I'm off kilter. I notice both the Lavasoft product and MBAM have fast scanning capabilities, so maybe this helps in that area as well.

Wunderbarb
Wunderbarb

I am more on the side of content protection. But I will investigate ssdeep. New ideas are always welcome and can bring to ways of protection. I'm glad that this paper initiate your exploration in your own field. It highlights one of my belief in security: cross fertilization is key. BTW, I'm one of the authors.

nwallette
nwallette

After first reading the description of how this type of hashing works, I didn't think it had much promise. After all, malware code is going to exist of some key function calls amongst a sea of logic to steer the course, and illogic thrown in just to obfuscate the purpose. It seems to me that the instructions necessary to make our lives miserable could be small enough, and scattered liberally enough, to skirt detection quite easily. It's why pattern-matching is still the only established defense. Unfortunately, we're still the only computers capable of doing this with any degree of success. Ultimately, it doesn't seem like there's any silver bullet to malware detection. - Cryptographic hashes will establish a baseline of a system. After inventory is done, then it becomes easy to tell when something has changed. Changes should arouse suspicion (but not quite to the point of causing panic!) and then trigger further investigation. - Fuzzy hashes are useful here, if the transformation is simple enough to be caught be reordering instructions or breaking them into groups that may or may not be linear and consecutive anymore. - A lot can be gleaned from watching the system calls made by a program. There are some calls that will raise flags just by their existence. But again, this alone isn't a sign of malicious intent. (Partitioning and BIOS update tools to name a few obvious examples.) This is of course old technology used in AV utilities since the beginning of such a market. - Finally, a lot of good comes from group think. There are several online databases where filenames and sizes are recorded to aid savvy users curious about the purpose or intent of some particular executable. Adding hashes to this would be a small task. Populating them (voluntarily) by clients would increase ease of automated detection. In this way, I could see AV software go open source with much less development overhead as compared to commercial ventures relying on manual analysis.

pgit
pgit

Several years back I read, in an airline in-flight magazine of all places, that the future benefits of Moore's law would soon be effectively erased by an ever increasing need to task system resources to fighting malware. If we weren't continually improving speed and efficiency of the hardware, we'd be seeing a noticeable erosion of performance going into the future. The author speculated it to be a one-to-one relationship: Moore's law improvements - increased anti malware demand = computer performance doesn't appear to improve much anymore, not like in the heady 486/Pentium days. I registered all that as 'interesting food for thought,' but maybe there's Moore to it than I imagined at the time. ;)

Michael Kassner
Michael Kassner

I capture all sorts of enthusiasm working with members of academia.

AnsuGisalas
AnsuGisalas

It's probably one reason why you can make cryptographic hashing interesting even without calling on its mathematical sex-appeal :) No mean feat, that.

Michael Kassner
Michael Kassner

Also, I would be interested in learning more for future articles. Could you please contact me using the link in my profile.

Michael Kassner
Michael Kassner

I have not read anything about that relationship. It is interesting as the load increase is not linear. Malware requires processor time as does antimalware.