It’s obvious, but I have to say it. There’s something wrong with existing efforts to squash malware. What’s wrong varies from expert to expert, but there’s consensus: anti-malware apps have reactive components. That’s not a good thing, but it’s required until something better surfaces.

Signature-based detection is one such component. Signatures, also known as fingerprints, are cryptographic hashes of individual pieces of malware code. The anti-malware developer compiles the fingerprints and appropriate heuristic parameters into a database that is used by the anti-malware software to detect malware.

There it is. The malware has to be in hand before a signature can be made.

Why use hashing?

I wasn’t prepared for the results when I googled “hash”. Smoke it? My son let out a now-familiar sigh as he proofread this. Oh well. I did find what I was looking for.

Cryptographic hashes are used for many reasons. The one relevant to our discussion is faster processing. Remember how anti-malware apps work: they scan the entire file looking for malicious code. That takes time, and users are sensitive to delays when downloading files.

Instead, why not convert both the test snippet of malcode and the suspicious file into individual digital hashes and compare those? It’s much faster than scanning.
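As a sketch of that idea, signature matching boils down to a set lookup. The database contents below are hypothetical, and a real scanner hashes sections of code rather than whole files, but the mechanics are the same:

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malcode samples.
SIGNATURE_DB = {
    hashlib.sha256(b"known-malcode-sample").hexdigest(),
}

def scan(file_bytes):
    """Flag the input as malware if its digest is in the signature database."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURE_DB
```

A file whose bytes hash to a known digest is flagged; everything else sails through.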

Tough topic

In the beginning, I was concerned about taking on a subject that involves hashing. There are parts of the process I do not understand. Mistakenly, I mentioned this to my mentor. I should have expected his comeback. “You damn well better figure it out. The readers deserve that much.”

Not wanting any more of his guff, I mumbled something about needing to go, hung up, and got to work.

What is hashing?

For those so inclined, this link leads to Wolfram’s mathematical explanation of hashing. Being a Wikipedia sort of guy, I prefer this definition:

“In computer science, a fingerprinting (hashing) algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes.”

My take (simplistic indeed): Digital signatures/fingerprints are magically computed from a chunk of actual malcode using a hash algorithm.

Using the identical hash algorithm, the anti-malware app works the same magic on the file being tested. If the end products are not identical, the file is considered free of malware.

Therein lies the rub: “exactly identical” is required.
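To see why, consider how drastically a cryptographic hash changes when a single byte of input changes. The snippets below are made up, but any inputs behave the same way:

```python
import hashlib

original = b"push ebp; mov ebp, esp; ..."   # hypothetical malcode snippet
tweaked  = b"push ebp; mov ebp, esp; ..!"   # same snippet, one byte changed

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(tweaked).hexdigest()
# The two digests bear no usable resemblance to each other: a one-byte
# edit scrambles the entire hash (the avalanche effect), so a signature
# match is strictly all-or-nothing.
```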

The problem

The process of testing by comparing hash results is kind of like how I view the world. Things are either black or white; gray does not compute. Understanding this, malware developers constantly morph their code.

Then, when a file infected with the newly-altered malware is scanned, the results are different from anything in the signature database. Home free and clear, until:

  • The new malware variant is noticed by the anti-malware developer.
  • The malware is reverse engineered to understand what it does.
  • A new signature/fingerprint is created.
  • The signature is pushed out to all existing client applications.

Man. That has to be frustrating for the good guys, always behind the curve. I wonder. What if things did not have to be either black or white?

Fuzzy hashing to the rescue

As the subtitle implies, shades of gray are now possible. My introduction to fuzzy hashing came from learning how it is used to combat spam. Dr. Andrew Tridgell developed Spamsum in an attempt to identify email messages that are similar to known spam:

“Fuzzy hashing allows the discovery of potentially incriminating documents that may not be located using traditional hashing methods. The use of the fuzzy hash is much like the fuzzy logic search; it is looking for documents that are similar but not exactly the same, called homologous files.”

Building on Spamsum

Mr. Jesse Kornblum is credited with taking Spamsum and the concept of the context triggered piecewise hash (fuzzy hash) a step further when he developed ssdeep. He describes the tool in this paper:

“Ssdeep is a new technique for constructing hash signatures by combining a number of traditional hashes whose boundaries are determined by the context of the input. These signatures can be used to identify modified versions of known files; even if data has been inserted, modified, or deleted in the new files.”
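The real algorithm is more involved, but a toy version can convey the idea of context triggered piecewise hashing. The rolling hash, block alphabet, and trigger condition below are simplified stand-ins for what ssdeep actually uses:

```python
import hashlib
import random

WINDOW, MODULUS = 7, 16
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef"

def ctph(data):
    """Toy context triggered piecewise hash.

    A rolling value over the last WINDOW bytes decides where each block
    ends (the "context trigger"), and every block is then reduced to a
    single character of the signature.
    """
    sig, start = [], 0
    for i in range(len(data)):
        rolling = sum(data[max(0, i - WINDOW + 1):i + 1])
        if rolling % MODULUS == MODULUS - 1:           # context trigger
            block = data[start:i + 1]
            sig.append(ALPHABET[hashlib.md5(block).digest()[0] % 32])
            start = i + 1
    if start < len(data):                              # trailing partial block
        sig.append(ALPHABET[hashlib.md5(data[start:]).digest()[0] % 32])
    return "".join(sig)

# A deterministic pseudo-random "file", then the same file with two bytes edited.
rng = random.Random(42)
data = bytes(rng.randrange(256) for _ in range(2000))
edited = data[:1000] + b"XX" + data[1002:]

sig_a, sig_b = ctph(data), ctph(edited)
# Only the characters for blocks near the edit change; because the block
# boundaries come from the content itself, they resynchronize once the
# rolling window clears the edit, so the leading and trailing parts of
# the two signatures still match.
```

This is exactly the property a fixed-block scheme lacks: inserting or deleting bytes shifts every later fixed block, but content-driven boundaries realign on their own.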

Even more magic

If you remember, I deftly avoided the math surrounding cryptographic hashing. I once again will attempt that. (“Son, I fail to see the humor in penciling ‘lame’ here.”)

Anyway, I came upon some interesting research by Mr. David French, Senior Member of Carnegie Mellon’s Software Engineering Institute. If you remember, I mentioned that fuzzy hashing was used to spot spam. Mr. French is investigating the use of fuzzy hashing as a tool to help spot malware (link).

Here comes that deftness. After taking a completely unbiased survey, it became apparent you’d prefer to have fuzzy hashing explained by Mr. French, an academic expert.

Kassner: My attempt at describing fuzzy hashing above is inadequate. Could you please give the readers a brief overview of what it is and how it works?
French: There are many techniques which may be considered fuzzy hashing. Most involve two distinct and complementary algorithms: the hash algorithm itself, and the comparison algorithm for determining whether hashes “match.” In the case of ssdeep, the algorithms used are open-source, and may be examined in both the ssdeep and the original spamsum source code.

At a very high level, fuzzy hashing is a way to determine whether two inputs are similar, rather than identical. Fuzzy hashes work by chopping up the input data into either fixed-size blocks, or blocks whose size depend on the input data. The blocks are further reduced into a smaller number of data values. For example, by hashing individual blocks into single byte values.

We then take that reduced set of data values, the “fuzzy hash,” and compare it to other fuzzy hashes using some comparison function. Ideally, the comparison function will provide some usable metric or distance, by which we can decide whether the inputs are similar or not.
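ssdeep scores matches with a weighted edit distance between the two signature strings. As a rough stand-in for that comparison function, Python’s standard difflib can produce the same kind of 0-to-100 score:

```python
import difflib

def match_score(sig_a, sig_b):
    """0 means no resemblance, 100 means identical signatures.

    A crude stand-in for ssdeep's comparison step, which uses a
    weighted edit distance between the two fuzzy-hash strings.
    """
    return round(100 * difflib.SequenceMatcher(None, sig_a, sig_b).ratio())
```

A signature differing in one character out of six scores 83 rather than dropping to zero, which is the whole point: the comparison yields a degree of similarity, not a binary verdict.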

Kassner: When did you first come in contact with fuzzy hashing and what made you consider using it to study malware detection?
French: Fuzzy hashing has gained enormous popularity in the last five years. There are many publicly-available presentations and publications regarding the use of fuzzy hashing for things like digital-media forensics. There are also public studies that describe how fuzzy hashing is used to detect similar malware.

My interest in fuzzy hashing as applied to malware is in attempting to quantify how useful fuzzy hashing can be against malware. Malicious code is very different from other domains in which fuzzy hashing may be applied (such as bioinformatics) because it involves human adversaries who benefit directly from preventing others from understanding their software.

Kassner: Normal hashing can identify files that are exact duplicates, but fuzzy hashing detects “almost identical” files. Why is that important?
French: The importance of detecting nearly identical files, specifically nearly identical malicious code, is driven by the economics of malware analysis. Human analysts are expensive to train, and their time spent analyzing malicious code is very valuable.

No software or automated process can replace the ingenuity, intuition, and the tenacity of the human analyst. However, any method that is capable of detecting substantial similarity between a file under consideration and a previously-analyzed file, has the potential to save time and money.

Kassner: Is ssdeep the application you used for your research? Could you talk about its effectiveness?
French: Ssdeep is one of the applications I will be using for my research. The goal of my research is to quantify their effectiveness specifically against malicious code. I am unable to discuss my results at this time.

Kassner: Do you see any other potential uses for fuzzy hashing?
French: Fuzzy hashing may generally be applied to any data domain in which observing substantial similarity amongst data sets is desirable. The only limitations are whether the data domain in question can be encoded such that the fuzzy hash makes sense. Ideally, data may be arranged linearly as some kind of ordered set, such that the order and the contents are important.

Real-world example

In my quest to come to grips with fuzzy hashing, I ran across an interesting example of how it could be used. Mr. Kornblum was exchanging ideas with a few people on this forum. One of the posters suggested:

“An odd application, perhaps, but related to what my university automatically runs against student project submissions for cheating in low-level CS classes where you have a thousand students and several different human graders.”

The poster continues:

“You would be surprised (or maybe not) how many people we caught with a similar process because they thought changing things the compiler strips out anyway would make their copying undetectable.”

I joked about it earlier, but being able to determine shades of gray digitally is nothing less than amazing.

Final thoughts

It might not seem like it, but fuzzy hashing has the potential to change the way we view our world. For example, geneticists are using fuzzy hashing to compare the gene sequence of an unknown microorganism with those of organisms whose genomes are known.

I’d like to extend my thanks to Mr. French for his helpful answers and to Mr. Richard Lynch, also of Carnegie Mellon, for making it possible.