Use MD5 hashes to verify software downloads


Professor Ronald Rivest of MIT created the MD5 cryptographic hash function in 1991 to replace the earlier MD4 algorithm. It produces a 128-bit hash value, typically expressed as a 32-digit hexadecimal number. For instance, an MD5 hash generated from an OpenOffice.org download (v2.3.0 for Win32, English language) looks like this:

  beda08800f9505117220b6db1deb453a

Since its creation, MD5 has become a de facto Internet standard (see RFC 1321 for details), and has come to be used for a great many purposes. While I am not aware of any statistical studies that support or dispute this, I believe the two most common uses are:

  1. hash comparison for password authentication
  2. hash comparison to verify file integrity

In either case, the MD5 algorithm is used to generate a hash value from the known good data -- the original password in the former case, the original file in the latter. For password authentication, whenever someone attempting to log in enters a password, a hash is generated from the entered password and compared against the stored hash. If they match, the system treats authentication as successful. For file integrity verification, such as when downloading an application installer, an MD5 hash (often called a "checksum") is often provided along with the download. To verify that the file is the original, uncorrupted file you wanted, generate a new hash from the file you downloaded and compare it against the MD5 hash provided with the download.
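
The file-integrity case described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming Python's standard hashlib module is available; the function names md5_of_file and matches_published_hash are made up for this example, and the path and published hash would come from whatever download you are checking.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hash of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large installer need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hash):
    """Compare a freshly generated hash against the published one."""
    # Published hashes are sometimes upper case or padded with whitespace.
    expected = published_hash.strip().lower()
    return md5_of_file(path) == expected
```

With these helpers, matches_published_hash would be called with the path of the downloaded file and the hash copied from the download page; a result of False means the file on disk is not the file the publisher hashed.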

There are at least a couple of reasons to verify the integrity of a software download, such as with an MD5 hash:

  1. The file may have been corrupted or truncated during download, such as by an interrupted transfer or a fault somewhere along the network path.
  2. It's always a good idea to make sure someone has not somehow arranged for your download to be compromised so that you get a modified or different file that can be used to crack security on your computer when executed.

The software management systems of most open source Unix-like OSes, such as portupgrade for FreeBSD or APT for Debian GNU/Linux, should handle hash comparisons for you automatically, behind the scenes. That is one of the reasons for a modern software management system: it simplifies the end-user's part of the process of making sure that software installation is as secure as it reasonably can be.

If you are a developer, an alpha tester, or a user of an OS that does not provide this sort of protection for most software installation, you may find you need to install software that is not handled by a software management system. In such cases, it is still (at least usually) a good idea to verify hashes to make sure you are getting exactly what you expect.

The OpenOffice.org website provides some instructions on how you can verify MD5 hashes on a variety of platforms. As of this writing, it provides instructions for verification using the MD5 Hash Tool extension for the Firefox browser regardless of OS, the digestIT tool for MS Windows, and the md5sum command line tool for Linux systems.

BSD Unix systems like FreeBSD provide a command simply called md5, which works in much the same way as md5sum on Linux systems like Debian GNU/Linux. An example of generating an MD5 hash from a file called "test.txt" follows, where > is the shell's command prompt:

> md5 test.txt

MD5 (test.txt) = d76b04fbbf392f6917e119bedf78d2ef

As you can see by comparing this with the OpenOffice.org Using MD5 Checksums page, the FreeBSD md5 utility can be used the same way as the Linux md5sum utility. The only difference is the format of its output.
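
Because the two utilities differ only in output format, a small helper can pull the hash out of either one. The Python sketch below assumes the BSD layout shown above and the "hash, then file name" layout that md5sum prints; extract_md5 is a hypothetical name for this example, not part of either tool, and unusual file names that md5sum escapes are not handled.

```python
import re

# BSD md5 prints:     MD5 (test.txt) = <32 hex digits>
BSD_FORMAT = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<hash>[0-9a-fA-F]{32})$")
# GNU md5sum prints:  <32 hex digits>, a space, a space or "*", the file name
GNU_FORMAT = re.compile(r"^(?P<hash>[0-9a-fA-F]{32}) [ *](?P<name>.+)$")


def extract_md5(line):
    """Return (hash, file name) from one line of md5 or md5sum output."""
    text = line.strip()
    for pattern in (BSD_FORMAT, GNU_FORMAT):
        match = pattern.match(text)
        if match:
            return match.group("hash").lower(), match.group("name")
    raise ValueError("unrecognized MD5 output: " + repr(line))
```

Normalizing both formats to the same (hash, file name) pair means the comparison step does not need to care which utility produced the line.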

While MD5 is not the strongest cryptographic hash tool in the world these days, it is still generally useful for verifying file integrity when downloading software. Because so many open source software development projects use MD5 hashes for verification, it is a good idea to learn how to use it and keep an MD5 hash generating tool handy if you ever need to go outside of a secure software management system when installing software.

About

Chad Perrin is an IT consultant, developer, and freelance professional writer. He holds both Microsoft and CompTIA certifications and is a graduate of two IT industry trade schools.

16 comments
Dumphrey

Thanks for the blog, especially the link to digestIT. I have been using a similar product for a while now called hash onclick, which only gives an MD5 report for a given file. It is useful when moving large files, etc., to verify integrity, but it has no copy/paste, which can be useful. One question: do you happen to know off the top of your head if the Fedora Yum manager performs hash checking? I always assumed it did...

harrylal

I like the idea. I went to the link and downloaded the Windows version. It turns out that it only works with version 1.0. Oh well, let's try the other one noted as being for all platforms. Hmmm, only compatible with version 1.4. Is there a version somewhere that is compatible with 2.x?

apotheon

YUM, APT, and portupgrade all perform hash checking. Fedora's YUM and Debian's APT each provide download verification using OpenPGP signatures. FreeBSD's portupgrade double-checks itself, verifying by both MD5 and SHA256. The better automated software management systems are nice that way.

aikimark

If one is a software developer or cyber vendor, sharing files with others, it might be prudent to calculate and publish two different hash values. An example of this is: http://www.codeproject.com/KB/security/BlockCiphers.aspx . While neither MD5 nor SHA1 is considered cryptographically 'safe' at this point, using both is still sound.

apotheon

You may want to reread some of the article. I said: "[i]While MD5 is not the strongest cryptographic hash tool in the world these days, it is still generally useful for verifying file integrity when downloading software.[/i]" Then, in [url=http://blogs.techrepublic.com.com/security/?p=377]my most recent article[/url], I said: "[i]Because downloading software involves an implicit trust in the provider of the software in the first place, the potential for abuse in file verification hashes is very slim. Because you do not get to choose the inputs that will match a given hash, you cannot simply generate two versions of a program -- one that is benign and one that is malign -- and use that to slip malware past someone's defenses while providing an MD5 hash for verification that both software files match. "On the other hand, because in authentication systems a password's only function is to produce a given hash, and circumventing the security of the authentication system does not require tricking a human being into believing a second input to a given hash is the same as the first, the security implications of a hash algorithm's collision weakness can be far greater than in the case of verifying a file download.[/i]" You might also want to more closely read some of what you are citing. From the Computerworld article: "[i]These results, while mathematically significant, aren't cause for alarm.[/i]" . . . and, finally, I never specifically recommended using MD5 over other algorithms. What I did is explain that the use of MD5 hashes for download verification is ubiquitous, and explain how to make use of it. If you find a download with a verification hash provided by a better algorithm, have at it -- there are some that provide SHA-256 in addition to MD5, for instance. Many, however, only use MD5. What are you going to do if a given download provides verification only via MD5? You don't have many options. 
On one hand, you can use it -- which is as safe as the place where you're downloading it. On the other hand, you can choose to avoid it -- which is less safe, because now you aren't using [b]anything[/b] to try to verify your download. Here's a key statement from today's article, to help you understand why this isn't the terrible security vulnerability you think it is: "[i]Because downloading software involves an implicit trust in the provider of the software in the first place, the potential for abuse in file verification hashes is very slim.[/i]" . . . and, to wrap this up, here's my final sentence from this article about MD5 hash verification, with the part of the sentence that is actually relevant to the reason for what I [b]do[/b] recommend bolded: "[i][b]Because so many open source software development projects use MD5 hashes for verification[/b], it is a good idea to learn how to use it and keep an MD5 hash generating tool handy if you ever need to go outside of a secure software management system when installing software.[/i]" Thanks for reading and commenting. I wish you had read more closely.

apotheon

I'm not aware of any updated MD5 extensions for Firefox. Try using the digestIT tool instead, if you need something for MS Windows. edit: [url=http://www.google.com/search?q=md5+windows]This[/url] might help, too.

Dumphrey

That makes me feel a little warmer towards YUM. And the discovery of YUM Extender has done much as well. Why it is not installed by default I do not know, but the act of saving header information for each repository speeds up yum installs tremendously, this behavior should have been default, but at the same time, I can see why they chose to re-scan header info each time... irritating as that may be.

aikimark

@apotheon "Then, in my most recent article, I said:" What you might have written in an unlinked/unreferenced article should not be used in your response. How are we to know that such articles exist and are relevant to this particular blog entry?

===========

I didn't disagree with your premise that MD5 is both ubiquitous and useful. I thought that your blog entry readers should know that there has been some cryptographic research showing weaknesses in MD5. Although it isn't broken, NIST has put it on their "not for future use" list and has a digest hash contest, similar to the crypto contest that gave us AES.

===========

My earlier comment in this thread should not be construed as a denigration of your original entry.

* "generally useful" is not in dispute.
* MD5 status as de rigueur hash standard is not in dispute. You use what is available.
* Software downloads should ALWAYS be conducted in a trust relationship. However, it would be advisable to scan for viruses after successful download prior to installation.

===========

I think I read your blog entry "closely" enough to understand it. I had attempted to add valuable information to readers of this blog entry, not discount its content or denigrate its author. Thank you for allowing me to clarify my comments.

apotheon

I figured I should post this separately, so it gets the attention it deserves. I want to make crystal-clear one of the reasons I have objected so strongly, and at such great length, to your complete mischaracterization of what I said in that article. Specifically: If you, in the manner in which you've phrased your statements, manage to convince readers that I actually said the things you claim I've said, it can lead to some pretty poor results. Such readers, if you manage to convince them that I said they should use MD5 instead of (for instance) SHA-256, may then assume that I know what I'm talking about when I supposedly say this and go out choosing MD5 over SHA-256 (or OpenPGP, or whatever). Thus, I would prefer that you cease and desist in all your persistence in putting such words in my mouth so that readers will not be so badly misled. I never made any such claims, and I don't want them thinking I did because of what you said. What do you imagine would happen if someone read your comment suggesting I said such a thing, and tried to decide whether to believe you on the subject of the usefulness of MD5 or what you've claimed I said? Don't you think some of these readers might figure that since I'm the one writing articles, and you're some unknown quantity schlepping in off the Internet to throw muck, I'm right and you're wrong? Ironically, they might then take your inaccurate representation of my words as an indication that I'm exhorting them to use MD5 at the expense of other algorithms, and your inflammatory statements about what I supposedly said may have the interesting (and unfortunate) result of convincing people to do the opposite of what you seem to want them to do. I simply couldn't let that happen, in good conscience. I [b]had to dispute you[/b] to maintain any sense of self-respect on this issue -- to point out where you've misrepresented my words. 
If you had simply [b]asked[/b] whether I'd meant what you instead [b]claimed[/b] I meant, and mentioned the collision weakness, I could have answered the question directly and explained the differences between a collision weakness and a preimage weakness. Instead, you took an accusatory tone, put words in my mouth, and generally created a necessarily combative atmosphere, wherein my only options were to ignore it (which I've already explained I couldn't do in good conscience) or to dispute your claims about my intent and actions (as I did).

apotheon

"[i]I didn't ask you to agree with me.[/i]" Who said you did? "[i]I was merely asking for some resolution between two MD5-related blog entries and your comments to me.[/i]" Understanding of the connection should be pretty clear based on the commentary in [url=http://techrepublic.com.com/5208-6230-0.html?forumID=102&threadID=247349&messageID=2380657][b]an earlier post[/b][/url] of mine. I tried to point that out in [url=http://techrepublic.com.com/5208-6230-0.html?forumID=102&threadID=247349&messageID=2384115][b]my just previous post[/b][/url], where I said "I actually quoted (and linked to) that article in my first response to you, elaborating upon the difference between the MD5 dangers to passwords and the effective lack of danger for software download verification in the common case." "[i]The word, use, is a directive or imperative for the reader.[/i]" It's a headline -- truncated in construction, implying more context. In this case, what it implies is "how and why to", making the full-length (and unacceptably long) version of it "How and why to use MD5 hashes to verify software downloads". Even that is truncated and implying more context, however. By the time I'm done adding in what was truncated and implied, to satisfy you, the headline would be a reprint of the article. "[i]It isn't just receivers of files who are readers of your blog entries. Software developers and program managers are amongst your readers.[/i]" If you don't get the impression that this particular article was specifically targeted to downloaders of software, then one of two things has happened: either I utterly failed to make my point, or you (probably intentionally) misread my meaning, perhaps because you wanted some way to "prove" me "wrong". I'm pretty sure I made numerous references to downloading, and none specifically to the reader providing downloads, so I'm going to hazard a guess that the failure was on your part in this case. Maybe I'm mistaken. 
In any case, you never criticized the clarity of my writing style. You've only attacked the motives you've assigned to me -- which were not, in fact, the motives I had in composing that article. "[i]MD-5 hash collisions have been detected and demonstrated by hash researchers.[/i]" . . . to which I obliquely referred in this article, and more explicitly referred in the later article to which first I, then you, linked. This is not news to me, nor is it news to many of my readers by now. Furthermore, it in no way contradicts anything I've said. "[i]NIST (and others) recommends against using them for cryptographic functions and recommends the adoption of stronger or multiple hash algorithms for future applications.[/i]" That's great. I recommend stronger algorithms than MD5 for most purposes, too -- including for providing software downloads. However: 1. Sometimes, when downloading, that's all the provider has given us. We work with what we have. 2. Much of the time, it's difficult to make a point and ensure that others will understand it without first making a different point. At some point in the future, I may write an article specifically about using stronger cryptographic hashing and signing functions to verify downloads, but first I need an article like this one to which I can link in that article for some of the background and foundation of that other article. You obviously haven't noticed, but I have a tendency to link to older articles when writing newer articles because it can provide more complete reference material. The fact that you jumped on this as though it were my Last And Only Word on the matter of cryptographic hash functions, despite my very obvious reference to the fact that MD5 isn't the strongest cryptographic algorithm in the world, seems to be at the root of your disagreement with me. Did it ever occur to you to assume I'm not a complete idiot before assailing my statements with assignment of motives (MD5 advocacy) that are not, in fact, mine? 
"[i]It shouldn't matter whether your articles were about passwords or file downloads.[/i]" Really, I'm not sure you know what you're talking about. I'm trying to drag the vast, encryption-unfamiliar mass of IT professionals into the realm of security knowledgeability, not get back-pats from technology purists who never actually deal directly with the outside world. I'm addressing people downloading software, and I went to some pains to specifically address those people -- I'm not telling programmers to provide MD5 hashes with their software when distributing it. If you can find a statement of mine that says otherwise, I'll eat my computer. "[i]MD-5 has weaknesses and isn't a compensation for the trust relationship you might have with the source.[/i]" Well, good! I wouldn't want to think I was wrong when I specifically pointed out that it's [b]not[/b] a "compensation" (aka "replacement") for the trust relationship with the software distributor. You present sentences like these as though they refute my statements, which is confusing, considering they're basically paraphrases of my statements in the very article you criticize! "[i]There are man-in-the-middle exploits and rootkit viruses that could take advantage of the MD-5 weakness(es) to distribute malware, even from a trusted source.[/i]" Um, no, there aren't. This is where I start doubting you know what you're talking about, again. See previous comments about the difference between a collision weakness and a preimage weakness. A collision weakness is a problem when you cannot necessarily trust someone, but a cryptographic hash or digital signature can be used to technically verify that the other party is treating you fairly in this instance. It is [b]not[/b] a problem when you simply cannot trust the medium of transfer. 
That's why the only way the collision weakness is a problem for file verification is if the distributor is trying to pull a fast one on you -- and, when dealing with software downloads, even that danger is so vanishingly small that in cases like the MD5 collision weakness it's probably ignorable. Preimage weaknesses basically allow you to forge a successful hash check; collision weaknesses do not. There's also the collision weakness problem involved in authentication, of course, but that's only because the content [b]doesn't matter[/b] in that case: only the hash does. That means that a collision weakness increases the number of matching permutations, thus potentially reducing the necessary time for a brute-force attack (in some cases). Please read up on collision vs. preimage weaknesses before revisiting this particular branch of your so-called argument. "[i]How I got to the article isn't important.[/i]" I didn't say it was, except to point out two things: 1. that it's obvious you aren't paying much attention to what I'm saying -- since I specifically linked to, and quoted from, the other article; 2. that you apparently have a hypocritical tendency to declare my statements null, void, and irrelevant if they come from other articles while, ironically, you quote [b]the same article[/b] in a response to me, thus making it clear that there's a double-standard which makes the same behavior for you "okay". "[i]If someone disagrees with you, it is OK to pounce on them?[/i]" Much of the problem here is that you [b]didn't[/b] disagree with [b]me[/b] -- you just disagreed with things I didn't say, and accused me of saying them. "[i]If someone comments that the underlying technology might be flawed, it is OK to pounce on them?!?[/i]" Is it okay to "pounce on" me for "recommending" a flawed technology when, in fact, I only recommended learning how to use it and [b]even pointed out that it's a flawed technology myself[/b] in the article to which you objected? 
Is it not okay for me to [b]point that out[/b]? "[i]I'm not belittling this blog entry.[/i]" Maybe not with that sentence -- or maybe you aren't trying to belittle the article. Maybe you're just trying to belittle me. Frankly, I don't care either way -- I just don't want you playing Character Assassination Games and trying to pump up your own ego by distracting people from actually learning something from the article. "[i]I think it is also important for folks to know about MD-5 weaknesses if they are planning/designing/writing/deploying a file distribution system[/i]" Feel free to give me hell if I write an article about those topics, rather than about being on the receiving end, in which I talk about MD5 without pointing out there are better options. "[i]might be relying on the strength of MD-5 to ensure integrity of the files they download from their trusted sources.[/i]" It's better to use MD5 than nothing at all, when MD5 is all that's available for a given download. It's also the case that MD5's weakness is not as notable in the context of receiving files from trusted sources -- as I pointed out, and even tried to explain in response to you taking issue with it, but that you seem unwilling to grasp.

aikimark

1. I didn't ask you to agree with me. This far into our online discussion, we don't seem to even be communicating. I was merely asking for some resolution between two MD5-related blog entries and your comments to me. 2 & 6. "I never made a specific recommendation that people should use MD5 ..." The title of your original blog entry begins with the word "Use", not "Using". The word, use, is a directive or imperative for the reader. 3. It isn't just receivers of files who are readers of your blog entries. Software developers and program managers are amongst your readers. My comments were directed as much to the distributors of files as the recipients of the files. 4. MD-5 hash collisions have been detected and demonstrated by hash researchers. NIST (and others) recommends against using them for cryptographic functions and recommends the adoption of stronger or multiple hash algorithms for future applications. It shouldn't matter whether your articles were about passwords or file downloads. MD-5 has weaknesses and isn't a compensation for the trust relationship you might have with the source. There are man-in-the-middle exploits and rootkit viruses that could take advantage of the MD-5 weakness(es) to distribute malware, even from a trusted source. 5. How I got to the article isn't important. I didn't know about the other article until your responses prompted me to look at the history of articles you'd written. 7. If someone disagrees with you, it is OK to pounce on them? If someone comments that the underlying technology might be flawed, it is OK to pounce on them?!? It seems that we've spent a lot of time responding to comments without actually helping readers of this blog entry or the subsequent comments. I even commented on one of my comments. I'm not belittling this blog entry. It is a good thing for you to inform folks about hashing topics...definition, different algorithms, common uses, weaknesses. 
I think it is good for folks to know what tools are available when checking hashes. That said, I think it is also important for folks to know about MD-5 weaknesses if they are planning/designing/writing/deploying a file distribution system or might be relying on the strength of MD-5 to ensure integrity of the files they download from their trusted sources.

apotheon

"[i]You bust my chops (at least write very defensively/derisively), but write a new blog entry agreeing with my comments?!?[/i]" 1. You might notice, if you read closely, that the article to which you linked was posted a day before your first comment here -- and it was written even before that. I'm not writing posts to agree with you; your characterization of events is misleading. 2. You didn't just comment on a collision weakness in the MD5 algorithm. You said I should rethink my recommendation of MD5. I never made a specific recommendation that people should use MD5 -- I just explained how it's used, mentioned the fact it isn't as strong as some other algorithms, and explained why it's not a big problem to use it for verifying downloads. 3. Speaking of why it's not a problem: Because the collision weakness is not a preimage weakness, the only way it's a problem for software download verification is if the person providing the download is untrustworthy. If that person is untrustworthy, you're screwed anyway -- so it's an entirely superfluous weakness in that respect. Even then, it would be extremely difficult for the person providing the download to come up with two working pieces of software, one innocuous and one malicious, that match the same MD5 hash. 4. The newer article to which you linked is about password hashes, not download verification hashes. The security landscape is somewhat different there, thus justifying giving a crap about the collision weakness in that context. I even explained that in some detail in that article. 5. I actually quoted (and linked to) that article in my first response to you, elaborating upon the difference between the MD5 dangers to passwords and the effective lack of danger for software download verification in the common case. 
You seem to have ignored that, however, and didn't bother to check the material until that same article appeared in your inbox in a TR newsletter (the URL you provided, ending in "&tag=nl.e036", indicates how you got to the article). Additionally, when I quoted that article you decided that it somehow "didn't count" -- even though I was just using it to elucidate a point -- because it wasn't from the same article as the one to which you responded. Somehow, though, it's okay for [b]you[/b] to quote from my other articles -- from [b]the same article[/b], even. What's up with that? 6. I didn't write "defensively". I pointed out all the problems with your statements, starting with the one that suggested that I "recommended" MD5 over other algorithms. 7. Even if I did write "defensively", it's not entirely unwarranted considering the accusation implicit in the title of your first post in this discussion. edit: added some quotes

aikimark

@Chad In your more recent blog entry: http://blogs.techrepublic.com.com/security/?p=377&tag=nl.e036 You seem to acknowledge the points I've made about the use of MD5: "The problem with the MD5 hash algorithm is that it suffers from a collision weakness. This means that someone could generate two separate inputs that both produce the same hash output from the MD5 algorithm. There are some significant negative security implications for this. For instance, someone could create two files that produce the same cryptographic hash, one of which appears to be innocuous and the other of which matches the hash of the first but in some way defrauds or attacks someone who expects the innocuous message and uses the hash to verify it." ==================== So, what's going on here? You bust my chops (at least write very defensively/derisively), but write a new blog entry agreeing with my comments?!?

apotheon

"[i]What you might have written in an unlinked/unreferenced article should not be used in your response. How are we to know that such articles exist and are relevant to this particular blog entry?[/i]" I didn't realize this was a game with rules, and you wanted to win so badly. I brought it up to point out that you seem to have misunderstood my point -- to reinforce the quote I provided from the article to which you directly responded, not as a point meant to stand alone. Telling me that referencing other sources to demonstrate that is against the rules of your game is unlikely to convince me of anything other than that you have a petty need to prove people "wrong" about things. Obviously, I hope that is not your motivation -- but I'm having a hard time coming up with a believable alternative interpretation right now. Also . . . I now realize that my presentation of my statements from the immediately preceding comment was not as clear as it could be. Another reason for bringing up what was said in another article was to point out that you seemed to have mistaken my attempt to keep the article to which you responded on-topic (and the weaknesses of MD5 outside of verifying communications within a trust relationship are definitely off-topic for that article) for an implicit statement that the MD5 algorithm lacks weaknesses. "[i]My earlier comment in this thread should not be construed as a denigration of your original entry.[/i]" Perhaps not the content of the comment -- but the title could be interpreted no other way, where you said "You may want to rethink your MD5 recommendation". "[i]I think I read your blog entry 'closely' enough to understand it.[/i]" I find that difficult to believe, considering I never recommended MD5 as a preferred cryptographic algorithm -- but you very clearly indicated that you thought I did so. Perhaps that wasn't your [b]intent[/b], but it's what you had accomplished.