Mirroring Web sites with wget

Vincent Danen demonstrates the basics of wget for mirroring entire Web sites. It can be used for HTTP, HTTPS, and FTP sites either anonymously or with authentication for all of these protocols.

Recently, one of my favorite security-related sites was almost shut down due to the operator's lack of time to keep it up to date. The site provided proof-of-concept and exploit code for various security vulnerabilities in a wide range of products across multiple platforms. The site, milw0rm, is invaluable to security researchers, and not to have access to that data would have been a huge loss.

Of course, once I heard the site might be going down, many other security researchers and I rushed to grab a local mirror of its contents, which left a large number of people hammering the site at once. This minor panic reminded me of the importance of a good site mirroring tool.

The quickest and easiest way to mirror a remote Web site is to use wget. Wget is similar to cURL (and I'll be the first to admit that I prefer cURL over wget), but wget has some really slick and useful features that aren't found in cURL, such as a means to download an entire Web site for local viewing:

$ wget -rkp -l6 -np -nH -N http://example.com/

This command does a number of things. The -rkp option is three short options combined: -r tells wget to download recursively, -k converts the links in the downloaded HTML pages to point to local files, and -p fetches the images and other files needed to render each page properly.

The -l6 option tells wget to recurse to a maximum of six nested levels, while -np tells it never to ascend into the parent directory of the starting URL. The -nH option tells wget not to create host directories; this means that the files will be downloaded to the current directory rather than to a directory named after the hostname of the site being mirrored.

Finally, -N tells wget to use time-stamping, which is its way of avoiding re-downloading files that have not changed since the last run. Unfortunately, with dynamic sites being the norm, pages often do not report a meaningful modification time, so this may not work very well, but it's worth adding regardless.
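The short options above are compact but hard to read six months later. As a sketch, here is the same command spelled out with wget's long options and wrapped in a small shell function. The function name mirror_site and the WGET override variable are my own inventions for illustration, not part of wget.

```shell
#!/bin/sh
# Long-option spelling of: wget -rkp -l6 -np -nH -N URL
#   --recursive            (-r)   follow links and download recursively
#   --convert-links        (-k)   rewrite links to point to local files
#   --page-requisites      (-p)   fetch images and other files a page needs
#   --level=6              (-l6)  recurse at most six levels deep
#   --no-parent            (-np)  never ascend into the parent directory
#   --no-host-directories  (-nH)  do not create a directory named after the host
#   --timestamping         (-N)   skip files that have not changed
#
# mirror_site is a hypothetical helper name. Setting WGET=echo prints the
# arguments instead of running wget (an assumed dry-run convention).
mirror_site() {
    ${WGET:-wget} \
        --recursive \
        --convert-links \
        --page-requisites \
        --level=6 \
        --no-parent \
        --no-host-directories \
        --timestamping \
        "$1"
}
```

With the function defined, mirror_site http://example.com/ should behave the same as the one-line command shown earlier.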

Wget is capable of mirroring HTTP, HTTPS, and FTP sites. It can do so anonymously or with authentication for all of these protocols. The wget man page has a lot of information on the wide variety of options and is well worth checking out.
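For a site that requires a login, one way to combine the mirroring options with authentication is sketched below. It uses wget's --user and --ask-password options; the function name mirror_with_login, the WGET override variable, and whatever user name and URL you pass in are placeholders of mine rather than anything from a real site.

```shell
#!/bin/sh
# mirror_with_login is a hypothetical helper name.
#   $1 = user name (placeholder)
#   $2 = http://, https://, or ftp:// URL to mirror (placeholder)
# --ask-password makes wget prompt for the password, so it never appears
# on the command line. Setting WGET=echo prints the arguments instead of
# running wget (an assumed dry-run convention).
mirror_with_login() {
    ${WGET:-wget} -rkp -l6 -np -nH -N --user="$1" --ask-password "$2"
}
```

Calling mirror_with_login with a user name and a URL should prompt for the password and then mirror the site with the same options as before.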

About

Vincent Danen works on the Red Hat Security Response Team and lives in Canada. He has been writing about and developing on Linux for over 10 years and is a veteran Mac user.
