
Mirroring Web sites with wget

Vincent Danen demonstrates the basics of using wget to mirror entire Web sites. It works with HTTP, HTTPS, and FTP sites, either anonymously or with authentication.

Recently, one of my favorite security-related sites was almost shut down due to the operator's lack of time to keep it up to date. The site provided proof-of-concept and exploit code for various security vulnerabilities in a wide range of products across multiple platforms. The site, milw0rm, is invaluable to security researchers, and not to have access to that data would have been a huge loss.

Of course, once I heard the site might be going down, I, along with many other security researchers, rushed to grab a local mirror of its contents, which meant a large number of people hammering the site at once. This minor panic reminded me of the importance of a good site-mirroring tool.

The quickest and easiest way to mirror a remote Web site is to use wget. Wget is similar to cURL (and I'll be the first to admit that I prefer cURL over wget), but wget has some really slick and useful features that aren't found in cURL, such as a means to download an entire Web site for local viewing:

$ wget -rkp -l6 -np -nH -N http://example.com/

This command does a number of things. The -rkp options (shorthand for -r, -k, and -p) tell wget to download recursively, to convert the links in downloaded HTML pages so they point to local files, and to fetch the images and other files needed to properly render each page.

The -l6 option tells wget to recurse to a maximum of six nested levels, while -np tells it not to ascend into the parent directory. The -nH option tells wget not to create host directories; the files are downloaded into the current directory rather than into a directory named after the host being mirrored.

Finally, -N enables time-stamping, wget's way of avoiding re-downloading files that have not changed. With dynamic sites being the norm, this may not help much, but it's worth adding regardless.
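If the bundled short flags are hard to read, the same command can be spelled out with wget's long options; this is simply the equivalent of the command above:

$ wget --recursive --convert-links --page-requisites --level=6 --no-parent --no-host-directories --timestamping http://example.com/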

Wget is capable of mirroring HTTP, HTTPS, and FTP sites, either anonymously or with authentication for any of these protocols. The wget manpage documents the wide variety of options in detail and is well worth reading.
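As a rough sketch, mirroring a password-protected site looks much the same; the host, username, and password below are placeholders, and the --user/--password options supply credentials for FTP or HTTP alike:

$ wget -rkp -l6 -np -nH -N --user=myuser --password=mypass ftp://example.com/pub/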


About

Vincent Danen works on the Red Hat Security Response Team and lives in Canada. He has been writing about and developing on Linux for over 10 years and is a veteran Mac user.

14 comments
knura

Although I have been using wget for many years, I had simply overlooked the "-np" option :) With -np you can mirror just part of a site, e.g., a specific version of a Linux distro. Thanks! For mirroring, I have used wget's --mirror option. I have also used httrack and ghttrack (a GTK GUI frontend). curl is another option, although I have very little experience with it.
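For example, something like the following should mirror just one subtree (the URL is only a placeholder); --mirror is shorthand for -r -N -l inf --no-remove-listing:

$ wget --mirror -k -p -np http://example.com/releases/10.0/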

raynebc

The last time I tried a tool like this, I found that some websites have code to block "zombie" downloading for bandwidth reasons. Does wget allow you to disregard such a thing if it is configured in the web page's code?

Fman99

Just the thing I was looking for! Needed to get all of my content off of Geocities before they shut it down next month, and it's done in a snap!

mulm

This is awful. What happened to using rsync through SSH?

riceski

How do I get started on this without having to read my eyeballs out?

csmith.kaze

I knew you could do things like this, but have never tried. Now we should mirror all of Google.

raynebc

With the ability to copy an entire website and even load linked graphics, it would be a trivial task.

Neon Samurai

Rsync over ssh is the way to go if you're transferring between two machines you have login access on. Wget would be the tool for building a local mirror if you didn't have login access or wanted a local tree for browsing.
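A minimal sketch of that rsync-over-ssh approach, assuming you have a shell account and with the host and paths as placeholders:

$ rsync -avz -e ssh user@example.com:/var/www/site/ ./site-mirror/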

Neon Samurai

Get a copy of wget on your machine, go to an empty testing directory, and run the command given in the article against a small website of your choosing.

Neon Samurai

If you're spoofing, you probably only need a few of the pages, and a browser's Save As can dump those without any issues. Wget would also be no more of a concern than the multitude of other website mirroring programs available. The mirrored site would then have to be presented; you still have the problem of redirecting traffic or clicks to your fake front. If the site admin is remotely intelligent, you then also have to fake the SSL cert. (Sadly, this is still far easier to pull off than it should be.)

mulm

Fair enough. Didn't think of that usage. I guess I am used to having login access, so I generally stick with rsync.

Neon Samurai

I primarily use wget to pull files from servers rather than start X or open a browser. Being able to download NVIDIA drivers, ALSA source, or other bits can be very handy. In the past, it's also been useful for things like mirroring the Debian user's manual to my PDA for reading and reference.
