The World Wide Web is a wonderful, but really weird place, partly because it is so big. You can spend half the time you are awake on it every day, for years, without ever stumbling into someone, or something, that could be very useful to you. I had such an experience last week.
If you follow me on this column, you will already know how fond I am of Web Scraping. In spite of that, I had never heard or read about the 10+ year old program Surfraw, until a few days ago. To make things even weirder, I really can’t remember on which page I finally found a link to it, or what exactly I was looking for in that moment.
I somehow feel that this “mysterious” discovery fits in well with the history of this program, and the attitude behind it. As far as attitude goes, nothing but a quote from the home page can explain it better:
Surfraw reclaims [search engines and many other Web services] from the false-prophet, pox-infested heathen lands of html-forms, placing these wonders where they belong, deep in unix heartland, as god loving extensions to the shell… a Surfraw liberateur is capable of navigating speeds that leave GUI tainted idolaters agape with fear and wonder.
A bit messianic, isn’t it? Ditto for the name: Surfraw stands for “Shell Users’ Revolutionary Front Rage Against the Web”. The copyright information only reinforced this impression, as the code isn’t strictly Free-as-in-Freedom Software, but (emphasis mine):
The copyright holders listed above assert no rights on this release of the software surfraw and thereby explicity place this release into the public domain. Do what you will.
Then I discovered that the copyright holders include a “Julian Assange, 2000-2001”. That explained my feeling, and made me even more interested in trying Surfraw.
What the heck does Surfraw do anyway?
Surfraw is a command line program that finds information for you online, by querying the right search engines or other Web services, much more quickly than you could. You may think of it as a console version of the custom search bars you can have in modern graphical browsers. Invoke Surfraw in this way:
surfraw google -results=4 "country music"
And it will open your default browser on the Google search results for “country music”, formatted four per page.
Meet the Elvi(s)
Of course, if this were all you could do with Surfraw, there would be little or no reason to use it today. Opening a search engine, or any other portal, by typing a command at the prompt made much more sense in 2000/2001 than now, when many of us have a graphical browser, with a dedicated, customizable search bar, open all the time anyway.
Surfraw, however, can do much more than that, saving you a lot of time, for two reasons. First of all, being a command line program means that you can use it automatically, in combination with other programs, inside scripts. Surfraw has two options just for these cases:
-p | -print
-o | -o=FILE
The first one prints the URL that Surfraw would normally pass to the browser, so you can save it, and use it as is, in your scripts, even on systems where Surfraw is not installed. Here is an example:
#> surfraw google -results=4 "country music" -p
The -o option passes the URL corresponding to your search to a text-mode browser, then dumps the result of the search to standard output, or to FILE, from where you can process it in any way you want.
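As a rough illustration of the scripting use case, here is a minimal sketch that collects the URLs for several searches in one go. It assumes that surfraw is installed and on your PATH, and that the -p option is accepted before the search terms; the function name urls_for_queries is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print, one per line, the URL that Surfraw would
# open for each Google query passed as an argument.
# Assumes the "surfraw" command is available on PATH.
urls_for_queries() {
    for query in "$@"; do
        # -p prints the URL instead of starting a browser
        surfraw google -p "$query"
    done
}

# Example use: save the URLs to a file for later.
# urls_for_queries "country music" "bluegrass" > urls.txt
```

The resulting file contains plain URLs, so it can be fed to any other tool, or copied to a machine where Surfraw is not installed.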
The other, even more powerful, feature of Surfraw is the number (more than 100, plus third-party ones) and, above all, the variety of its Elvis. A Surfraw Elvi (yes, the name is a tribute to that Elvis) is a snippet of shell code that knows how to build the query URL for the corresponding website. Surfraw provides, among many others, Elvis ready to visit on your behalf:
- the Los Alamos Science E-Print Archive or the SAO/NASA Astrophysics data system
- Stock quotes listings
- Many FOSS software archives
- Wikileaks cables (who would have thought?)
- Various bugzillas
- Currency conversion portals
- BitTorrent listings
- Wolfram MathWorld
- Repositories of PGP Keys
- The Internet Archive’s Wayback Machine
- Pages suggesting the best rhymes for certain words
See what I mean? Combine all this with how easy it is to process the results with other tools and, even in this “browser-always-open” age, you’ll get a powerful assistant for researchers, writers, and all other heavy WWW users. You may even integrate Surfraw in your Firefox search bar, or get a unified interface to all the Surfraw Elvis and bookmarks.
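To make the idea of “a snippet of shell code that knows how to build the query URL” more concrete, here is a minimal sketch of that technique. It is not taken from Surfraw's own code: the function names urlencode and example_elvis, the www.example.com address and the q parameter are all placeholders invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not Surfraw's own code: percent-encode a string
# byte by byte so that it can be placed safely in a query string.
urlencode() {
    # od prints every input byte as two hex digits, tr puts one per line
    printf '%s' "$1" | od -A n -v -t x1 | tr ' ' '\n' | while read -r hex; do
        [ -n "$hex" ] || continue
        case "$hex" in
            # unreserved characters (0-9, A-Z, a-z, "-", ".", "_", "~")
            3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a]|2d|2e|5f|7e)
                # turn the hex value back into the original character
                printf "\\$(printf '%03o' "0x$hex")" ;;
            *)
                # everything else becomes %XX with uppercase hex digits
                printf '%%%s' "$hex" | tr 'a-f' 'A-F' ;;
        esac
    done
}

# Hypothetical Elvi-like function: joins its arguments into one search
# string and prints the query URL for a placeholder website.
example_elvis() {
    printf 'https://www.example.com/search?q=%s\n' "$(urlencode "$*")"
}

# Example use:
# example_elvis country music
```

A real Elvi also has to deal with site-specific options, such as the -results one shown earlier, but the core job is the same: turn words typed at the prompt into a well-formed URL.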
How to install and learn to use Surfraw
You have no excuse not to try Surfraw. Installation is a no-brainer on the most popular GNU/Linux distributions: just tell your package manager to grab the Surfraw binary package, with all its dependencies, from the standard repositories you already use. Next, take some time to try the several options, saving the ones you want in your Surfraw personal configuration file ($HOME/.config/surfraw/conf).
Final word of advice: while installation is a snap, and documentation is clear, some of the examples may not work without changes. When this happens, the most likely reason is simply that the distribution packagers put files in different directories. On Fedora 17, for example, I found that the Elvis are not stored in /usr/lib/surfraw/, but in /usr/libexec/surfraw.
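If you are not sure where your distribution put the Elvis, a small check like the following sketch may save some guessing. It only tries the directories you pass to it; the function name first_existing_dir is made up for this example, and the two paths in the usage comment are simply the ones mentioned above, so other distributions may well use different ones.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory, among the candidates
# given as arguments, that actually exists on this system.
# Returns a non-zero status if none of them exists.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example use, with the two locations mentioned in this article:
# first_existing_dir /usr/lib/surfraw /usr/libexec/surfraw
```

Once you know the right directory, you can replace the path in any documentation example that does not work as written.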