Surfraw is a command line tool that automates queries to many types of websites. Out of the box, Surfraw knows how to talk to standard search engines, mailing list archives, Wikileaks cables, Ebay, software repositories and many other portals. When I first explained why and how to use Surfraw, among other things I wrote that:
…the real feature of Surfraw is the number… and above all the variety of its Elvis. A Surfraw Elvi… is a snippet of shell code that knows how to build the query URL for the corresponding website.
This quickly prompted a reader to ask for details on how to write a custom Elvi. I promised to answer sooner or later, so here I am.
The official Surfraw documentation includes a Hacking Guide that explains how to write your own Elvis but has, in my opinion, a couple of limitations. One is its general style, sometimes as messianic as the Surfraw home page; another is that it takes a few crucial things more or less for granted.
This may discourage beginners from trying to extend Surfraw, which would be a shame: the tool is wonderful, and writing Elvis isn't really as complicated as its own Guide makes it sound. Therefore, I will only try to complement that guide, focusing on the one thing it glosses over.
Get the code!
The Surfraw source code, and that of the average Elvi, is much longer than the scripts I usually insert in these posts of mine. To let you follow my explanation, I include here, as a tar archive, the whole code of the actual Surfraw script and of the Google Elvi mentioned in the documentation. Please download and open those two files before reading the rest of the post.
Making your own Elvi
Are you ready? Great, let’s begin!
Surfraw queries websites by building a special URL for them, with all your parameters inside, and fetching it with a text browser (URL stands for "Uniform Resource Locator", or "Internet address" in layman's terms).
Elvis are launched in line 558 of the Surfraw script:
sh -c "$elvidir/$elvi $opts $searchterms"
That basically means "execute the code in the file $elvidir/$elvi with the options and search terms previously calculated". The options vary from Elvi to Elvi. The ones for Google, for example, may include whether or not you are "feeling lucky".
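To make this concrete, here is roughly what the expanded command could look like for the Google search used later in this post. The path is only the default Elvi folder on my Debian-based system, so treat it as an example; yours may be different, and the exact options depend on how you invoked Surfraw:

sh -c "/usr/lib/surfraw/google bash shell"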
The Surfraw guide explains how to write Elvis using the Google one as a reference but, in my opinion, it doesn't explain simply enough how to gather the information needed to build those URLs.
In the case of the Google Elvi, the URL construction happens in lines 92 to 121 of the google.sh file. Line 120 shows that the URL consists of three parts:
url="${url}${safe}${extra}"
The first piece contains the generic URL of the Google website, with country code included if needed. The second is the value of the Safe Search option, and the last contains the actual search query and options, all properly encoded.
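To make that line less abstract, here it is again with hypothetical values filled in by hand, roughly what a plain search for bash shell with default settings amounts to. The real values are computed in the lines above it in google.sh; the point here is only that the three variables are ordinary strings glued together:

url="http://www.google.com/search"    # generic Google URL, country code if needed
safe=""                               # empty, because no Safe Search option was given
extra="?q=bash%20shell&num=30"        # encoded search terms and options
url="${url}${safe}${extra}"

The result is the same URL that Surfraw will print in the test a few paragraphs below.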
That's cool, but how did they know they had to write just that code? How do we do the same thing for another website? We need to figure out, by looking at that code, what the general method is and which starting points to use to create these URLs.
In order to do this, first ask surfraw what URL it would use to search for bash and shell on Google:
#> surfraw -p google bash shell
http://www.google.com/search?q=bash%20shell&num=30
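You can repeat this test changing one option at a time, to see which part of the URL that option controls. The number of results, for example, is stored in the SURFRAW_google_results variable that appears again further down; assuming, as is normally the case, that Surfraw lets variables set in the environment override its defaults, this command should print the same URL with &num=50 at the end instead of &num=30:

SURFRAW_google_results=50 surfraw -p google bash shell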
Next, do the same search manually: go to www.google.com, type bash shell, and press Google Search. When the results appear, look in the address bar of your browser. It will contain a URL that expresses just what you asked Google to search, and will look something like this:
https://www.google.com/search?q=bash+shell&STUFF
(the URL above is cut for brevity: “STUFF” stands for “many other non-critical parameters, possibly violating your privacy, that tell Google things like what browser you use”).
At this point, looking side by side at the Surfraw output, the URL in your browser and the google.sh file will show what the method is.
Surfraw parses and encodes the arguments it receives always using the same standard functions, as you can see in line 70 of google.sh. In our example, it is those functions that encoded the space between the two search terms as "%20" (where the browser used a "+"). Apart from the encoding, which is always the same, what makes that file work on Google.com, and nowhere else, are the custom pieces manually written in lines 113 to 115, here highlighted between <<< and >>> markers for clarity:
url="${url}. <<< google.${domain}/${search} >>>"
escaped_args=`w3_url_of_arg $w3_args`
url="${url} <<< ?q= >>> ${escaped_args}&num=${SURFRAW_google_results}"
See the trick? What the author of that Elvi did was simply to:
- look at the URL generated by the Google form, for several queries
- copy and paste into the code the relevant, Google-specific parts, like “?q=”
- repeat until Surfraw produced the same URLs
Some parts were added as general options, but they were obtained with the same method. This is the same process you have to follow to create Elvis for any other website. Possibly boring, but not difficult, is it? The only exceptions are websites that only accept data via the POST method, which Surfraw does not support at all, as explained in the documentation.
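To give you an idea of where the process ends up, here is a minimal, hypothetical Elvi for an imaginary website, www.example.com, assumed to accept queries of the form example.com/search?q=TERMS. It is only a sketch modelled on the structure of google.sh and on the template in the official guide, so compare it with the real files before trusting any detail; all the w3_ helper functions come from the main Surfraw script, which every Elvi sources at the top:

#!/bin/sh
# elvis: example		-- Search the imaginary site www.example.com
. surfraw || exit 1

w3_usage_hook () {
    cat <<EOF
Usage: $w3_argv0 [options] [search words]...
Description:
  Surfraw search of www.example.com
EOF
    w3_global_usage
}

w3_config
w3_parse_args "$@"
# w3_args now contains the search terms
if test -z "$w3_args"; then
    # no search terms: just open the home page
    w3_browse_url "http://www.example.com/"
else
    # encode the terms with the standard Surfraw helper, then paste them
    # into the site-specific part of the URL, exactly as google.sh does
    escaped_args=`w3_url_of_arg $w3_args`
    w3_browse_url "http://www.example.com/search?q=${escaped_args}"
fi

The only parts you really have to discover yourself, with the copy-and-compare method described above, are the two URLs passed to w3_browse_url.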
And finally…
When the Elvi file is finally ready, all that's left is adding it to the rest of Surfraw. This is explained very simply in Appendix 1 of the official guide. The only thing to note on that front is that, if you only want to use the Elvi on your own computer, steps 1-3 are enough. If instead you want to make an Elvi package to share, you should follow the whole procedure.
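I will not copy that Appendix here, but for the local case the whole operation usually boils down to a handful of commands like the ones below, using the hypothetical example Elvi sketched earlier. The personal Elvi folder shown here, ~/.config/surfraw/elvi, is the one recent Surfraw versions search by default; if yours is older, check the guide or the value of elvidir in the main script:

chmod +x example
mkdir -p ~/.config/surfraw/elvi
cp example ~/.config/surfraw/elvi/
surfraw -elvi | grep example    # the new Elvi should now appear in the list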