How to use shell functions to fetch information online

Marco Fioretti shows two examples of shell functions that you can use for web scraping when all you need is a quick way to extract text from a given website.

Even in this age of touchscreen devices, many computing activities are much faster if you know the right tricks and stick to plain old typing. In my case, this applies to retrieving certain types of information from the Internet.

Most of my work consists of typing at a prompt or in applications that, like the Kate text editor or the Dolphin file manager, have an embedded terminal (and one of the reasons I prefer such applications is exactly that they make using certain tricks faster, no matter what else I am doing).

I save a non-negligible amount of time when I'm doing system administration or just writing some text, thanks to shell functions like the ones I'm going to present in a moment. Please note that none of these functions does anything difficult or advanced. All they do is fetch some simple data that I often need from the Internet, in the fastest possible way, without forcing me to switch to another window. They are functions rather than standalone scripts because I also use them inside several scripts.

What are shell functions anyway?

In software programming, functions are blocks of code that perform one specific task, written in a way that can be easily reused and shared by many programs, possibly running every time with different input values.

Unix shells, that is, the command interpreters that actually execute what we type at a prompt or save in a script, have functions just as compiled languages such as C or C++ do. Shell functions can be called either at the prompt or from a script, and you only need to know a few things to start writing and using them:

  • Shell functions must be defined before you invoke them.
  • To have your functions always available at the prompt, you can save them in the $HOME/.bashrc file (or the equivalent one for non-Bash shells).
  • In Bash, the default shell on most GNU/Linux distributions, functions can be defined in these two equivalent ways:
  function my_bash_function { the function code goes here... }
  my_bash_function () {  the function code goes here...  }
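
As a minimal sketch of those rules, here is a tiny function that is defined first and then called like any other command. The name greet and its default value are made up for this illustration.

```shell
# Define the function first: the shell must have read this before any call.
greet () {
    # ${1} is the first argument passed to the function.
    # Falling back to "world" when it is missing is an assumption of this example.
    echo "Hello, ${1:-world}!"
}

# Call it like any other command, with or without an argument.
greet
greet "San Francisco"
```

Inside a function, ${1}, ${2} and so on refer to the function's own arguments, not to those of the script that calls it.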

Weather forecast

A function I (have to) use more often than is good for me, at least in certain periods, is the one that prints the weather forecast. Yes, I too think that looking out of the closest window would be much simpler and smarter, but what about when you're in a conference or meeting room without windows? Here is how this function works:

  [marco@polaris ~]$ weather
  Weather for Rome
        4°C                 Thu    Fri    Sat    Sun
  [sun] Clear              [sun]  [sun]  [par]  [sun]
        Wind: S at 6 km/h
        Humidity: 75%      10° 3° 12° 4° 13° 5° 11° 3°
  [marco@polaris ~]$

And this is its code:

  weather ()
  {
  w3m -dump "http://www.google.com/search?hl=en&lr=&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&q=weather+${1}&btnG=Search" >& /tmp/weather
  grep -A 5 -m 1 "Weather for" /tmp/weather| cut -c28-
  rm /tmp/weather
  }

The function uses the w3m text-based browser to ask Google for the weather forecast, saves the result in a temporary file, and then extracts from it the six lines starting from the one that contains the "Weather for" string, cutting unnecessary empty columns. If invoked without arguments, this function will return the forecast for what Google thinks is your current location, but you may also specify other places, e.g. weather "San Francisco".
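
A place name such as "San Francisco" puts a literal space into the URL, which is not strictly valid and may or may not be tolerated by the browser. One possible workaround, sketched below, is to turn spaces into "+" signs before building the query. The helper names query_words and weather_url are hypothetical, the query string is shortened for readability, and other special characters are not handled.

```shell
# Hypothetical helper: turn "San Francisco" into "San+Francisco".
query_words () {
    printf '%s\n' "$1" | tr ' ' '+'
}

# Build (but do not fetch) a search URL for the given place.
# The query string is a shortened version of the one used above.
weather_url () {
    printf 'http://www.google.com/search?hl=en&q=weather+%s\n' "$(query_words "$1")"
}

weather_url "San Francisco"
```

The result could then be passed to w3m -dump in place of the hand-written URL.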

What makes this function generally useful is that it shows how easy it is to get started with Web scraping. This term describes exactly what you have just seen at work in the example above: download the text version of some Web page, then cut and slice it to extract only the data you really need, all automatically. The function that follows uses the same technique to fetch another kind of information I often need.
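
The "cut and slice" step does not depend on where the text comes from, so it can be tried offline. The sketch below assumes a hypothetical helper, slice_page, that reads a text dump on standard input and keeps only the first matching line plus a few lines after it. The sample text stands in for real w3m output.

```shell
# Hypothetical helper: read a text dump on standard input and print the
# first line matching pattern $1, plus the $2 lines that follow it.
slice_page () {
    grep -m 1 -A "$2" -e "$1"
}

# Sample text standing in for the output of "w3m -dump some_url".
printf '%s\n' 'Page header' 'Weather for Rome' '4 C, Clear' 'Wind: S at 6 km/h' 'Page footer' |
    slice_page "Weather for" 2
```

With a real page, something like w3m -dump "$url" | slice_page "Weather for" 5 | cut -c28- would do the same job as the weather function without a temporary file.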

Word Definition

What does that word exactly mean? When I'm in doubt, I ask my shell:

  [marco@polaris ~]$ define weird
   weird (wîrd)
   adj. weird·er, weird·est
   1. Of, relating to, or suggestive of the
   preternatural or supernatural.
   2. Of a strikingly odd or unusual
   character; strange.
   3. Archaic Of or relating to fate or the
   Fates.
   n.
   1.
   a. Fate; destiny.
   b. One's assigned lot or fortune,
   especially when evil.

The answer, of course, comes from a function very similar to the one that provides weather forecasts:

  define ()
  {
  w3m -dump "http://www.thefreedictionary.com/${1}" >& /tmp/define_word
  grep -A 15 ^Advertisement /tmp/define_word | cut -c20-60
  rm /tmp/define_word
  }

If you want to know why I start extracting text from the line that begins with "Advertisement", type:

  w3m -dump http://www.thefreedictionary.com/weird | more

at a prompt and look closely at the resulting text.

Doing that, you will also notice the biggest difference from the other function, besides the obvious fact that this one goes to a different website. Since many words have definitions much longer than 15 lines, here I just estimate how many lines I should read to get enough information. Extracting the whole definition and nothing else, regardless of its length, would certainly be possible, but it requires more advanced text parsing than I can show you in this space. Besides, doing it would not be worth the effort in this particular case, when all I want is to get a quick idea of what some word means.
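
For readers who want to experiment anyway, one possible approach is to print everything between a start marker and an end marker instead of a fixed number of lines. The sketch below is only an illustration: the helper name between is made up, and the "Related words" end marker is an assumption, not something taken from the real page, so you would have to find a suitable marker in the actual dump.

```shell
# Hypothetical helper: read text on standard input and print the lines that
# come after the first line matching $1, stopping before the next line
# that matches $2.
between () {
    awk -v start="$1" -v stop="$2" '
        !found && $0 ~ start { found = 1; next }   # skip the start marker itself
        found && $0 ~ stop   { exit }              # stop at the end marker
        found                { print }             # print everything in between
    '
}

# Sample text standing in for a page dump; the end marker is an assumption.
printf '%s\n' 'Menu' 'Advertisement' 'weird' 'adj. strange' 'Related words' 'Footer' |
    between '^Advertisement' '^Related words'
```

The output length then follows the definition itself rather than an estimate, at the price of depending on two markers instead of one.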

Credits

The two functions above are my own, updated versions of those I originally fetched from David Crouse. Thanks, David! Web scraping is great, and doing it from shell functions makes it even more flexible.

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

9 comments
todd_dsm

It goes out to some site, finds the public IP address of the company whose location you happen to be working at at the time, and gives you their public IP. But, I really dig the word definition search.

wolsonjr

thanks, handy - but dictionary example has 'weird' hardwired instead of arg 1, ${1}

mfioretti

Hi Wolson, you're right, that remained from the temp version I use for testing (because the script I actually use is longer and also does stuff that wasn't really relevant here). Thanks for the correction

shoggothe

... piping or using process output as an argument when piping isn't adequate/possible like so: grep 'pattern' <(shell command)

Sterling chip Camden

On FreeBSD at least, you can use 'fetch' instead of w3m's dump. Why spin up a whole browser if you aren't going to use most of it? I don't know if fetch is available on all *nix platforms, though.

mfioretti

I normally tend to use w3m instead of fetch in scripts like these for the reasons that brister explained, plus the fact that (at least on some servers I help manage) I have to install and use it anyway for remote administration via webmin

Sterling chip Camden

I tend to think in terms of feeds, where you want the markup because it's semantic.

brister

Using 'fetch' would leave you with the raw html which you would then have to parse. That would be more reliable (probably), but a lot more complicated. w3m gives you the plain text which is a lot simpler.