Parsing data from the Web in Python

Builder AU's Nick Gibson runs through the development of a quick program to parse data from the Web.

In my last article I provided a gentle quick start to working with Python. If you're not familiar with the language, you might want to go back and give that a read. Now that you've got the basics firmly under your belt, it's time to start putting them to use by writing something a little more interesting. A few weeks ago I received an e-mail from a Web host letting me know that my space was about to expire, and that I had a month to back up all my files before they were deleted. Since I was only storing a few old photos on this particular host it was no big loss, but I would still like to keep those pics. Rather than saving the files from the Web page one by one, or going through the Web host's administration page, I wanted to write something to handle it all for me. So we're going to walk through the development of a command line program that will parse a Web page and print the addresses of all images used on that page. By the end of this article we'll have run through opening and reading HTML data over HTTP, defining functions, adapting to various user inputs and (briefly) using regular expressions to parse text.

Defining Functions

First we need to run through one more basic language feature of Python: the Function. Functions let you set aside a block of code and give it a name, so that instead of typing the whole block of code each time you want to use it, you can just refer to it by name. Defining functions in Python is simple:

def hello(name):
    print "hello " + name

The word directly after the def keyword is the name of the function, and the words inside the parentheses are the names of the parameters — the input to the function. Calling functions is just as easy:

>>> hello("world")
hello world
>>> hello("everyone")
hello everyone

Using functions is generally a good idea in all kinds of programming, since they reduce maintenance issues introduced by copying and pasting code, and allow you to group code together by what it does, making your program easier to read and maintain.

Managing user input

Whenever your program relies on input from the user, you're going to run into the problem of incorrect input. Most of the time it's enough to fail gracefully, printing an error and closing the program, but sometimes you can go one better: correct the input and continue. In this program, the user must give the program a Web address as an argument, so we need to check that the input is an address we can work with. This program is only for Web sites, and so can only accept addresses using the HTTP protocol. We'll write a function that checks this, and adds the HTTP protocol specification if none is given. The full function is below; don't worry if you don't understand it right away, we'll go through it in more detail:

def parseAddress(input):
    if input[:7] != "http://":
        if input.find("://") != -1:
            print "Error: Cannot retrieve URL, protocol must be HTTP"
            sys.exit(1)
        else:
            input = "http://" + input
    return input

Firstly we define the parseAddress function, which takes one parameter, called input. Next we need to determine whether we've got a correct address, so we check whether the start of the string (remember that input[:7] returns a slice of the string input: its first seven characters) is "http://". If it is, no problem: we've got what we need. Otherwise we could just fail, but if no protocol is specified at all we'll instead assume that the user gave an HTTP address and simply left off the protocol specification at the beginning. We can check for the presence of a specification using the string method find. find takes a substring and returns the index of its first occurrence in the string, or -1 if the substring does not occur, like so:

>>> "hello world".find("hello")
0
>>> "hello world".find("wor")
6
>>> "hello world".find("word")
-1
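
The slice check works the same way; here's what input[:7] returns in the interpreter (the address is just an example):

>>> "http://www.builderau.com.au"[:7]
'http://'
>>> "www.builderau.com.au"[:7]
'www.bui'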

Let's test this function (Note: if you're trying this in the interpreter, remember to import sys for the exit function):

>>> parseAddress("http://www.builderau.com.au")
'http://www.builderau.com.au'
>>> parseAddress("www.builderau.com.au")
'http://www.builderau.com.au'
>>> parseAddress("ftp://builderau.com.au")
Error: Cannot retrieve URL, protocol must be HTTP

Opening and reading HTTP addresses

Python has a wide range of modules in its standard library that make otherwise complicated tasks very simple; in this instance we're going to use the urllib2 module to take the work out of opening Web pages. Opening and reading Web sites using urllib2 is as simple as opening text files:

import urllib2

website = urllib2.urlopen(address)   # address is the URL string, e.g. from parseAddress
website_text = website.read()

Just like with files, things can go wrong when you try to open addresses on the Internet: maybe the server is down, your Internet connection might be broken, or the file you're looking for might simply not exist. Whatever the reason, you need to be able to handle these little problems, and in Python the right way to do that is through exceptions. urlopen can throw a number of different exceptions, but the major two you need to know are HTTPError, which is raised when the Web server you connect to sends back an error code, and URLError, which is raised when another network or protocol error occurs. You can catch these exceptions like any other:

try:
    website = urllib2.urlopen(address)
except urllib2.HTTPError, e:
    print "Cannot retrieve URL: HTTP Error Code", e.code
except urllib2.URLError, e:
    print "Cannot retrieve URL: " + e.reason[1]

So, for example, when you try to retrieve a URL that does not exist you'll see an error message like:

% python2.4 images.py www.google.com/doesnotexist
Cannot retrieve URL: HTTP Error Code 404

Now maybe that's enough information for an error code like 404; most of us have seen those before. But how many could tell you off the top of their head that error code 407 means that proxy authentication is needed, or that 503 means that the server is under a high load and cannot process the request? Clearly we need a more human friendly way to tell the user about errors, and again Python provides, this time with a helpful dictionary defined in the module BaseHTTPServer. A dictionary is another basic Python type; in other languages it is sometimes called a hash or a map. We'll go into more detail about dictionaries another time, but for now you can think of one as a list, except that rather than retrieving items by index, you can retrieve them by any kind of identifier you like. In this case, the dictionary BaseHTTPRequestHandler.responses provides a mapping between error codes and explanations; if you're interested in the full list, see section six of the HTTP 1.1 specification RFC. So the following code:

import BaseHTTPServer

print BaseHTTPServer.BaseHTTPRequestHandler.responses[404]

Produces the following output:

('Not Found', 'Nothing matches the given URI')

We can use this dictionary to print more sensible error messages.
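
For instance, here's a minimal sketch of a friendlier handler (assuming urllib2 and BaseHTTPServer are imported and address is set as before); it's the same lookup the finished program below uses:

try:
    website = urllib2.urlopen(address)
except urllib2.HTTPError, e:
    # Look up the human readable phrase for this status code
    error_desc = BaseHTTPServer.BaseHTTPRequestHandler.responses[e.code][0]
    print "Cannot retrieve URL: " + str(e.code) + ": " + error_desc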

Finding the information we want using regular expressions

Now that we've got the text of the Web page we want to parse, we need to find all the image tags. For those who aren't aware, in an HTML document images are encoded in something similar to the following format (give or take a few attributes):

<img src="http://b2b.cbsimg.net/images/builder/path/to/image.png">

The part that we are interested in is within the quotes; everything else we can throw away. We're going to use a simple regular expression and the findall function to do all the hard work for us. Regular expressions are a way of specifying the grammar of a string of tokens (in this case the tokens are characters: numbers, letters or punctuation), so that a whole category of strings can be recognised, rather than having to specify each one individually. Regular expressions are a topic for their own article, but let's run through a couple of quick examples.

Say you had some text and you needed to recognise all strings of binary numbers, that is, strings that consist only of zeros and ones. The corresponding regular expression might look like this:

(0|1)+

The parentheses are used to group multiple characters together, while the pipe ("|") denotes a choice: either the token on the left of the pipe is matched, or the token on the right. Finally, the "+" means "take the previous token or group and match it one or more times", so in total this expression matches "one or more repetitions of either zero or one". There are some more common symbols that are useful to know. A period (".") stands for any single character, and "*" is similar to "+" but will match zero or more times, allowing for optional sections of the expression. So, for example, a regular expression that will match any string of five or more characters could be written like:

......*
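
If you want to experiment, the re module's match function makes a quick test bench; match anchors at the start of the string, and adding "$" to the binary pattern anchors it at the end, so only whole strings of zeros and ones pass:

>>> import re
>>> re.match('(0|1)+$', '0101') is not None
True
>>> re.match('(0|1)+$', '0102') is not None
False
>>> re.match('......*', 'hello') is not None
True
>>> re.match('......*', 'hi') is not None
False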

We could go on like this for some time explaining the ins and outs of regular expressions, but we've almost got enough here to write an expression that can recognise images in HTML. To run the expression we're going to use the re.findall function, which returns a list containing all matched groups in a string. Here's the code that finds our images:

matches = re.findall('<img .*src="?(.*?)"?', website_text)

Now there are a few new things here. Firstly, we enclose the pattern in single quotes because we want to use double quotes inside it, as the start and end points of the path. Secondly, we group the information we're interested in with "(" and ")"; findall throws away the parts of the match that are not in a group, so we won't have to manually remove the info we don't need. Lastly, we use the "?" character for two things: right after the quotes it means that the previous token is optional, and after the "*" it puts the expression matcher into non-greedy mode. By default regular expressions are greedy, that is, they will try to match as much text as they can; we want just the opposite, matching only until the first quote is found. Now if we open up the front page of http://www.builderau.com.au and run this expression we get the following results (your mileage may vary depending on what images are on the front page when you try it):

['http://ad.au.doubleclick.net/ad/popup.builderau.com.au/;sz=1x1;ord=123456789?', 
'/i/s/cov/checklist170110.jpg', '/i/s/cov/perl170110.jpg', '/i/s/cov/atwork170110.jpg', 
'/i/x/blogs/brendon_chase_52x52.gif', '/i/x/blogs/chris_duckett_52x52.gif', 
'http://dw.cbsimg.net/clear/redx/c.gif?ts=1163740722&sId=75']

We're getting there, but now we have another problem: paths to images in Web pages can be written either in absolute terms or relative to the location of the page. We can see this in the data above; some images have a full address starting with "http://" while others have only directories. What our program should do is add the path of the page to the beginning of each relative link. Firstly we need to find the path to the page, but we've done most of the hard work already, so we can just ask urllib2 what the site is:

dir = website_handle.geturl().rsplit('/', 1)[0]
if dir == "http:/":
    dir = website_handle.geturl()

We use the string method rsplit to remove the filename of the page itself. rsplit and its brother split break a string into segments separated by a particular character, working from the right and the left respectively; an optional second argument limits the number of splits. For example:

>>> "Hello World".split(' ')
['Hello', 'World']
>>> "Hello World".rsplit('l')
['He', '', 'o Wor', 'd']
>>> "Hello World".split('l',1)
['He', 'lo World']
>>> "Hello World".rsplit('l', 1)
['Hello Wor', 'd']
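
Putting the pieces together, here's a minimal sketch of fully qualifying each link (assuming the matches and dir variables from the steps above):

full_links = []
for match in matches:
    if match[:7] != "http://":
        # Relative path: prefix the directory of the page
        full_links.append(dir + "/" + match.lstrip('/'))
    else:
        # Already an absolute address
        full_links.append(match)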

Uniqueness and Sorting

We've got all the links and we've qualified them fully; now we just need to print the results in the best way. In this case we'll want to remove duplicates and sort the results. The easiest way to achieve uniqueness in Python is to store your data in a container that enforces it, such as the set. Sets are similar to lists, the difference being that their contents must be unique and they are not stored in any particular order. We're only using the basic features of the set:

>>> x = ['set', 'contents', 'hello', 'world']
>>> y = set(x)
>>> y
set(['world', 'set', 'hello', 'contents'])
>>> y.add("new item")
>>> y
set(['new item', 'world', 'set', 'hello', 'contents'])
>>> list(y)
['new item', 'world', 'set', 'hello', 'contents']

The important thing to notice here is that you can convert freely between lists and sets, but you may lose the order of the items in the list. Next we want to sort the data before we print it back to the user. There are many different sort algorithms, but we don't want to have to write one ourselves every time we need to sort something, and for this reason Python's list type has a sort method built in. It's a highly optimised general purpose sort, good enough for almost any situation. list.sort sorts in place, so it modifies the list it's called on rather than returning a new list:

>>> x = [1,5,3,6,7,2,8,9,4]
>>> x.sort()
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9]
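
Combining the two steps, deduplicating and then sorting a list of links looks like this (a small standalone example):

>>> links = set(['b.png', 'a.png', 'b.png'])
>>> links = list(links)
>>> links.sort()
>>> links
['a.png', 'b.png']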

NB: Thanks to reader Jeremy, who reminded me that sets were only introduced in Python 2.4. If you're still using Python 2.3 you can achieve the same thing with a dict; here is the above example using a dictionary rather than a set:

>>> x = ['set', 'contents', 'hello', 'world']
>>> y = dict.fromkeys(x)
>>> y
{'world': None, 'set': None, 'hello': None, 'contents': None}
>>> y['new item'] = None
>>> y
{'new item': None, 'world': None, 'set': None, 'hello': None, 'contents': None}
>>> y.keys()
['new item', 'world', 'set', 'hello', 'contents']

The final product

import re, urllib2, sys, BaseHTTPServer

def parseAddress(input):
    if input[:7] != "http://":
        if input.find("://") != -1:
            print "Error: Cannot retrieve URL, protocol must be HTTP"
            sys.exit(1)
        else:
            input = "http://" + input
    return input

def retrieveWebPage(address):
    try:
        web_handle = urllib2.urlopen(address)
    except urllib2.HTTPError, e:
        error_desc = BaseHTTPServer.BaseHTTPRequestHandler.responses[e.code][0]
        print "Cannot retrieve URL: " + str(e.code) + ": " + error_desc
        sys.exit(1)
    except urllib2.URLError, e:
        print "Cannot retrieve URL: " + e.reason[1]
        sys.exit(1)
    except:
        print "Cannot retrieve URL: unknown error"
        sys.exit(1)
    return web_handle

# Check that we've been given exactly one argument: the address
if len(sys.argv) != 2:
    print "Usage:"
    print "%s url" % (sys.argv[0],)
    sys.exit(1)

address = parseAddress(sys.argv[1])
website_handle = retrieveWebPage(address)
website_text = website_handle.read()

# Find the contents of the src attribute of every img tag
matches = re.findall('<img .*src="?(.*?)"?', website_text)

# Work out the directory of the page so relative links can be qualified
dir = website_handle.geturl().rsplit('/', 1)[0]
if dir == "http:/":
    dir = website_handle.geturl()

# Fully qualify each link, using a set to remove duplicates
links = set()
for match in matches:
    if match[:7] != "http://":
        links.add(dir + "/" + match.lstrip('/'))
    else:
        links.add(match)

links = list(links)
links.sort()

for link in links:
    print link

And there you have it: a finished program to grab all of the image addresses from a Web page. It's a small example, but you'll find that it follows a pattern that will serve you through most of the programs you write. First you check user input; here we check that the number of arguments is correct and that the address given is of the right type. Secondly, you collect and process your data, in this example by downloading a Web page and extracting addresses using regular expressions. Finally, you organise and present the output.
