There are times when your programs need to access the Web without worrying about the details of the mark-up. In this example we write an HTML scraper using the Python parsing library BeautifulSoup.
The Web holds a truly awe-inspiring amount of information, which we're all usually happy enough to access through our Web browser. There are times, however, when your programs need to access it, and you don't want to worry about the details of the HTML mark-up.
There are many HTML (or SGML, or XML) parsing libraries available for many languages, but for this example we use a Python library called BeautifulSoup, which takes care of almost all of the work for you. The BeautifulSoup library is an extremely helpful tool to have at your disposal: it not only gives you functions to search and modify your parse tree, but also handles the broken and malformed HTML you're likely to encounter on an average Web page.
You can download the library from its Web page. It is also available in some popular software repositories, such as the package repositories that apt-get uses on the Debian and Ubuntu distributions.
We'll write a Web scraper that prints all the displayed text contained within <p> tags. This is a very simple implementation that is easy to trip up, but it should be enough to demonstrate how the library works.
First up, we need to retrieve the source of the page that we want to scrape. The following code will take an address given on the command line and put the contents into the variable html:
    import sys
    import urllib2

    # sys.argv[0] is the script name; the address is the first argument
    address = sys.argv[1]
    html = urllib2.urlopen(address).read()
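The snippet above targets Python 2, which is where urllib2 lives. As a rough sketch for readers on Python 3, the same step might look like the following. Here fetch_html is a hypothetical helper name, and falling back to UTF-8 when the server declares no character set is an assumption, not something the original code does.

```python
import urllib.request


def fetch_html(address):
    """Download `address` and return the response body as text."""
    with urllib.request.urlopen(address) as response:
        # Assumption: use the declared charset, or UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In a script you would call it with the command-line argument, for example html = fetch_html(sys.argv[1]) after importing sys.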
Then we need to build a parse tree using BeautifulSoup:
    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(html)
At this point the code has already been cleaned up and converted to Unicode by the BeautifulSoup library; you can print soup.prettify() to get a clean dump of the source code.
Instead, what we want is to print all of the text, without the tags, so we need to find out which parts of the parse tree are text. In BeautifulSoup there are two kinds of nodes in the parse tree: plain text is represented by the NavigableString class, whereas Tags hold mark-up. Tags are recursive structures: they can hold many children, each being either another Tag or a NavigableString.
We want to write a recursive function that takes part of the tree: if a node is a NavigableString, it prints it out; otherwise, it runs the function again on each subtree. This is easy because we can iterate over a tag's children simply by iterating over the tag itself.
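    from BeautifulSoup import NavigableString

    def printText(tags):
        for tag in tags:
            if tag.__class__ == NavigableString:
                # Plain text: print it; the trailing comma suppresses the newline
                print tag,
            else:
                # A Tag: recurse into its children
                printText(tag)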
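To see the recursion on its own, without the library, here is a small Python 3 sketch that models the tree with plain Python values: a str stands in for a NavigableString, and a list stands in for a Tag and its children. The names collect_text and paragraph are illustrative only, and the function returns the pieces instead of printing them so the result is easy to check.

```python
def collect_text(node):
    """Return the text pieces found under `node`, in document order."""
    if isinstance(node, str):
        # Leaf: plays the role of a NavigableString.
        return [node]
    pieces = []
    for child in node:
        # Inner node: plays the role of a Tag, so recurse into each child.
        pieces.extend(collect_text(child))
    return pieces


# Toy model of: <p>Hello <b>big <i>wide</i></b> world</p>
paragraph = ["Hello ", ["big ", ["wide"]], " world"]
print("".join(collect_text(paragraph)))  # prints: Hello big wide world
```

The shape is the same as printText: handle a leaf directly, and otherwise call the function again on every child.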
Then we just need to run that function on all the <p> tags. We can use BeautifulSoup's built-in parse tree searching functions to retrieve all of them:
    printText(soup.findAll("p"))
That's it. You've got a fully functioning, if basic, HTML scraper. For more help with searching the parse tree, look up the BeautifulSoup documentation.
The full code for this example is as follows:
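    from BeautifulSoup import BeautifulSoup, NavigableString
    import re
    import sys
    import urllib2

    # Fetch the page named by the first command-line argument
    address = sys.argv[1]
    html = urllib2.urlopen(address).read()

    # Build the parse tree
    soup = BeautifulSoup(html)

    def printText(tags):
        for tag in tags:
            if tag.__class__ == NavigableString:
                print tag,
            else:
                printText(tag)
        # End the current line once this level has been printed
        print ""

    printText(soup.findAll("p"))

    # Alternative: join the text nodes matched by a regular expression
    print "".join(soup.findAll("p", text=re.compile(".")))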
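The listing above is written for Python 2 and the BeautifulSoup 3 package. For readers on Python 3, here is a hedged sketch of the same idea using Beautiful Soup 4, assuming the bs4 package is installed. The helper names collect_text and paragraph_texts are hypothetical, the built-in "html.parser" backend is just one possible choice, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def collect_text(node):
    """Recursively gather the plain-text nodes below `node`."""
    pieces = []
    for child in node.children:
        if type(child) is NavigableString:
            # Exact type check, as in the original: this skips comments
            # and other NavigableString subclasses.
            pieces.append(str(child))
        elif isinstance(child, Tag):
            pieces.extend(collect_text(child))
    return pieces


def paragraph_texts(html):
    """Return the displayed text of each <p> element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return ["".join(collect_text(p)) for p in soup.find_all("p")]


# Made-up sample with an unclosed final <p>, to show the parser coping.
sample = "<p>Hello <b>big</b> world</p><div>skipped</div><p>Second"
for text in paragraph_texts(sample):
    print(text)
```

Beautiful Soup 4 also offers a get_text() method on tags, which covers the common case without a hand-written walk; the explicit recursion is kept here only to mirror the original example.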