Developer

Faster, smaller, clearer: Python iterator tools

With Python's itertools module you can quickly and simply perform some of the more complicated operations you'll need to do on lists. It will make your code perform better and become easier to read.

After you've got more Python under your belt, you can start digging into the Python standard libraries for ideas on how to fine tune your functions. We'll show you how with some examples using the itertools module.

We've gone through the basics of using list comprehensions, and using generators to iterate over a sequence without calculating it all at once. With the itertools module you can quickly and simply perform some of the more complicated operations you'll need to do on lists. We'll look at some of the functions provided by the module, and work on writing small functions to compress and pack lists.

compress()

Let's start with a function that simply removes all consecutive duplicates in a list. An initial attempt might look something like:

def compress(seq):
	result = []
	for i in range(len(seq)-1):
		if seq[i] != seq[i+1]:
			result.append(seq[i])
	return result

It's a fairly standard iterative solution, the kind you could write quite happily in Java or .NET and be confident it's the best way to do it. Let's quickly test the performance by trying out our compress function on a big list of numbers:

>>> long = [random.randint(0,50) for i in xrange(2000000)]
>>> i = time.clock(); q = compress(long); print time.clock() - i;
2.11

First we build a list of two million random elements between 0 and 50, then we check the clock before and after compressing it. It takes a little over two seconds on my machine, not bad really.

It doesn't seem very pythonic though, something should tell you that there's a better way to do things. Let's try to do the same thing but using a list comprehension and the built in zip() function. If you're not familiar with zip, here it is in action:

>>> zip(['a','b','c'], [1,2,3])
[('a', 1), ('b', 2), ('c', 3)]

Our second compress function looks like this:

def compress2(x):
	return [a for a,b in zip(x, x[1:]+[None]) if a != b]

The trick here is that we create a list of pairs, each containing two consecutive elements in the list. Then we include the first item in each pair so long as it is different from the second. It's definitely shorter, and arguably cleaner, but how does it go performance wise?

>>> i = time.clock(); q = compress2(long); print time.clock() - i;
8.31

Much worse — building that list of pairs is clearly very expensive, and you can see why, that list of two million items is now two million pairs of items. What we really want to do is generate the list of pairs piece by piece. We could write a generator that works like the zip function, but why do that when Python gives us one in the itertools module: izip(). Using izip we can write compress like so:

from itertools import *
def compress3(x):
	return [a for a,b in izip(x, x[1:]+[None]) if a != b]

Is this actually any better though?

>>> i = time.clock(); q = compress3(long); print time.clock() - i;
0.76

It is! It's quicker, it's clearer and it uses less memory.

pack()

Can we apply this approach to another kind of function, such as pack? Pack is a similar function that takes a list, and breaks consecutive runs of elements into sub lists. Pack works like follows:

>>> pack([1,1,2,3,4,4,1,1,2,2,3])
[[1,1],[2],[3],[4,4],[1,1],[2,2],[3]]

One method of implementing this function could be to loop through the values, keeping track of the previous values and adding to sublists if the last was equal to the current. A first attempt would look something like:

def pack(seq):
	results = []
	temp = []
	last = None
	for current in seq:
		if current == last:
			temp.append(current)
		else:
			results.append(temp)
			temp = [current]
		last = current
	results.append(temp)
	return results

That works fine enough, let's see how quick it is.

>>> i = time.clock(); q = pack(long); print time.clock()-i
17.34

That's not too good, can we do the same thing as with compress and move all of the work into a function from itertools? Here's an implementation using the groupby function, which returns a tuple containing firstly the value of each item in the list ignoring duplicates (like compress) and a list containing all the consecutive duplicates (like pack). So we can just take the second half of each tuple:

def pack2(l):
	return [list(group) for name, group in groupby(l)]

When we run it:

>>> i = time.clock(); q = pack2(long); print time.clock()-i
36.41

That's much worse, since we're doing all this extra work and throwing it away. What we really want to do is use the compress function we've already written to help, since we know it's fast.

def pack3(seq):
	it = iter(seq)
	return [list(takewhile(lambda y: y==x, it))+[x] for x in compress3(seq)]

This version is almost as compact, but there's a lot going on here: let's break it down piece by piece. First we take the sequence and create an iterator — this allows us to keep our position as we peel off sublists, we only need to go through the list twice, no need to start from the beginning every time we loop. The main loop is implicitly caused by a list comprehension — using compress, we get a list of what the content of each sub list will be, we just need to put the right items in the list.

That's where the takewhile generator function comes in (it's in itertools too), which takes a function and a sequence and keeps returning items from the sequence while the function is false. Here's an example to illustrate this:

>>> list(takewhile(lambda x: x > 3,[5,6,4,3,1,1,1,3,4]))
[5, 6, 4]
>>> list(takewhile(lambda x: x 

Those lambdas are just quick python shortcuts for defining functions, you can read more about them in the Python reference manual.

Now that we've worked on a new version we should check how quick it is:

>>> i = time.clock(); q = pack3(long); print time.clock()-i
14.23

That's better, but it looks like there's still some room for improvement. I'll leave it up to you guys.

Just from these small examples you can see that by using iterators, and having a good knowledge of the libraries that come with your Python distribution, like itertools, you can be much more efficient, both in terms of time to write code and time to run code. There's a lot more in the itertools module, we've only gone through three of the fourteen functions in itertools, for the full scoop, take a look at the Python library documentation page.

Editor's Picks

Free Newsletters, In your Inbox