Developer

Python groupby, the iterator swiss army knife

The groupby function is useful for a range of needs, but one of the best uses for it is in replicating the UNIX filter uniq in Python.

The groupby function is useful for a range of needs, but one of the best uses for it is in replicating the UNIX filter uniq in Python.

Last week we looked at the itertools module and how its iterator based functions can be faster than doing things from scratch. One of the examples showed a function using itertools.takewhile to be much faster than one using itertools.groupby. Some feedback we got, asked, what then is the point of groupby if it's just a slower version of other functionality?

Groupby is useful for all kinds of things, if you're only going to be able to remember one function in itertools, groupby is the one that's going to let you get the most done.

For example, the compress function from the previous article can be written simply as:

def compress(l):
	return [a for a,b in groupby(l)]

It's not the fastest, but if your data set isn't huge it may just be quick enough. We tested it on a list of two million numbers and the difference was less than a minute.

Likewise the pack function is just as simple:

def pack(l):
	return [list(b) for a,b in groupby(l)]

Let's just remind ourselves about how these work:

>>> import random
>>> from itertools import *
>>> x = [random.randint(0,5) for x in range(20)]
>>> x
[3, 5, 1, 2, 4, 4, 4, 1, 4, 5, 2, 5, 3, 4, 5, 0, 5, 0, 2, 1]
>>> [(a,list(b)) for a,b in groupby(x)]
[(3, [3]), (5, [5]), (1, [1]), (2, [2]), (4, [4, 4, 4]), (1, [1]), (4, [4]), (5, [5]), (2, [2]), (5, [5]), (3, [3]), (4, [4]), (5, [5]), (0, [0]), (5, [5]), (0, [0]), (2, [2]), (1, [1])]
>>> compress(x)
[3, 5, 1, 2, 4, 1, 4, 5, 2, 5, 3, 4, 5, 0, 5, 0, 2, 1]
>>> pack(x)
[[3], [5], [1], [2], [4, 4, 4], [1], [4], [5], [2], [5], [3], [4], [5], [0], [5], [0], [2], [1]]

If you're familiar with the terminal, you may notice that this is very similar to the unix "uniq" command, which works just like compress on lines of text. One of the most common things to do with uniq is combine it with sort, and we can do that here as well:

>>> [a for a,b in groupby(sorted(x))]
[0, 1, 2, 3, 4, 5]
>>> [list(b) for a,b in groupby(sorted(x))]
[[0, 0], [1, 1, 1], [2, 2, 2], [3, 3], [4, 4, 4, 4, 4], [5, 5, 5, 5, 5]]

Another common use for uniq is to count the number of occurrences of each item in the list, but using "sort | uniq -c". You can do that with groupby as well:

>>> [(a,len(list(b))) for a,b in groupby(sorted(x))]
[(0, 2), (1, 3), (2, 3), (3, 2), (4, 5), (5, 5)]

If you need that info in a dictionary — say a mapping between object and number of occurrences, you just need to wrap that list in a dict() object:

>>> dict([(a,len(list(b))) for a,b in groupby(sorted(x))])
{0: 2, 1: 3, 2: 3, 3: 2, 4: 5, 5: 5}

Last item on the agenda for today, returning only the duplicates in a list — just like "uniq -d". We're using list comprehensions, so we can use the filtering inbuilt to only take groups with more than one element:

>>> [a for a,b in groupby(x) if len(list(b)) > 1]
[4]

Writing a uniq substitute is just one of the things you can do with groupby. Check out the Python library documentation for more ideas.

Correction: The original version was missing return statements in the compress and pack functions. Thanks for pointing this out readers.

Editor's Picks

Free Newsletters, In your Inbox