Software

Run length encoding in Python

Data compression is a must in modern communication. Run length encoding is one of the simplest effective forms of compression. We'll show you how you can implement it in just a few lines of Python.

Data compression is a must in modern communication. Run length encoding is one of the simplest effective forms of compression. We'll show you how you can implement it in just a few lines of Python.

Without data compression, many of the communication formats we take for granted would not be possible. Compare the size of an uncompressed image (say in .BMP or .TIFF format) with a compressed format like .JPEG or .PNG — the compressed image is often more than 20 times smaller, making it much faster and cheaper to transmit over the wires.

Run length encoding is a simple form of data compression, where consecutive elements, or runs, are replaced by just one element showing how many are in the run. For example:

"aaaabbaaa" -> "4a2b3a"

Here the string "aaaabbaaa" becomes "4a2b3a", or four "a"s, two "b"s and three "a"s — nine characters becomes six. Run length encoding is most effective when you're compressing data with few symbols (like binary numbers or runs of black/white pixels on an image) when we can expect symbols to appear in groups.

We'll write a function encode that takes in a list, and returns a list of tuples, the first item being the number of times an element occurs, and the second being the element itself, for example:

>>> encode("aaaabbaaa")
[(4, 'a'), (2, 'b'), (3, 'a')]

In my last article I looked at how the itertools.groupby function can be useful in a range of functions involving consecutive elements in a list. Again, that's exactly what we have here. We can use groupby to run length encode a list in one line:

from itertools import *
def encode(l):
	return [(len(list(group)),name) for name, group in groupby(l)]

Next up we need to be able to decode these encoded strings. The decode function has two parts: firstly we need to convert a tuple of (length,item) into a list of length items, and secondly we need to stitch all of these lists together. For the first part we can take advantage of the * operator on lists:

>>> 4*['a']
['a', 'a', 'a', 'a']
>>> [length*[item] for length,item in encode("aaaabbaaa")]
[['a', 'a', 'a', 'a'], ['b', 'b'], ['a', 'a', 'a']]

Then to flatten this list, we can use a simple trick which takes advantage of how the + operator works on lists — the sum function. Sum simply adds all items of a list to a given starting value, we can use it to add all items in the list to the empty list:

>>> sum([['a','a','a'],['b']], [])
['a', 'a', 'a', 'b']
>>> sum([length*[item] for length,item in encode("aaaabbaaa")],[])
['a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a']

There's our decode function, in one line:

def decode(l):
	return sum([length * [item] for length,item in l],[])

Let's test that this works as expected:

>>> decode(encode("aaaabbaaa"))
['a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a']
>>> decode(encode([1,2,1,1,1,2,2,2,2,3,3,3,4,3]))
[1, 2, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 3]
>>> encode([1,2,1,1,1,2,2,2,2,3,3,3,4,3])
[(1, 1), (1, 2), (3, 1), (4, 2), (3, 3), (1, 4), (1, 3)]
>>> decode([(1, 1), (1, 2), (3, 1), (4, 2), (3, 3), (1, 4), (1, 3)])
[1, 2, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 3]
>>> encode("Hello world!")
[(1, 'H'), (1, 'e'), (2, 'l'), (1, 'o'), (1, ' '), (1, 'w'), (1, 'o'), (1, 'r'), (1, 'l'), (1, 'd'), (1, '!')]

This last example illustrates a problem with our encoding, when the input data does not cluster groups of items together it becomes bloated with extra information, defeating the purpose of compressing it. What we really want is a modified run length encoding function, that only encodes items if they appear in groups of two or more, otherwise leaving them as they are.

One way to write this is to encode the list fully, and then decode the items that have a length of one:

def encode_modified(l):
    def one_uncode(x):
		length,name = x
		if length == 1:
			return name
		return x
	return [one_uncode(i) for i in encode(l)]

Then to decode, we need to be able to make a distinction between the items that are encoded and those that have been left as they were. The difference between them is that the encoded items are tuples, and the ones that have been left (presumably — more on this later) are not. We can use the built-in isinstance function to determine this:

>>> isinstance([],list)
True
>>> isinstance([],tuple)
False
>>> isinstance((1,'a'), tuple)
True

Then we just need to write a function that will expand a tuple and leave anything else alone, and then call it on each item in the list.

def decode_modified(l):
    def recode(x):
        if isinstance(x,tuple):
            length,name = x
            return [name]*length
        return x
    return sum([recode(i) for i in l],[])

That's it. Be careful though, these functions will only work when your data stream is not already a list of tuples — if it is, you'll need to create a new class to store the packed data, but be wary that it comes with additional space overhead.

Editor's Picks

Free Newsletters, In your Inbox