In our last article on compression we showed you how to demonstrate run time encoding in Python. In this article we’ll show you how to implement another kind of compression, Huffman encoding, which is useful when dealing with small sets of items, such as character strings.

Huffman encoding is a favourite of university algorithms courses because it requires the use of a number of different data structures together. We’ll be using the python heapq library to implement a priority queue, so if you’re unfamiliar with that library, go back and read our previous guide.

The Huffman encoding algorithm has two main steps:

  • Create a binary tree containing all of the items in the source by successively combining the least occurring two elements in the list until there is only one node which all others spring from.
  • For each item in the sequence, walk down the tree from the root to the node labelled with the item — add a 0 for each time you take the left branch and a 1 for each time you take the right branch.

At the end of this process you’ll have a binary string, which can be decoded by using the tree and reversing the last step. Using Huffman encoding the more commonly occurring items are given shorter binary strings, whereas the standard ASCII character encoding makes all characters have encodings of the same length.

First, we need to build the tree. Tree nodes are simple objects, but they need to keep track of four things: Which item they store (if any), the combined weight of them and their children, and their left and right child nodes. We’ll implement a class to store this information:

class Node(object):
left = None
right = None
item = None
weight = 0

def __init__(self, i, w):
self.item = i
self.weight = w

def setChildren(self, ln, rn):
self.left = ln
self.right = rn

def __repr__(self):
return "%s - %s -- %s _ %s" % (self.item, self.weight, self.left, self.right)

def __cmp__(self, a):
return cmp(self.weight, a.weight)

We define the __repr__ function so we can print out the status of the nodes (allowing you to debug the tree), and the __cmp__ function so we can use the heapq module to order the nodes.

Now we need to build the tree: To do this firstly we need to create nodes for each item in the list, then be able to find the nodes with the smallest weights so we can combine them.

We’ll use the groupby function of the itertools module to calculate the original weights, then use a heapq priority queue to rank the nodes. In this program, the source string is called input.

from itertools import groupby
from heapq import *

itemqueue = [Node(a,len(list(b))) for a,b in groupby(sorted(input))]
while len(itemqueue) > 1:
l = heappop(itemqueue)
r = heappop(itemqueue)
n = Node(None, r.weight+l.weight)
heappush(itemqueue, n)

At the end of this step, itemqueue has only one element, the root node of the tree.

Next we need to walk through the tree to work out the encoding for each item. Rather than go through the tree for each character, we’ll traverse the whole tree in one go and store the results in a dictionary:

codes = {}

def codeIt(s, node):
if node.item:
if not s:
codes[node.item] = "0"
codes[node.item] = s
codeIt(s+"0", node.left)
codeIt(s+"1", node.right)


We use a recursive function to accumulate the encoding as we walk down the tree and branch at each non-leaf node.

If we put the above two code fragments in a function called huffman, then we can return the relevant information using the following:

def huffman(input):
# above code 

return codes, "".join([codes[a] for a in input])

That’s about it. The function returns both the dictionary from item to encoding (invert it and you can use it to decode) and the encoded string. Lets try it on a few examples:

>>> from huffman import *
>>> input = "aababcabcd"
>>> huffman(input)
({'a': '0', 'c': '111', 'b': '10', 'd': '110'}, '0010010111010111110')
>>> len(input) * 8, len(huffman(input)[1])
(80, 19)
>>> input = "the quick brown fox jumps over the lazy dog" 
>>> huffman(input)
({' ': '00', 'a': '10000', 'c': '10011', 'b': '110000', 'e': '1010', 'd': '111111', 'g': '110111', 'f': '01001', 'i': '110101', 'h': '11110', 'k': '110011', 'j': '111110', 'm': '110001', 'l': '110010', 'o': '1011', 'n': '01000', 'q': '10010', 'p': '110110', 's': '110100', 'r': '11101', 'u': '0101', 't': '11100', 'w': '01111', 'v': '10001', 'y': '01100', 'x': '01110', 'z': '01101'}, '111001111010100010010010111010110011110011001100001110110110111101000000100110110111000111110010111000111011011010000101110001101011101001110011110101000110010100000110101100001111111011110111')
>>> len(input) * 8, len(huffman(input)[1])
(344, 192)
>>> import urllib
>>> input = urllib.urlopen("").read()
>>> len(input) * 8, len(huffman(input)[1])
(57808, 36340)
>>> input = file("wrnpc12.txt").read()
>>> len(input) * 8, len(huffman(input)[1])
(26281320, 14962010)

The first two show that the less characters you have in the input string, the better the compression is. The last two are tests on larger sources — firstly the python documentation Web site — which has a lot of characters, including punctuation, and secondly the entire text of “War and Piece”, where using huffman encoding halves the size of the resulting string.

Huffman encoding is another approach to compression, rather than work on repetitions, it works on limiting the size of the character set. If your source contains a large range of characters then the advantages gained by Huffman encoding are greatly reduced.