Computer vision research is flourishing — delivering machines that recognise who and what is in pictures almost as well as — and occasionally better than — humans.

Cheap, readily available computing power, combined with a class of machine learning model known as the convolutional neural network (CNN), has enabled a leap forward in the efficacy of computer vision in recent years.

On many tasks machines can identify faces, animals and objects more than 90 percent of the time, but researchers continue to refine these systems, pushing for ever-more accurate performance.

Take spotting house numbers: even machines can find it tricky to decide whether a blurry digit on a door is a '2' or a '7'.

On Thursday, Google DeepMind published its latest paper, describing how it achieved state-of-the-art performance in picking out house numbers from a database of more than 200,000 images taken from Google Street View.

Google achieved unmatched results on this and another computer vision challenge by refining the CNN used to classify the images.

DeepMind researchers added a module to their CNN that learns how to manipulate images to make people, animals and objects easier to identify: removing distortion, noise and clutter, and enlarging and rotating areas of interest. This spatial transformer module strips out much of the extraneous detail that could hamper identification, forwarding on only the parts of the image that are useful.
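A rough sense of how such a module manipulates an image can be sketched in NumPy. The core idea is an affine transformation applied through a sampling grid: a set of parameters (theta) says where in the source image each output pixel should look, and bilinear interpolation reads off the values. In DeepMind's design these parameters are predicted by a small "localisation" network inside the CNN; in this illustrative sketch theta is simply fixed by hand to zoom into the centre of the image, and the function names are the author's own, not from the paper.

```python
import numpy as np

def affine_grid(theta, H, W):
    # Build normalised target coordinates in [-1, 1] for an H x W output,
    # then map them through the 2x3 affine matrix theta to source coords.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    src = theta @ coords                                          # (2, H*W)
    return src[0].reshape(H, W), src[1].reshape(H, W)

def bilinear_sample(img, sx, sy):
    # Read the image at (possibly fractional) source coordinates,
    # blending the four surrounding pixels.
    H, W = img.shape
    x = (sx + 1) * (W - 1) / 2   # back from [-1, 1] to pixel indices
    y = (sy + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy
            + img[y0 + 1, x0 + 1] * wx * wy)

# Hand-picked theta: scale by 0.5 with no rotation or translation,
# i.e. zoom into the central region of the image.
img = np.arange(16.0).reshape(4, 4)
theta = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
sx, sy = affine_grid(theta, 4, 4)
out = bilinear_sample(img, sx, sy)
```

Because both the grid generation and the sampling are differentiable, gradients can flow back through them into the network that predicts theta, which is what lets the module learn, rather than be told, where to look.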

For example, when applied to a CNN tasked with recognising images of traffic signs, the module allowed the CNN to learn to focus on the sign and gradually remove the background. In another task, the spatial transformer learned how to identify and single out heads and bodies of birds in a collection of images.

The improvements delivered by the spatial transformer module are modest, but they push the boundaries of already impressive performance to state-of-the-art results. In the Street View House Numbers task, the machine vision system has to read house numbers of up to five digits from over 200,000 Street View images. By adding a spatial transformer to a CNN, Google reduced the error rate from 3.9 percent to 3.6 percent. Similarly, when identifying which of 200 different species of bird was pictured in the thousands of images in the CUB-200-2011 collection, Google's system achieved an accuracy of 84.1 percent, some 1.8 percentage points better than the previous best result.

Google uses CNNs for various computer vision-related tasks, with the Google Photos app relying on a deep learning-based system to automatically recognise, classify and organise images. In its position as a custodian of much of the world's data, the search giant is also researching deep learning techniques to greatly improve object detection, classification and labelling.

Google DeepMind hit the headlines recently for developing a machine learning system capable of mastering Go, an ancient Chinese game whose complexity stumped computers for decades.
