Google is obsessed with phishing, thankfully

Google has a significant stake in the Internet and whether it survives or not. Could that be why the search giant has declared all-out war against the phishers of the world?

People associate phishing with identity theft. That's bad enough, but there is something else to consider. If phishing continues to be successful, people will be afraid to do anything online, especially when it requires disclosing personal information.

Businesses, financial establishments, and companies that exist because of the Internet are keenly aware of this. It seems that Google, one such company, has decided to bring its vast arsenal of technology to bear on the problem of phishing.

There are two reasons why I am interested in Google's approach to anti-phishing. First, their Anti-Phishing Team has been able to automate the blacklisting process, no small feat. Second, they are finally talking about how they do it. Their method involves two parts: a client-server interface and the back-end database. Let's look at the client-side service first.

Client side

The client interface is Google's Safe Browsing API. It has been working quietly in the background for several years, a fact many do not realize. Three of the four (edited to reflect Internet Explorer's ranking) big-name Web browsers, Firefox, Safari, and Chrome, use it. Google defines the Safe Browsing service as:

"At a high level, the service works by checking each URL the client loads against a list of known phishing and malware sites. The list of known sites is represented as host-suffix / path-prefix expressions.

As the name suggests, these expressions can match arbitrary URLs as long as they have the required host suffix and path prefix. This approach helps protect against sites where the attacker uses many different URLs in order to try to evade blacklists."
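To make the idea of host-suffix / path-prefix expressions concrete, here is a minimal sketch in Python. It is an illustration only, not Google's implementation: the function names, the blacklist entries, and the example domains are all made up, and the sketch ignores whatever canonicalization or encoding the real service applies to URLs.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, written as (host suffix, path prefix).
BLACKLIST = [
    ("evil.example", "/login/"),
    ("phish.test", "/"),
]

def matches(url, host_suffix, path_prefix):
    """True if the URL's host ends with host_suffix (on a label boundary)
    and its path starts with path_prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    # "a.b.evil.example" matches "evil.example"; "notevil.example" does not.
    host_ok = host == host_suffix or host.endswith("." + host_suffix)
    return host_ok and path.startswith(path_prefix)

def is_blacklisted(url):
    """Check a URL against every (host suffix, path prefix) expression."""
    return any(matches(url, h, p) for h, p in BLACKLIST)
```

With these made-up entries, any subdomain of evil.example is caught as long as the path begins with /login/, which is how a single expression can cover the many URLs an attacker might rotate through.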

[Diagram, courtesy of Google: a visual description of the look-up process]

The client-server handshake is the easy part. Keeping the black list current while minimizing mistakes, false positives in particular, is where it gets tricky.

Back end

Until recently, Google kept mum about how their black list is populated. I first learned about it through a Google blog post. It pointed me to the paper Large-Scale Automatic Classification of Phishing Pages (pdf), which is now public after being presented by Colin Whittaker, Brian Ryner, and Marria Nazif (all members of Google's Anti-Phishing Team) at the 17th Annual Network and Distributed System Security Symposium.

Right away in the report, the team discusses what is needed for the black list to be effective:

  • Comprehensive: A blacklist that is not comprehensive fails to protect a portion of its users.
  • Error free: False positives subject users to unnecessary warnings. Eventually, the users will ignore the warnings.
  • Timely: The black list must update in real time, as most phishing sites are up for less than a day.

The report goes on to explain that the automatic classifier (back-end algorithmic process) uses the following Web-page elements in the decision-making process:

  • Page URL: Look for anything odd about the hostname. Is it unusually long, or does it contain an IP address?
  • Page content: The page is checked to see if it has a password or PIN field (or both). Additionally, the page is checked for links that may be pointing at a known phishing domain.
  • TF-IDF score: TF-IDF (term frequency-inverse document frequency) is a ranking method used when automatically scanning for phishing sites. Through the magic of mathematics, terms that stand out on a page, like "password" or "PIN," are given more weight.
  • Hosting information: What network hosts the Web site and where the Web servers are located geographically can be telling. For example, I'd be concerned if the Web server for an American bank is in a different country.
  • PageRank: PageRank is used to determine the spam reputation of the page's domain. Apparently, the Anti-Phishing Team has discovered a relationship between phishing pages and domains that send spam.

That's quite a list of things to check, and more than I would care to check each time I go to a new Web site.
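For the curious, the "magic of mathematics" behind TF-IDF is fairly simple. The sketch below uses one common textbook variant (term count divided by page length, multiplied by the log of total pages over pages containing the term). It is not taken from Google's paper, and the three sample pages are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / number of words in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Count, for each term, how many documents it appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Invented sample pages.
pages = [
    "enter your password and pin to verify your account",
    "read the latest news about your city",
    "your weather forecast for the week",
]
scores = tf_idf(pages)
```

In this toy corpus, "your" appears on every page, so its score is zero, while "password" appears on only one page and gets a positive score. That is the sense in which terms that set a page apart end up carrying more weight.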


Automating the search for the above elements was the Anti-Phishing Team's first step, and probably the simplest. The classifier then takes that information and scores the URL on a scale from 0.0 (not at all phishing) to 1.0 (definitely phishing). Finally, software called the Blacklist Aggregator prepares the list to be served to the clients.
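As a rough picture of how a handful of page features could be turned into a score between 0.0 and 1.0, here is a small sketch. Everything in it is an assumption made for illustration: the feature names, the 30-character hostname cutoff, the weights, and the use of a logistic function as the scoring rule. It is a stand-in, not a description of how Google's classifier actually computes its score.

```python
import ipaddress
import math
from urllib.parse import urlsplit

def url_features(url):
    """Extract two hostname features: is it a raw IP, is it unusually long?"""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1.0
    except ValueError:
        host_is_ip = 0.0
    # The 30-character cutoff is an arbitrary, illustrative choice.
    long_host = 1.0 if len(host) > 30 else 0.0
    return {"host_is_ip": host_is_ip, "long_host": long_host}

# Hypothetical weights; a real classifier would learn these from data.
WEIGHTS = {"host_is_ip": 2.5, "long_host": 1.5, "has_password_field": 1.0}
BIAS = -3.0

def phishing_score(features):
    """Map weighted features to a score between 0.0 and 1.0."""
    z = BIAS + sum(WEIGHTS.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

suspicious = url_features("http://192.0.2.10/login")
suspicious["has_password_field"] = 1.0
benign = url_features("http://www.example.com/")
benign["has_password_field"] = 0.0
```

Here phishing_score(suspicious) comes out higher than phishing_score(benign), because the raw IP address and the password field both push the score toward 1.0.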

What really makes this system effective is how the classifier is retrained every day to pick up new phishing trends. Google explains:

"As a training data set, we use a sample of roughly ten million URLs analyzed by the classification workflow over the past three months along with the features obtained at the time."

The report goes on to explain how the training data set is manipulated to test the classifier and make sure it is providing the most accurate results possible. From what I understand, the training process is the heart of the classifier and what separates Google's approach from others.
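The quoted passage describes drawing a sample from a rolling three-month window of analyzed URLs. A minimal sketch of that selection step might look like the following. The record layout, the function name, and the approximation of "three months" as 90 days are my assumptions, not details from the paper.

```python
import random
from datetime import date, timedelta

def training_sample(records, today, window_days=90, sample_size=1000, seed=0):
    """Pick up to sample_size records analyzed within the last window_days.

    Each record is assumed to be a dict with an "analyzed_on" date.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [r for r in records if cutoff <= r["analyzed_on"] <= today]
    if len(recent) <= sample_size:
        return recent
    # Seeded so the same inputs give the same sample.
    return random.Random(seed).sample(recent, sample_size)

# Invented records for illustration.
records = [
    {"url": "http://old.example/", "analyzed_on": date(2009, 12, 1)},
    {"url": "http://mid.example/", "analyzed_on": date(2010, 3, 20)},
    {"url": "http://new.example/", "analyzed_on": date(2010, 4, 1)},
]
sample = training_sample(records, today=date(2010, 4, 1))
```

Rerunning a selection like this every day lets old examples age out of the window, which is one way a classifier could keep up with new phishing trends.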

A different take

Another Google blog post I came across looks at phishing differently.

The post explains how Web-site designers can minimize the chance of having their work trigger anti-phishing scanners. After reading the post, I realized these points are something we should keep in mind as well:

  • Beware of username and password requests that are not specifically for that Web site.
  • Be leery of logos near login fields that are not related to the Web site.
  • Links to other Web pages should be readily viewable and related to the site's own domain.

The above bullet points are important, but easily missed. I was almost tricked by a password request that had nothing to do with the Web site I was viewing.

Report's conclusions

A short while ago, I wrote a piece about users rejecting security advice. One of my premises was that it is so difficult to keep track of all the anti-phishing rules that we don't. I am heartened by reports like this one from the Anti-Phishing Team. Their conclusions offer the following encouragement:

"In this paper, we describe our large-scale system for automatically classifying phishing pages which maintains a false positive rate below 0.1%. Our classification system examines millions of potential phishing pages daily in a fraction of the time of a manual review process. By automatically updating our blacklist with our classifier, we minimize the amount of time that phishing pages can remain active before we protect our users from them.

Even with a perfect classifier and a robust system, we recognize that our blacklist approach keeps us perpetually a step behind the phishers. We can only identify a phishing page after it has been published and visible to Internet users for some time. However, we believe that if we can provide a blacklist complete enough and quickly enough, we can force phishers to operate at a loss and abandon this type of Internet crime."
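To put the "false positive rate below 0.1%" figure in perspective: the false positive rate is the share of legitimate pages that get flagged as phishing, so 0.1% works out to fewer than 1 in 1,000. The sketch below shows the standard calculation. The 0.5 threshold and the sample labels and scores are invented for illustration and do not come from the paper.

```python
def false_positive_rate(labels, scores, threshold=0.5):
    """Fraction of legitimate pages scored at or above the threshold.

    labels: True for phishing pages, False for legitimate ones.
    FPR = false positives / all legitimate pages.
    """
    false_positives = sum(
        1 for is_phish, score in zip(labels, scores)
        if not is_phish and score >= threshold
    )
    legitimate = sum(1 for is_phish in labels if not is_phish)
    return false_positives / legitimate if legitimate else 0.0

# Invented data: four legitimate pages and one phishing page.
labels = [False, False, False, False, True]
scores = [0.1, 0.2, 0.9, 0.3, 0.95]
rate = false_positive_rate(labels, scores)
```

In this tiny example one of the four legitimate pages is wrongly flagged, giving a rate of 0.25, or 25%. That is a long way from the sub-0.1% rate the team reports.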

Final thoughts

Automated filters aimed at reducing phishing attacks are vital to the existence of the Internet as we know it. There may be other answers, but until they are turned into working systems, this seems like our best bet.

I also feel the more informed we are about phishing, the safer we will be. Call it the belt-and-suspenders approach.