CAPTCHAs may be the de facto standard for protecting against spammer scripts, but that doesn't mean they're always the best tools to use.
CAPTCHA, or "Completely Automated Public Turing test to tell Computers and Humans Apart", is easily the most widely used approach to testing site visitors on the Web to determine whether they are humans or scripts. As scripts become more sophisticated in their text recognition for purposes of solving CAPTCHAs, the CAPTCHAs themselves have to get harder to read, to one-up the scripts and keep spammers off their sites.
There are two problems with this arms race, from the point of view of the defenders, that are becoming huge issues:
- CAPTCHAs are getting too difficult for humans to read, not just for scripts. While this may help keep the scripts out, it also keeps the humans out, which generally defeats the purpose of the site. As the arms race progresses, we may soon reach a point where the scripts are better at solving CAPTCHAs than the humans.
- Spammers are starting to use humans to augment their scripts, crowdsourcing their CAPTCHA solving. Human micro-employment services like Amazon's Mechanical Turk are actually being used to get real human beings to solve CAPTCHAs. Sometimes, this is just someone testing CAPTCHA generators to ensure that humans are capable of reading them, but CAPTCHA solving for spammers is also being crowdsourced, which means that differentiating between human and code is no longer sufficient. What difference does it make how well you differentiate between people and scripts if the spammers are getting humans to do their work for them anyway?
Approaches other than CAPTCHAs are needed if we wish to keep accomplishing something aside from encouraging the development of strong AI in malware. Some special-case alternatives have already arisen, and they tend to be less annoying and intrusive for human users browsing the Web. Two that I have used are:
- Source checking for trackbacks and pingbacks to test for legitimacy.
- Using hidden fields to catch scripts reading what humans do not see.
Trackback and pingback source checking
For protecting a Weblog against spam trackbacks and pingbacks, a source checker (one that checks the source site of the trackback or pingback, not source code) can be invaluable. Trackbacks and pingbacks are automated means of getting a link to your Weblog post placed on other people's Weblogs. The way it works, in theory, is that someone sets up a Weblog that accepts trackbacks and pingbacks. You, then, set up your own Weblog that takes advantage of this functionality. When the other guy posts something interesting, you read it and feel inspired to say something about it. You post something in your own Weblog, with a link to the other guy's post. A ping or trackback notification is sent, and the other site tucks a reference to your post into its comment stream or a special trackback section (depending on how the Weblog was set up). This way, the two Weblogs automatically link to each other, resulting in a mutually beneficial relationship for improving search engine rankings and otherwise advertising each other's material.
In practice, on unprotected Weblogs that accept both comments and trackbacks or pingbacks, far more spam arrives through the trackbacks and pingbacks than through the comments. This bypasses the CAPTCHA gatekeeper entirely. The spammer simply writes a script that sends a trackback notification or ping to your Weblog, identifying some page as the relevant source of the supposed link to your site. Of course, there is no such link there, nor is there anything relevant to your Weblog post; if there were such a link, the thousands of different sites being spammed with these trackbacks and pingbacks would soon fill the page with nothing but links to others' Weblogs, and the evil SEO plans of the spammer would be defeated.
By setting up your Weblog's trackback or pingback feature to ensure there is at least a link to your Weblog on the source site, you can ensure that the majority of these things will be rejected outright without ever having to lay eyes on them yourself. The few that get through will probably be borderline cases where at least the guy at the other end has a link to your site on the page somewhere. Whether you want to accept that is up to you.
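The check described above can be sketched in a few lines of Python. This is only an illustration under some assumptions, not how any particular Weblog package does it: the function name `source_links_to`, the helper class, and the `.example` host names are all made up for the example, and the sketch only looks at ordinary `<a href>` links in HTML that has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def source_links_to(source_html, source_url, my_host):
    """Return True if the claimed source page contains a link to my_host.

    source_html -- HTML of the page named in the trackback or pingback
    source_url  -- URL of that page, used to resolve relative links
    my_host     -- host name of your own Weblog (hypothetical value below)
    """
    collector = _LinkCollector()
    collector.feed(source_html)
    for href in collector.hrefs:
        # Resolve relative links against the page that claims to link to us.
        host = urlsplit(urljoin(source_url, href)).hostname
        if host and host.lower() == my_host.lower():
            return True
    return False


# Example: one page that really links to us, one that does not.
good = '<p>See <a href="http://myblog.example/post/1">this post</a>.</p>'
spam = '<p><a href="http://pills.example/">Cheap pills</a></p>'
print(source_links_to(good, "http://other.example/entry", "myblog.example"))
print(source_links_to(spam, "http://other.example/entry", "myblog.example"))
```

In a real handler, `source_html` would come from fetching the URL named in the notification (for instance with `urllib.request.urlopen` and a timeout), and a notification whose source fails the check would be rejected before it ever reaches the moderation queue.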
Hidden fields
Unlike the previous example, this should work everywhere there is a Web form, with some caveats.
Finally, the weakest approach is to use the <input type="hidden"> element in your form. This is generally the easiest for a spammer script to recognize, though it is also the least likely to trip up a genuine human user.
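As a rough sketch of the hidden-field idea, the form can carry a trap field that a human never sees and so never fills in, while a script that fills in every field it finds gives itself away. Everything named here is an assumption made for illustration: the field name `website_url`, the `FORM_HTML` fragment, and the function `looks_like_script` are hypothetical, and the form data is assumed to arrive as a plain dictionary of strings.

```python
# Hypothetical name for the trap field; pick something a script
# would be tempted to fill in.
HONEYPOT_FIELD = "website_url"

# Form fragment showing the trap field next to the real ones.
FORM_HTML = """
<form method="post" action="/comment">
  <input type="text" name="name">
  <textarea name="comment"></textarea>
  <input type="hidden" name="website_url" value="">
  <input type="submit" value="Post">
</form>
"""


def looks_like_script(form_data, honeypot_field=HONEYPOT_FIELD):
    """Return True if the hidden trap field came back with content.

    A browser submits the hidden field with its original empty value,
    so any non-blank content suggests a script filled it in.
    """
    value = form_data.get(honeypot_field, "")
    return value.strip() != ""


# Example: a normal submission and one from a field-stuffing script.
human = {"name": "Ann", "comment": "Nice post.", "website_url": ""}
bot = {"name": "x", "comment": "Buy now", "website_url": "http://pills.example/"}
print(looks_like_script(human))
print(looks_like_script(bot))
```

A script that leaves hidden inputs alone passes this check, which is why it is best treated as one cheap filter among several rather than as a gate on its own.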
There are surely dozens of other ways to automatically exclude scripts while letting the human users in — ways that, at least in certain cases, work much better than CAPTCHAs. While CAPTCHAs are certainly effective a lot of the time in weeding out nonhuman visitors, they are also effective at throwing out the wheat with the chaff sometimes, and the frequency of this problem arising seems to be increasing.
Consider your options. If you choose to use a CAPTCHA system, consider using one that is (almost?) guaranteed to let all humans through the gate, even if it lets in some spammer scripts as well; use an alternative technique or two that should not get in the way of human users in the vast majority of cases to help weed out the rest. Consider also that perhaps you should just use something particular to your needs, that works better for them than CAPTCHAs in those circumstances, and leave the annoying extra step of CAPTCHA solving out of the end user experience entirely.
If you really want to help contribute to the development of artificial intelligence, though, you can put your CAPTCHA in a hidden element on the page, so only the scripts will see it.