
There may be a better way to weed out spammers than CAPTCHA

CAPTCHAs may be the de facto standard for protecting against spammer scripts, but that doesn't mean they're always the best tools to use.

CAPTCHA, or "Completely Automated Public Turing test to tell Computers and Humans Apart", is easily the most widely used approach to testing site visitors to determine whether they are humans or scripts. As scripts become more sophisticated at the text recognition needed to solve CAPTCHAs, the CAPTCHAs themselves have to get harder to read so that site operators can stay a step ahead of the scripts and keep spammers out.

From the defenders' point of view, two problems with this arms race are becoming serious issues:

  1. CAPTCHAs are getting too difficult even for humans to read, let alone scripts. While this may help keep the scripts out, it also keeps the humans out, which generally defeats the purpose of the site. As the arms race progresses, we may soon reach a point where the scripts are better at solving CAPTCHAs than the humans.
  2. Spammers are starting to use humans to augment their scripts, crowdsourcing their CAPTCHA solving. Human micro-employment services like Amazon's Mechanical Turk are actually being used to get real human beings to solve CAPTCHAs. Sometimes, this is just someone testing CAPTCHA generators to ensure that humans are capable of reading them, but CAPTCHA solving for spammers is also being crowdsourced, which means that differentiating between human and code is no longer sufficient. What difference does it make how well you differentiate between people and scripts if the spammers are getting humans to do their work for them anyway?

Approaches other than CAPTCHA are needed if we wish to keep accomplishing something beyond encouraging the development of strong AI in malware. Some special-case alternatives have already arisen, and they tend to be less annoying and intrusive for human users browsing the Web. Two that I have used include:

  • Source checking for trackbacks and pingbacks to test for legitimacy.
  • Using hidden fields to catch scripts reading what humans do not see.

Trackback and pingback source checking

For protecting a Weblog against spam trackbacks and pingbacks, a source checker -- checking the source site of the trackback or pingback, that is, and not anyone's source code -- can be invaluable. Trackbacks and pingbacks are automated means of getting links to your Weblog posts placed in other people's Weblogs. The way it works, in theory, is that someone sets up a Weblog that accepts trackbacks and pingbacks. You then set up your own Weblog that takes advantage of this functionality. When the other guy posts something interesting, you read it and feel inspired to say something about it. You post something in your own Weblog, with a link to the other guy's post. A ping or trackback notification is sent, and the other site tucks a reference to your post into its comment stream or a special trackback section (depending on how the Weblog was set up). This way, the two Weblogs automatically link to each other, resulting in a mutually beneficial relationship that improves search engine rankings and otherwise advertises each other's material.

In practice, on unprotected Weblogs that accept both comments and trackbacks or pingbacks, far more spam arrives via the trackbacks or pingbacks than via the comments. This bypasses the CAPTCHA gatekeeper entirely. The spammer simply writes a script that sends a trackback notification or ping to your Weblog, identifying some page as the source of a supposed link to your site. Of course, there is no such link there, nor is there anything relevant to your Weblog post; if there were such a link, the thousands of different sites being spammed with these trackbacks and pingbacks would soon fill the page with nothing but links to others' Weblogs, and the evil SEO plans of the spammer would be defeated.

By setting up your Weblog's trackback or pingback feature to check that there is at least a link to your Weblog on the source page, you can ensure that the majority of these things are rejected outright without your ever having to lay eyes on them. The few that get through will probably be borderline cases where the guy at the other end at least has a link to your site somewhere on the page. Whether you want to accept those is up to you.
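As a rough illustration, a source check along those lines might look something like the following Python sketch. The function name, the use of the standard library's urllib, and the naive substring test are my own assumptions for illustration, not part of any particular Weblog package.

```python
# Hypothetical sketch of trackback/pingback source checking.
# Uses only the Python 3 standard library; names are illustrative.
import urllib.request
from urllib.error import URLError

def source_links_back(source_url: str, my_weblog_url: str, timeout: int = 10) -> bool:
    """Return True only if the page that sent the trackback or pingback
    actually contains a link back to this Weblog."""
    try:
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            page = response.read().decode("utf-8", errors="replace")
    except (URLError, ValueError):
        # Unreachable or malformed source page: treat the notification as spam.
        return False
    # Naive containment check; a real implementation might parse the HTML
    # and inspect href attributes instead of searching the raw text.
    return my_weblog_url in page

# Example: reject the notification outright if no link back is found.
if not source_links_back("http://example.com/some-post", "http://myweblog.example.net"):
    print("Trackback rejected: no link back to this Weblog on the source page.")
```

A real Weblog engine would plug a check like this into its trackback or pingback handler, so the rejection happens before anything ever appears on the site.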

Hidden fields

Unlike the previous example, this should work everywhere there is a Web form -- with some caveats.

The efficacy of this approach varies with the implementation and the type of users your site targets. The most effective implementation of this protection against malicious spammer scripts so far relies on JavaScript, which may also (unfortunately) turn away users who have JavaScript turned off or blocked in their browsers, or at least make things a little more confusing for them.

The idea is simple: most scripts that crawl the Web (including search engine spiders) do not include JavaScript engines that process the page before the script does its thing. Include a field in your Web form that, if it contains any value, causes the submitted data to be rejected. Make that field look like something that needs to be filled in. Then use JavaScript to ensure it never shows up on the page. A spammer script that does not handle JavaScript will see the field and try to fill it in; a human using a JavaScript-enabled browser will never see the field, and all will be well with the world.
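The server-side half of that idea can be as simple as the following Python sketch. The field name website_url and the surrounding form-handling details are assumptions for illustration; the client-side half is just a line or two of JavaScript that hides or removes the decoy field before a human ever sees the page.

```python
# Hypothetical server-side check for a JavaScript-hidden "honeypot" field.
# The field name and the sample data are illustrative assumptions.
HONEYPOT_FIELD = "website_url"  # looks like a real field to a naive script

def is_probably_spam(form_data: dict) -> bool:
    """Reject any submission in which the decoy field was filled in.
    Humans never see the field (JavaScript hides it), so any value here
    almost certainly came from a script."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())

# Example usage with whatever form-parsing layer your site already provides:
submission = {"name": "Alice", "comment": "Nice post!", "website_url": ""}
if is_probably_spam(submission):
    print("Submission rejected as likely script-generated.")
else:
    print("Submission accepted for normal processing.")
```

The same rejection logic applies unchanged to the CSS and plain hidden-input variants described next; only the way the decoy field is hidden from human visitors differs.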

Another approach involves CSS instead of JavaScript. Set the visibility property of the relevant form field to hidden. Browsers that handle CSS should then ensure that human beings never see that form field, while scripts that do not handle CSS would see the field and try to fill it in, failing to bypass your protections. To maximize the effectiveness of this approach, use a separate stylesheet rather than putting your styling information for the form on the HTML source page itself.

Finally, the weakest approach is to use the <input type="hidden"> element in your form. This is generally the easiest for a spammer script to recognize, though it is also the least likely to trip up a genuine human user.

Others

There are surely dozens of other ways to automatically exclude scripts while letting the human users in -- ways that, at least in certain cases, work much better than CAPTCHAs. While CAPTCHAs are certainly effective much of the time at weeding out nonhuman visitors, they are sometimes just as effective at throwing out the wheat with the chaff, and that problem seems to be arising more and more often.

Consider your options. If you choose to use a CAPTCHA system, consider using one that is (almost?) guaranteed to let every human through the gate, even if it lets in some spammer scripts as well, and use an alternative technique or two that should not get in the way of human users in the vast majority of cases to help weed out the rest. Consider also that something tailored to your particular needs may simply work better than CAPTCHAs in your circumstances, letting you leave the annoying extra step of CAPTCHA solving out of the end user experience entirely.

If you really want to help contribute to the development of artificial intelligence, though, you can put your CAPTCHA in a hidden element on the page, so only the scripts will see it.

About

Chad Perrin is an IT consultant, developer, and freelance professional writer. He holds both Microsoft and CompTIA certifications and is a graduate of two IT industry trade schools.

7 comments
Jeterdawg

I deplore CAPTCHAs, so I chose not to implement them on my site. On my site, to post any type of comment or make the wiki-like adjustments to data, users must be logged in, so the only place I need to test for spammers is the registration page. My two tests are a randomly generated question (such as "what color is a stop sign?") drawn from a pool of about 50 questions, and a timer. The form should take at the very least 2 minutes to fill out, so if it took less, it is assumed to be spam. What if the user doesn't know the answer to one of the questions? Well, that is a way to weed out brain-dead users, I suppose. If they can't answer the questions, there is no way they will be able to follow instructions for other areas of the site. Those probably won't work for everyone, but if they do, more power to you.
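For what it's worth, a minimal Python sketch of the question-pool-plus-timer idea described in this comment might look like the following. The sample questions, field handling, and two-minute threshold are illustrative assumptions based on the comment, not the commenter's actual code.

```python
# Illustrative sketch of a question-pool-plus-timer registration check.
# The question list and threshold are assumptions; adapt to your own form handler.
import random
import time

QUESTIONS = {
    "What color is a stop sign?": "red",
    "How many days are in a week?": "7",
    # ...roughly 50 such questions in a real pool
}
MIN_SECONDS = 120  # a legitimate human should need at least two minutes

def pick_question() -> str:
    """Choose a random question to embed in the registration form;
    record the time the form was served alongside it."""
    return random.choice(list(QUESTIONS))

def looks_human(question: str, answer: str, form_served_at: float) -> bool:
    """Accept the registration only if the answer is correct and the form
    took a plausibly human amount of time to fill out."""
    took_long_enough = (time.time() - form_served_at) >= MIN_SECONDS
    answered_correctly = answer.strip().lower() == QUESTIONS.get(question, "")
    return took_long_enough and answered_correctly
```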

seanferd

that's behind a paywall would get his research, or whatever it is, made public. Implemented. I'm thinking it could be implemented in lieu of captcha as well as for authentication. (Based solely on my ignorance and the little bit of information provided in the abstract.) Uh, this was it: Simple Arithmetic for Faster, More Secure Websites

CharlieSpencer

I don't care what the solution is, as long as you can convince Jason or his PTBs to implement one of them around here!

Sterling chip Camden

"If you really want to help contribute to the development of artificial intelligence, though, you can put your CAPTCHA in a hidden element on the page, so only the scripts will see it." A somewhat draconian approach to comment moderation.

apotheon

The rampant absurdity of paywalls for noncommercial research just really chaps my hide.

info

You don't think that anyone should be compensated for the time they've put into ANY work? The 'Open Source' / 'Free Information' way is great! Until you're not living in your parent's basement anymore...

apotheon

We all know there are other ways to make money than paywalls. You can't be that stupid -- can you?
