Anyone who has a blog knows about the dirty little spammers, who toil hard to make the Internet a far worse place.

I knew about this issue when I first launched my blog, and quickly wired up Akismet as my only line of defence. Over the years I got a steady stream of rejected spam comments, with the occasional false positive and false negative.

Once a week I would go to the spam tab and comb through the mountains of spam to see if anything was incorrectly detected, approve it, then nuke the rest.

Such a waste of time.

## Akismet should never be your only line of protection

Akismet is a web service that prides itself on the huge amount of blog spam it traps.

It uses all sorts of heuristics, machine learning algorithms, Bayesian inference and so on to detect spam.

Every day people around the world ship it well over 31 million bits of spam for it to promptly reject. My experience is that the vast majority of comments on my blog were spam. I think this number is so high because we programmers are dropping the ball.

Automated methods of spam prevention can solve a large amount of your spam pain.

## Anatomy of a spammer

Currently, the state of the art for the sleaze-ball spammers on the Internet is very similar to what it was 10 years ago.

The motivation is totally unclear: how could posting an indecipherable message help anyone make money?

The technique however is crystal clear.

A bunch of Perl/Python/Ruby scripts are running amok, posting as many messages as possible on as many blogs as possible.

These scripts have been customised to work around the various protection mechanisms that WordPress and phpBB implement. CAPTCHA solvers are wired in, known JavaScript traps are worked around, and so on.

However, these primitive programs have yet to run full headless web browsers. This means they have no access to the DOM and cannot run JavaScript.

## The existence of a full web browser should be your first line of defence

I eliminated virtually all the spam on this blog by adding a trivial bit of protection:

    // On page load, reverse the value of each challenge field. The server
    // then checks the submitted value is the reverse of the random string
    // it rendered into the page.
    $(function() {
      $(".simple_challenge").each(function() {
        this.value = this.value.split("").reverse().join("");
      });
    });

I expect the client to reverse a random string I give it. If it fails to do so, it gets a reCAPTCHA. This is devilishly hard for a bot to cheat without a full JavaScript interpreter and DOM.
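
On the server, the check is just the mirror image. A minimal sketch (the session and field names here are illustrative, not my actual implementation):

    // Server-side sketch: the page was rendered with a random string kept in
    // the session; a real browser submits that string reversed.
    function passesChallenge(session, params) {
      var expected = session.challenge.split("").reverse().join("");
      return params.simple_challenge === expected; // if false, serve a reCAPTCHA
    }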

Of course if WordPress were to implement this, even as a plugin, it would be worked around using the monstrous evil-spammer-script and added to the list of 7000 hardcoded workarounds in the mega script of ugliness and doom.

My point here is not my trivial spam prevention that rivals FizzBuzz in its delicate complexity.

There are an infinite number of ways you can ensure your users are using a modern web browser. You can ask them to reverse, sort, transpose, truncate, duplicate a string and so on … and so on.

In fact, you could generate JavaScript on the server side that runs a random transformation on a string, then confirm on the server that the same transformation happened on the client.
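
As a sketch of that idea (every name below is made up for illustration), the server could pick one transformation per request, render the matching snippet into the page, and remember the expected answer:

    // Server-side generated challenge sketch; all names are illustrative.
    var transforms = {
      reverse:   function (s) { return s.split("").reverse().join(""); },
      sort:      function (s) { return s.split("").sort().join(""); },
      duplicate: function (s) { return s + s; },
      truncate:  function (s) { return s.slice(0, 3); }
    };

    function makeChallenge(randomString) {
      var names = Object.keys(transforms);
      var chosen = names[Math.floor(Math.random() * names.length)];
      // Render the chosen function into the page's JavaScript, keep the
      // expected answer in the session, and compare on submit.
      return { transform: chosen, expected: transforms[chosen](randomString) };
    }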

Possibly this could even be outsourced: you could force clients to make a JSONP call to a third party that shuffles and changes its algorithms on an hourly basis, then make a call from your server to confirm the result.

## reCAPTCHA should be your second line of defence

Notice how I said reCAPTCHA, not CAPTCHA. The beauty of the reCAPTCHA system is that it helps make the world a better place by digitising content that our existing OCR systems failed at. This improves the OCR software Google builds, it helps preserve old content, and it provides a general good. Another huge advantage is that it adapts to the latest advances in OCR and gets harder for the spammers to automatically crack.


Though sometimes it can be a bit too hard for us humans.

CAPTCHA systems, on the other hand, are a total waste of human effort. Not only are many of the static CAPTCHA systems broken and already hooked up in the uber-spammer script, your poor users are doing no good solving them.

There is a tiny fraction of users who seem obsessed with running JavaScript-less web browsers, using add-ons such as NoScript to get a much “safer” Internet experience. I totally understand the reasoning; however, these users can deal with some extra work. The general population has fully functioning web browsers and never needs to hit this line of defence.

## Throttles, IP bans and so on should be your last line of defence

No matter what you do, at a big enough scale some bots will attack you and attempt to post the same comment over and over on every post. If the same IP address is going crazy all over your website, the best protection is to ban it.
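
As a sketch of the throttling half (an illustration only, with arbitrary numbers):

    // In-memory throttle sketch: allow at most `limit` comments per IP
    // inside a sliding window of `windowMs` milliseconds.
    var hits = {};

    function allowComment(ip, limit, windowMs) {
      var now = Date.now();
      hits[ip] = (hits[ip] || []).filter(function (t) {
        return now - t < windowMs;
      });
      hits[ip].push(now);
      return hits[ip].length <= limit;
    }

    // e.g. allowComment(request.ip, 5, 60 * 1000) starts returning false once
    // an IP posts more than 5 comments in a minute; ban the repeat offenders.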

### I am not sure where Akismet fits in

For my tiny blog, it seems, Akismet is not really helping out anymore. I still send it all the comments for validation, mainly because that is the way it has always been. It now has a secondary, optional status.

My advice would be: get your other lines of defence up first, then think about possibly wiring up Akismet.

## What happens when the filthy spammers catch up?

Someday, perhaps, the spammers will catch up, get a bunch of sophisticated developers and hack up Chromium for the purpose of spamming. I don’t know. If and when this happens, we still have another line of defence that is implementable today.

### Headless web browsers can be thwarted

I guess some day a bunch of “headless” web browsers will be busy ruining the Internet. A huge advantage the new canvas APIs have is that we can now confirm pixels are rendered to the screen with the getImageData API. Render a few colours to the screen, read them back out and make sure everything rendered properly.
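
A minimal sketch of the kind of check I mean (the function name is made up):

    // Draw a known colour to a canvas and read it back with getImageData;
    // a client that never rasterises pixels cannot answer correctly.
    function canvasRenders() {
      var canvas = document.createElement("canvas");
      canvas.width = canvas.height = 10;
      var ctx = canvas.getContext("2d");
      ctx.fillStyle = "rgb(200, 0, 100)";
      ctx.fillRect(0, 0, 10, 10);
      var p = ctx.getImageData(5, 5, 1, 1).data;
      return p[0] === 200 && p[1] === 0 && p[2] === 100;
    }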

Sure, this will trigger a reCAPTCHA for the less modern browsers, but we are probably talking a few years before the attack of the headless web browsers.

And what do we do when this fails?

### Enter “proof of work” algorithms

We could require a second of computation from people who post comments on a blog. This is called a “proof of work” algorithm; Bitcoin uses such an algorithm. The concept is quite simple.

There are plenty of JavaScript implementations of hash functions.

  1. You hand the client a random string to hash, e.g. ABC123
  2. The client appends a nonce to the string and hashes it, e.g. ABC123!1
  3. If the hash starts with 000, or satisfies some other predefined rule, the client stops.
  4. Otherwise it increments the nonce and repeats from step 2, e.g. ABC123!2
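
In JavaScript the whole loop is only a few lines. A minimal sketch, assuming a sha256(string) function that returns a hex digest (for example from a library like js-sha256):

    // Proof-of-work sketch: find a nonce whose hash has the required prefix.
    function proofOfWork(challenge, zeros) {
      var prefix = new Array(zeros + 1).join("0"); // e.g. "000" for zeros = 3
      var nonce = 1;
      while (sha256(challenge + "!" + nonce).slice(0, zeros) !== prefix) {
        nonce++;
      }
      return nonce; // submitted with the comment; the server re-hashes to verify
    }

    // proofOfWork("ABC123", 3) needs about 4096 hashes on average, and each
    // extra hex zero multiplies the average cost by 16, so the work is tunable.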

This means you are forcing the client to do a certain amount of computation before allowing it to post a comment. This can heavily impact any automated process busy destroying the Internet: it means they need to run more computers on their quest of doom, which costs them more money.

### There is no substitute for you

Sure, a bunch of people can always run sophisticated attacks that force you to disable comments on your blog. It’s the sad reality. If you abandon your blog for long enough it will fill up with spam.

That said, if we required everyone leaving comments to have a fully working web browser, we would drastically reduce the amount of blog spam.

## Comments

Mohammad about 13 years ago

I think a time metric would also work well: some JS to say if the user spent less than 30 seconds on the page before posting a comment, they are a bot.

I like the idea of a bit of JS used to detect the browser; combined with a time metric, it would be a big candidate for spam protection.
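
A sketch of that time metric (the form and field names are made up):

    // Stamp the page load, then report the gap when the comment is submitted.
    var loadedAt = Date.now();

    $("form.comment").on("submit", function () {
      var seconds = (Date.now() - loadedAt) / 1000;
      // Anything under 30 seconds is suspicious; flag it for extra checks.
      $(this).find("input[name=seconds_on_page]").val(seconds);
    });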

Sam Saffron about 13 years ago

I have seen this done before in a few spots. The thing is that submitting a simple number is always going to be trivial for a bot; performing a computation on a string is orders of magnitude harder for these bots.

Jonas Elfström about 13 years ago

I have almost the exact same experience from running my own tiny blog and I've been planning to implement something very much like this for some time now. Thanks to you I will finally do it. Going through spam and looking for false positives is absolutely a complete waste of time!

“an indecipherable message”

Last year I got some blog spam that felt like a new category: http://alicebobandmallory.com/articles/2010/05/19/blog-comment-spam-taken-to-the-next-level

Not the usual gibberish.

Sam Saffron about 13 years ago

Nasty. There is very little you can do against a real browser attacking you, short of a reCAPTCHA and bans. Sad.

Joss_Crowcroft about 13 years ago

“we can now confirm pixels are rendered to the screen with the getImageData API”

I didn't know this – that's awesome and definitely something I'll be taking advantage of with the rewrite of MotionCAPTCHA, where the visitor needs to draw a shape to submit the form (currently all client side, but will be rewritten so that even headless browsers can't too easily solve it..)

Jono about 13 years ago

Nice approach. One idea I had was not to simply try to decide if a comment was spam or not spam, but to have a middle ground, a “we're not sure” state (after some basic checks to weed out blatantly good or bad comments).

If you get a comment that you're unsure of, perhaps ask the user to fill in a CAPTCHA (or some other client-side challenge). Or follow their links to see if those are spammy, or send it to Akismet, etc.

Could also be an idea to look at user interaction, see how long they spend on your site, whether they act “normally” when posting a comment (e.g. using JS to monitor mouse movement/key presses etc).

I've had issues with manual spammers, these idiots who actually manually type in spam, tricky to beat those guys 100% of the time! I blogged a bit about this recently (see my profile link)…

Sam Saffron about 13 years ago

Fighting manual spam is really damn hard, you could “greylist” an IP range for such a case and always manually approve comments from that range.

Luckily I have not been attacked by that here yet.

Lulalala about 13 years ago

Another common way to stop this is to have hidden trap fields. Name your URL input field with some random characters like “afggj” and have a CSS-hidden field called “url” next to it. Spambots will know to fill in the “url” field but do not know it is a trap.
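
A minimal sketch of the server-side check for that idea (field names follow the example above):

    // Honeypot sketch: the visible website field was renamed to "afggj";
    // the css-hidden "url" field is the trap. Humans leave it empty, so any
    // submission that fills it in gets flagged as a bot.
    function isHoneypotSpam(params) {
      return typeof params.url === "string" && params.url.length > 0;
    }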

Sam Saffron about 13 years ago

Someone mentioned this on Twitter; another person complained that Chrome auto-complete bit them here. I don’t think this is as strong as the JavaScript test though.

Lulalala about 13 years ago

True~ so any plan on making this a WordPress plugin? :P

Haacked about 13 years ago

I wrote about a similar approach a few years back. I called it “Invisible Captcha”. http://haacked.com/archive/2006/09/26/Lightweight_Invisible_CAPTCHA_Validator_Control.aspx

I combined that with Honeypot Captcha: http://haacked.com/archive/2007/09/11/honeypot-captcha.aspx

Problem for me now is I still get a lot of non-automated SPAM :)

Sam Saffron about 13 years ago

yeah … non-automated SPAM is a royal PITA

Seovalencia almost 13 years ago

It is better not to make a plugin. If the plugin succeeds, many hackers will study your code to break it.

Bjkeefe almost 13 years ago

Interesting post. Thanks.

Mostly, just wanted to see the challenge in action. Please feel free to delete this comment.

Sam Saffron almost 13 years ago

ha … :slight_smile:

Matt almost 13 years ago

Just a thought. You know the way PuTTYgen uses mouse movement to create real randomness? Is it possible to tell the difference between real randomness and computer-generated randomness? If so, could that also be used against a headless browser?

Sam Saffron almost 13 years ago

I doubt you would need to go that far, the trick with reading pixels from the screen would pretty much kill every bot out there. My current primitive trick has kept my blog spam free for quite a few months now. Only issue I had was last Friday when Akismet decided to mark a few comments as spam that were not.

Jonas Elfström over 12 years ago

Last night I added a JavaScript to my blog that randomly shows two images of two single digit numbers. It then asks the commenter to add those numbers. Three hours later I got five spam comments. My conclusion is that it's human entered spam because otherwise it's the most advanced spam bot I've ever heard of!

Sam Saffron over 12 years ago

Phil Haack also says he still gets a ton of spam despite JS tricks, which very much surprises me. 4 or 5 months in and I have got a total of 4 spam messages on my blog. Very strange.

Ardham_Janine over 12 years ago

I've had problems with spam for as long as I can remember and akismet helped… just a bit. Now I'm working on adding captcha. Hoping this would drastically reduce the amount of breakfast spam filtering through my website! I'm looking forward to developers coming up with advanced ways on combating the dreaded trash.

Sam Saffron over 12 years ago

wow … akismet shipped this comment to the spam folder … so odd

Sveder over 12 years ago

Great blog post. It has creative ideas and interesting techniques. I recently blogged about this subject and then got into a discussion with a friend about headless browsers. I just want to say that even though all your solutions are nice, they are still easily broken. For example:

  1. The getImageData function only returns what the browser tells it to return. Change Chromium to return a computed value, or just let the browsers run headfull (is that a word?), and it is broken.
  2. Computations are a nice way to waste a lot of CPU cycles, but spammers don't use their own machines, and the amount of hackable hardware is just growing and growing.

I'll take it even further: if people take your advice and switch to these checks making sure there is a full web browser, spammers will adapt almost immediately, ironically making your prediction that they are years away wrong. The only real, perfect solution to spam will always be on the server.

Alex almost 12 years ago

As mentioned above, whatever becomes popular becomes a target. For that reason I would avoid reCAPTCHA. There are solving services for it that turn it into a real annoyance for visitors, especially for those of us who don't see/see well, and a minor speed bump for the spammer. Some of the puzzle CAPTCHAs are at least fun and, as far as I can tell, cannot be gamed to the degree that reCAPTCHA is.

Silwin_Pereira over 11 years ago

With Akismet I still get some spam through. I don't think Akismet is 100% spam free, but it does get 99% of the sh*t out before it hits you.

Sam Saffron about 11 years ago

In general I find that the human spam comes from certain IP ranges; at Discourse we just ban high-abuse subnets.

I wonder how much of your human entered spam is isolated to a class C network.

Jonas Elfström about 11 years ago

As you most probably are already aware, there seem to be some encoding issues in the comments. Apostrophe becomes ’ and Elfström becomes Elfstr_M.

Sam Saffron about 11 years ago

Yeah, it's me being punished for upgrading MySQL one too many times through too many iterations of Ruby… fallout. I will be fixing most of it, but it will take time…

Going forward the blog is at peace with unicode.

Caue Rego about 9 years ago

How’s it going now, 4 years into the future? :stuck_out_tongue:

Your latest link on your blog post comments section is now broken. But I bet you have more insights about it.

And, just for the sake of it, I’ll try posting 3 links to test the challenge here, one of which is my latest blog post which could easily be taken as a sign for spam:

http://talk.cregox.com/t/what-does-public-sex-and-smartphones-have-in-common/7689

Why am I showing just 1 link? Well, because Discourse defaults new users to 2 links at most, and I couldn’t post even 2 without getting warned/blocked. :stuck_out_tongue_winking_eye:

Jonas Elfström about 9 years ago

My blog has been on hiatus for about three years and I’ve closed the comment section, so I’m sorry to say that I know nothing about the development of blog comment spam over the last couple of years.

