Spam, bacon, sausage and blog spam: a JavaScript approach
Anyone who has a blog knows about the dirty little spammers, who toil hard to make the Internet a far worse place.
I knew about this issue when I first launched my blog, and quickly wired up Akismet as my only line of defence. Over the years I got a steady stream of rejected spam comments, with the occasional false positive and false negative.
Once a week I would go to the spam tab and comb through the mountains of spam to see if anything had been incorrectly flagged, approve it, then nuke the rest.
Such a waste of time.
##Akismet should never be your only line of protection.
Akismet is a web service that prides itself on the huge amount of blog spam it traps. It uses all sorts of heuristics, machine learning algorithms, Bayesian inference and so on to detect spam.
Every day people around the world are shipping it way over 31 million bits of spam for it to promptly reject. My experience is that the vast majority of comments on my blog were spam. I think this number is so high due to us programmers dropping the ball.
Automated methods of spam prevention can solve a large amount of your spam pain.
##Anatomy of a spammer
Currently, the state-of-the-art for the sleaze-ball spammers on the Internet is very similar to what it was 10 years ago.
The motivation is totally unclear: how could an indecipherable advertising message be helping anyone make money?
The technique however is crystal clear.
A bunch of Perl/Python/Ruby scripts are running amok, posting as many messages as possible on as many blogs as possible.
These scripts have been customised to work around the various protection mechanisms that WordPress and phpBB implemented. Captcha solvers are wired in, known JavaScript traps are worked around, and so on.
However, these primitive programs are yet to run full headless web browsers. This means they have no access to the DOM, and they can not run JavaScript.
##The existence of a full web browser should be your first line of defence
I eliminated virtually all the spam on this blog by adding a trivial bit of protection:
```
$(function(){
  // Reverse the random string the server put in each challenge field;
  // a bot without a JavaScript interpreter and a DOM never gets this far.
  $(".simple_challenge").each(function(){
    this.value = this.value.split("").reverse().join("");
  });
});
```
I expect the client to reverse a random string I give it. If it fails to do so, it gets a reCAPTCHA. This is devilishly hard for a bot to cheat without a full JavaScript interpreter and DOM.
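If it helps to see the other half, here is a rough sketch of what the matching server-side check might look like. The field and session names are made up for illustration; this is not the code the blog actually runs:

```
// Sketch only: store the random challenge in the session when rendering the
// comment form, then check the submitted value is its exact reverse.
function isChallengeSolved(session, params) {
  const expected = session.challenge;        // random string embedded in the form
  const answer = params.simple_challenge;    // value the browser's JS rewrote
  if (!expected || !answer) return false;
  return answer === expected.split("").reverse().join("");
}

// If this returns false, fall back to showing a reCAPTCHA instead of rejecting outright.
```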
Of course if WordPress were to implement this, even as a plugin, it would be worked around using the monstrous evil-spammer-script and added to the list of 7000 hardcoded workarounds in the mega script of ugliness and doom.
My point here is not my trivial spam prevention that rivals FizzBuzz in its delicate complexity.
There are an infinite number of ways you can ensure your users are using a modern web browser. You can ask them to reverse, sort, transpose, truncate, duplicate a string and so on … and so on.
In fact you could generate JavaScript on the server side that runs a random transformation on a string and confirm that happens on the client.
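As a sketch of that idea (the helper names and the particular set of transformations here are made up, not anything WordPress or this blog ships):

```
// Sketch: pick a random string transformation on the server, embed the matching
// JavaScript in the page, and remember the expected answer for later comparison.
const transformations = [
  { js: 'v.split("").reverse().join("")', apply: v => v.split("").reverse().join("") },
  { js: 'v.split("").sort().join("")',    apply: v => v.split("").sort().join("") },
  { js: 'v.toUpperCase()',                apply: v => v.toUpperCase() },
];

function buildChallenge(randomString) {
  const t = transformations[Math.floor(Math.random() * transformations.length)];
  return {
    expected: t.apply(randomString),  // keep this server side, e.g. in the session
    script: `$(".simple_challenge").each(function(){ var v = this.value; this.value = ${t.js}; });`,
  };
}
```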
Possibly this could be outsourced. You could force clients to make a JSONP call to a 3rd party that shuffles and changes its algorithms on an hourly basis, then make a call from the server to confirm.
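For illustration only (the service URL, callback name and form field here are invented), the client side of that could be as simple as:

```
// Sketch: ask a hypothetical 3rd-party challenge service for a token via JSONP,
// then let the blog's server verify that token with the same service.
function loadChallenge() {
  const script = document.createElement("script");
  script.src = "https://challenge.example.com/token?callback=onChallengeToken";
  document.head.appendChild(script);
}

function onChallengeToken(token) {
  // Attach the token to the comment form; the server confirms it out of band.
  document.querySelector("input[name=challenge_token]").value = token;
}

loadChallenge();
```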
##reCAPTCHA should be your second line of defence
Notice how I said reCAPTCHA, not CAPTCHA. The beauty of the reCAPTCHA system is that it helps make the world a better place by digitising content that our existing OCR systems failed at. This improves the OCR software Google builds, it helps preserve old content, and provides general good. Another huge advantage is that it adapts to the latest advances in OCR and gets harder for the spammers to automatically crack.
Though sometimes it can be a bit too hard for us humans.
CAPTCHA systems on the other hand are a total waste of human effort. Not only are many of the static CAPTCHA systems broken and already hooked up in the uber-spammer script, your poor users are doing no good by solving them.
There is a tiny fraction of users who seem to be obsessed with running JavaScript-less web browsers, using add-ons such as NoScript to get a much “safer” Internet experience. I totally understand the reasoning; however, these users can deal with some extra work. The general population has fully functioning web browsers and will never need to hit this line of defence.
##Throttles, IP bans and so on should be your last line of defence
No matter what you do, at a big enough scale some bots will attack you and attempt to post the same comment over and over on every post. If the same IP address is going crazy all over your website, the best protection is to ban it.
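A minimal sketch of that kind of throttle, with in-memory storage and made-up limits; a real site would persist this and expire old entries:

```
// Sketch: count comment attempts per IP in a rolling one-hour window and
// stop accepting comments from an IP once it goes over the limit.
const attempts = new Map();                 // ip -> array of attempt timestamps
const WINDOW_MS = 60 * 60 * 1000;
const MAX_COMMENTS_PER_WINDOW = 20;

function isThrottled(ip) {
  const now = Date.now();
  const recent = (attempts.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  attempts.set(ip, recent);
  return recent.length > MAX_COMMENTS_PER_WINDOW;
}
```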
###I am not sure where Akismet fits in
For my tiny blog, it seems, Akismet is not really helping out anymore. I still send it all the comments for validation, mainly because that is the way it has always been. It now has a secondary, optional status.
My advice would be, get your other lines of defence up first, then think of possibly wiring up Akismet.
##What happens when the filthy spammers catch up?
Someday, perhaps, the spammers will catch up, get a bunch of sophisticated developers and hack up Chromium for the purpose of spamming. I don’t know. When and if this happens, we still have another line of defence that is implementable today.
###Headless web browsers can be thwarted
I guess, some day, a bunch of “headless” web browsers will be busy ruining the Internet. A huge advantage the new canvas APIs have is that we can now confirm pixels are rendered to the screen with the getImageData API. Render a few colors to the screen, read them out and make sure everything rendered properly.
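A rough sketch of that check; the exact color used is arbitrary:

```
// Sketch: draw a known color into a tiny canvas, read it back with
// getImageData, and only trust clients where the pixels actually rendered.
function canvasRenders() {
  const canvas = document.createElement("canvas");
  canvas.width = canvas.height = 8;
  const ctx = canvas.getContext("2d");
  ctx.fillStyle = "rgb(200, 50, 10)";
  ctx.fillRect(0, 0, 8, 8);
  const [r, g, b] = ctx.getImageData(0, 0, 1, 1).data;
  return r === 200 && g === 50 && b === 10;
}

// canvasRenders() === false -> serve a reCAPTCHA instead.
```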
Sure, this will trigger a reCAPTCHA for the less modern browsers, but we are probably talking a few years before the attack of the headless web browsers.
And what do we do when this fails?
###Enter “proof of work” algorithms
We could require a second of computation from people who post comments on a blog. It is called a “proof of work” algorithm. Bitcoin uses such an algorithm. The concept is quite simple.
There are plenty of JavaScript implementations of hash functions.
- You hand the client a random string to hash, e.g. `ABC123`
- The client appends a nonce to the string and hashes it, e.g. `ABC123!1`
- If the hash starts with `000`, or satisfies some other predefined rule, the client stops and submits the nonce.
- Otherwise, it increments the nonce and hashes again, e.g. `ABC123!2`
This means you are forcing the client to do a certain amount of computation prior to allowing it to post a comment, which can heavily impact any automated process busy destroying the Internet. It means the spammers need to run more computers on their quest of doom, which costs them more money.
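As a rough illustration of the client-side loop (assuming SHA-256 via the browser's Web Crypto API, which is just one of the many JavaScript hashing options):

```
// Sketch: keep incrementing the nonce until the hash of challenge + nonce
// starts with "000", then submit the nonce along with the comment.
async function sha256Hex(text) {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
}

async function proofOfWork(challenge) {
  let nonce = 0;
  while (!(await sha256Hex(`${challenge}!${nonce}`)).startsWith("000")) {
    nonce++;
  }
  return nonce;  // the server verifies this with a single cheap hash
}

// proofOfWork("ABC123").then(nonce => console.log("send nonce:", nonce));
```

The nice asymmetry is that the server only needs one hash to verify the nonce, while the client had to compute thousands.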
###There is no substitute for you
Sure, a bunch of people can always run sophisticated attacks that force you to disable comments on your blog. It’s the sad reality. If you abandon your blog for long enough it will fill up with spam.
That said, if we required all people leaving comments to have a fully working web browser, we would drastically reduce the amount of blog spam.
I think a time metric would also work well: a bit of JS that says if the user spent less than 30 seconds on the page before posting a comment, they are probably a bot. Combining that kind of browser detection with a time metric would be a strong candidate for spam protection.
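A rough sketch of that time metric, with made-up form field names:

```
// Sketch: record when the page loaded and send the elapsed time with the comment;
// the server can treat anything under ~30 seconds as bot-like and fall back to reCAPTCHA.
const pageLoadedAt = Date.now();

document.querySelector("#comment-form").addEventListener("submit", function () {
  const secondsOnPage = Math.round((Date.now() - pageLoadedAt) / 1000);
  this.querySelector("input[name=seconds_on_page]").value = secondsOnPage;
});
```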