Comment Spam War

March 22, 2014

On a light day, I get approximately 22 requests per hour, 24 hours a day, to add spam comments to this site. Sometimes the same IP address pops up for long stretches at a time; sometimes the requests come from hundreds of different IP addresses throughout the day, most of them likely spoofed.

For the last decade, it's been a continuous side project to prevent this with the least intervention from me, without using a third-party service, and with the fewest hurdles for somebody who wants to make a legitimate comment. To this day, I have yet to nail down a perfect solution, but what I do to prevent spam and what I've learned may be somewhat useful to others experimenting with the same problem.

Here's a one-hour snippet of my logs from a typical day, with the actual IP addresses masked.
2014-03-19 11:51:49 Comment blacklisted from 175.44.X.X IP(175.44.X.X)
2014-03-19 11:50:09 Comment CAPTCHA code does not match. IP(31.41.X.X)
2014-03-19 11:31:50 Comment blacklisted from 112.111.X.X IP(112.111.X.X)
2014-03-19 11:24:03 Comment blacklisted from 112.111.X.X IP(112.111.X.X)
2014-03-19 11:19:16 Comment CAPTCHA code does not match. IP(146.0.X.X)
2014-03-19 11:18:15 Comment CAPTCHA code does not match. IP(137.175.X.X)
2014-03-19 11:15:50 Comment CAPTCHA code does not match. IP(137.175.X.X)
2014-03-19 11:15:47 Comment CAPTCHA code does not match. IP(137.175.X.X)
2014-03-19 11:15:34 Comment blacklisted from 175.44.X.X IP(175.44.X.X)
2014-03-19 11:13:08 Comment CAPTCHA code does not match. IP(91.207.X.X)
2014-03-19 11:07:15 Comment blacklisted from 175.44.X.X IP(175.44.X.X)
2014-03-19 11:07:06 Comment blacklisted from 175.44.X.X IP(175.44.X.X)
2014-03-19 11:00:22 Comment blacklisted from 112.5.X.X IP(112.5.X.X)
2014-03-19 10:59:49 Comment CAPTCHA code does not match. IP(199.15.X.X)
2014-03-19 10:54:15 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:14 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:13 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:12 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:10 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:10 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:07 Comment CAPTCHA code does not match. IP(216.151.X.X)
2014-03-19 10:54:05 Comment CAPTCHA code does not match. IP(216.151.X.X)

I have not gone to great lengths to gather a lot of data about spam comments, but I do gather some. It became obvious after a while that the "algorithm" used to post spam comments does change. Every time I put something new in place to prevent spam comments, they completely stop for a while. Then, all of a sudden, they come back, usually as a trickle at first. This is basically a constant cycle. Letting comment spam through even momentarily makes your website an indefinite target.

Here are some examples of what I do, but these measures are constantly changing.

Check Referer

When a comment is posted, I check the referer. It must be coming from this website and as an extra check, it must be coming from the right page.

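// Treat a missing Referer header the same as a wrong one.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$splitup = parse_url($referer);

// parse_url() returns false for badly malformed URLs and omits 'host'
// when there is none, so check for it before comparing.
if (!isset($splitup['host']) || strtolower($splitup['host']) != strtolower($config['domain']))
{
   return false;
}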

You can also do this in your .htaccess with something like this:

<IfModule mod_rewrite.c>
   RewriteEngine On
   RewriteCond %{REQUEST_METHOD} POST
   RewriteCond %{REQUEST_URI} comment\.php
   RewriteCond %{HTTP_REFERER} !yourdomainname [OR]
   RewriteCond %{HTTP_USER_AGENT} ^$
   # Send the request back to the address it came from.
   RewriteRule .* http://%{REMOTE_ADDR}/ [R=301,L]
</IfModule>
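The PHP check above only compares the host, while the "right page" part of the check is not shown. Here is a minimal sketch of how both could be tested together. The function name `referer_matches_page` and the `$expectedPath` parameter are assumptions for illustration, not code from this site. Keep in mind that the Referer header is supplied by the client, so this only filters out careless bots.

```php
<?php
// Hypothetical helper (not from the original post): returns true only when
// the Referer names the expected host AND the expected page path.
function referer_matches_page($referer, $domain, $expectedPath)
{
   $parts = parse_url($referer);

   // parse_url() returns false on badly malformed URLs and leaves out
   // 'host' when the string has none (for example an empty Referer).
   if ($parts === false || !isset($parts['host']))
   {
      return false;
   }

   if (strtolower($parts['host']) != strtolower($domain))
   {
      return false;
   }

   // A Referer without a path (e.g. "http://example.com") means "/".
   $path = isset($parts['path']) ? $parts['path'] : '/';

   // The query string is ignored; only the page path has to match.
   return $path === $expectedPath;
}
```

It would be called with `$_SERVER['HTTP_REFERER']` (or an empty string when the header is missing), the configured domain, and the path of the post that the comment form belongs to.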

Blacklist

I have a blacklist of IP addresses that are not allowed to post comments. The blacklist was initially seeded with all address blocks from China. Those IP addresses accounted for 99% of spam comments to this site. I manually add IP addresses to this regularly if I see repeat offenders.

Here's the basic functionality to check whether an IP address is blacklisted. Each row of the CSV holds the start and end address of a blocked range. You could just as well load the CSV into a database.

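class Comment
{
   // Returns true when $ip falls inside any range listed in blacklist.csv.
   // Each CSV row is expected to be: range start, range end.
   public static function blacklisted($ip)
   {
      $handle = fopen(dirname(__FILE__).'/blacklist.csv', 'r');
      if ($handle === false)
      {
         trigger_error("blacklist.csv not found");
         return false;
      }

      $found = false;
      while (($data = fgetcsv($handle, 100)) !== false)
      {
         // Skip blank or malformed rows that lack both columns.
         if (isset($data[0], $data[1]) && self::ip_in_range($data[0], $data[1], $ip))
         {
            $found = true;
            break;
         }
      }
      // Close the file on every path, including after a match.
      fclose($handle);

      return $found;
   }

   // Compares the addresses numerically. ip2long() returns false for
   // anything that is not a valid IPv4 address.
   private static function ip_in_range($start, $end, $ip)
   {
      $s = ip2long($start);
      $e = ip2long($end);
      $i = ip2long($ip);
      if ($s !== false && $e !== false && $i !== false)
      {
         return ($s <= $i && $i <= $e);
      }
      return false;
   }
}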
 
if (Comment::blacklisted($_SERVER['REMOTE_ADDR']))
{
   return false;
}

CAPTCHA

Entering a CAPTCHA code from an image is required to post a comment. I know this works because this check alone causes most comment posts to fail. However, it definitely adds a barrier for legitimate commenters. You could use something like this, but there are many solutions.
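Whichever CAPTCHA generator you pick, the server-side comparison tends to look similar. The sketch below assumes the expected code was stored (for example in the session) when the image was generated. The function name `captcha_matches` is hypothetical and this is not the code running on this site.

```php
<?php
// Hypothetical helper: compares the code stored when the image was
// generated with what the visitor typed. Case and surrounding
// whitespace are ignored.
function captcha_matches($expected, $submitted)
{
   // A missing or empty stored code must never count as a match.
   if (!is_string($expected) || $expected === '' || !is_string($submitted))
   {
      return false;
   }

   // hash_equals() (PHP 5.6+) compares in constant time.
   return hash_equals(strtolower($expected), strtolower(trim($submitted)));
}
```

After each attempt, the stored code should be discarded so that one image cannot be retried indefinitely.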

Limit Number of URLs in a Comment

Surprisingly, this was the latest major update and it worked wonderfully for several months: not a single spam comment until they figured it out. I limited the number of URLs in a comment to 3.

Here's an example for just checking http:// links.

if (substr_count($_POST['cmtcomment'],"http://") > 3)
{
   $notice = "Hey that's a lot of links in your comment.  You're not a spammer are you?  Remove some links from <a href=\"#comment\">your comment</a> to prove it.";
   return false;
}
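The check above only sees lowercase http:// links. A slightly broader variant, sketched here under the assumption that you also want to count https:// and mixed-case schemes, could use a regular expression instead. The function names `count_links` and `too_many_links` are made up for this example.

```php
<?php
// Hypothetical helper: counts http:// and https:// occurrences,
// ignoring case, so "HTTPS://" is counted as well.
function count_links($text)
{
   $count = preg_match_all('~https?://~i', $text, $matches);

   // preg_match_all() returns false on a regex error; treat that as zero.
   return $count === false ? 0 : $count;
}

// Mirrors the rule above: more than $max links is rejected.
function too_many_links($text, $max = 3)
{
   return count_links($text) > $max;
}
```

With this in place, the condition becomes `too_many_links($_POST['cmtcomment'])`.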

Honeypot Fields

The idea of a honeypot field is great and it works really well. I add hidden fields to the comment form and if any of those fields is populated, the comment is rejected. This flag gets hit a lot which means that the comment spam is mostly automated.
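A minimal sketch of the server-side half of a honeypot follows. The function name `honeypot_tripped` and the field names in the usage note are invented for illustration; the real form would include those fields in its markup and hide them from people.

```php
<?php
// Hypothetical helper: returns true when any trap field arrived with
// content. Real visitors never see these fields, so they stay empty.
function honeypot_tripped($post, $trapFields)
{
   foreach ($trapFields as $field)
   {
      if (!isset($post[$field]))
      {
         continue;
      }

      $value = $post[$field];

      // An array here (e.g. "website[]=x") is not something the real
      // form can produce, so it counts as a hit too.
      if (is_array($value) || trim((string) $value) !== '')
      {
         return true;
      }
   }

   return false;
}
```

A call might look like `honeypot_tripped($_POST, array('website', 'email2'))`, with the comment rejected whenever it returns true.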

Other Ideas

One idea I've had is to check the amount of time between visiting the page and posting a comment. For example, if that happens within 5 seconds, there's no way a human visited the page, went to the bottom of it, and wrote a comment in that time.
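I have not built the timing idea yet, but one way it could work is to put a signed timestamp in a hidden form field when the page is rendered and check it when the comment arrives. Everything in this sketch is an assumption: the function names, the token format, and the `$secret` value, which would be a private string kept on the server.

```php
<?php
// Hypothetical helper: builds "timestamp:signature" for a hidden field.
function make_form_token($secret, $now)
{
   return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// Hypothetical helper: true when the post should be rejected because the
// token is missing, forged, or younger than $minSeconds.
function posted_too_fast($token, $secret, $now, $minSeconds = 5)
{
   if (!is_string($token))
   {
      return true;
   }

   $parts = explode(':', $token, 2);
   if (count($parts) !== 2 || preg_match('/^[0-9]+\z/', $parts[0]) !== 1)
   {
      return true; // missing or malformed token
   }

   // Recompute the signature so a bot cannot simply backdate the timestamp.
   $expected = hash_hmac('sha256', $parts[0], $secret);
   if (!hash_equals($expected, $parts[1]))
   {
      return true; // timestamp was tampered with
   }

   return ($now - (int) $parts[0]) < $minSeconds;
}
```

The page would print `make_form_token($secret, time())` into the form, and the handler would call `posted_too_fast($_POST['token'], $secret, time())`. This sketch does not stop a bot from saving an old token and replaying it later, so it would need an upper age limit as well.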

Another idea would be to move to a more complex form of CAPTCHA that requires a minimal amount of thinking from a person but is hard for AI. An example question would be: "What country borders the United States to the north?"
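As a rough sketch of the question idea, the form could show a question picked from a small bank and the handler could compare the answer after normalizing it. The `$questions` array, its keys, the second question, and the `question_answered` function are all hypothetical.

```php
<?php
// Hypothetical question bank. Accepted answers are stored in lowercase.
$questions = array(
   'north' => array(
      'text'    => 'What country borders the United States to the north?',
      'answers' => array('canada'),
   ),
   'south' => array(
      'text'    => 'What country borders the United States to the south?',
      'answers' => array('mexico'),
   ),
);

// Hypothetical helper: true when $answer is an accepted answer for the
// question with key $id. Case and surrounding whitespace are ignored.
function question_answered($questions, $id, $answer)
{
   if (!is_string($id) || !is_string($answer) || !isset($questions[$id]))
   {
      return false;
   }

   $normalized = strtolower(trim($answer));

   return in_array($normalized, $questions[$id]['answers'], true);
}
```

The key of the question that was shown would travel with the form (or live in the session) so the handler knows which answer to expect.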
