March 6th, 2007 - by ses5909

A few days ago my partner, John, noticed that our website’s content was being scraped. It wasn’t all that concerning at the time, but last night he did some keyword searches on Google to check our rankings and noticed that the site with our stolen content actually ranked higher than ours!

This was obviously a problem, so we immediately needed to figure out what steps to take. John sent a DMCA notice to Google while I put together a cease and desist letter to send to the site owner, the domain registrar, and their host. During all of this, we were determining the IP address of the site. We did a whois on their domain name, which turned up 3 different IPs, and when we pinged their domain we found a fourth.

To block their IP ranges, we added the following to the .htaccess file:

# Deny every address from 123.123.123.0 through 123.123.123.255
Order Deny,Allow
Deny from 123.123.123.0/24

This will block access for any user with an address in the 123.123.123.0 to 123.123.123.255 range.
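
Since the lookups turned up more than one address, the same idea extends to several ranges at once. Here is a minimal sketch, assuming Apache 2.2-style access control; the addresses are placeholders taken from the reserved documentation ranges, not the scraper's real IPs:

```apache
# Apache 2.2 syntax; placeholder addresses only (assumption)
Order Deny,Allow
# A whole /24 block: 192.0.2.0 through 192.0.2.255
Deny from 192.0.2.0/24
# Individual addresses can share one line
Deny from 198.51.100.7 203.0.113.42
```

On Apache 2.4 and later, Order/Deny is deprecated in favor of Require, so the equivalent would look roughly like this:

```apache
# Apache 2.4 syntax; same placeholder addresses (assumption)
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.7
    Require not ip 203.0.113.42
</RequireAll>
```

Keep in mind that a scraper can move to a new host or IP, so a list like this needs occasional upkeep.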

John then thought of a way to use this to our advantage. What if we detected any traffic from their servers and, instead of blocking it, redirected it to our homepage so they become OUR visitors? We created a rewrite condition like:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.123\.123\.
RewriteRule !^index\.php$ /index.php [R=301,L]

This should redirect any request from their IP range to our homepage. The condition tests REMOTE_ADDR, the address the request comes from, and the rule skips index.php itself so the redirect cannot loop. Now to just see if it works!
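
Matching requests that come from the scraper's own servers is different from matching people who click a link on the scraper's pages. For the second case the Referer header is the thing to test, since it holds the URL of the page the visitor came from. This is a rough sketch; the domain scraper.example is a hypothetical stand-in, and the Referer header is optional and easy to fake, so treat it as best effort:

```apache
RewriteEngine On
# scraper.example is a placeholder (assumption), with or without a subdomain
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?scraper\.example(/|$) [NC]
# Send those visitors to the homepage; skip index.php to avoid a loop
RewriteRule !^index\.php$ /index.php [R=302,L]
```

A temporary redirect (302) is used here so that browsers do not cache it permanently while you are still testing.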

You may also want to stop people from linking directly to your images, JavaScript, SWF, and CSS files. This is known as HotLinking, and it costs you bandwidth every time they do it. If you would like to prevent HotLinking, add the following to your .htaccess file:

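# START Prevent HotLinking
RewriteEngine On
# Allow requests with no Referer (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Allow requests referred by our own pages
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?search-this\.com(/|$) [NC]
# Anything else asking for these file types gets 403 Forbidden
RewriteRule \.(gif|jpg|js|css|swf)$ - [F,NC]
# END Prevent HotLinking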

This will prevent HotLinking to your gif, jpg, js, css and swf files. Just remember that mod_rewrite should be enabled for this to work.
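
If you are not sure whether mod_rewrite is loaded on your host, one option is to wrap the rules in an IfModule block. This is only a sketch of that idea, using the same search-this.com pattern as the rules in this post:

```apache
# Only apply the rules when mod_rewrite is available
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?search-this\.com(/|$) [NC]
    RewriteRule \.(gif|jpg|js|css|swf)$ - [F,NC]
</IfModule>
```

The trade-off is that without the module the rules silently do nothing, whereas leaving out the wrapper typically produces a server error that tells you right away something is missing.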

You may also decide you want to replace a HotLinked image with your own image. To do this add the following to your .htaccess file:

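RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?search-this\.com(/|$) [NC]
# Do not redirect the replacement image itself, or the redirect would loop
RewriteCond %{REQUEST_URI} !^/images/hotlinked\.jpg$
RewriteRule \.(gif|jpg)$ http://www.search-this.com/images/hotlinked.jpg [R,NC,L]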

Now when they link to one of your images it will display the alternate image that you provided.

Hope this helps someone out there…

And finally, if you need to find any merchant account related information… don’t visit the imposters; visit the Ultimate Merchant Account Resource.

14 Responses to “Stop Site Scraping”

1 Ben Partch

Hello

Thanks for this. The site in question had scraped my site also.

This was very helpful. 🙂

Also, to anyone who reads this: I notice that the site in question has scraped many, many sites just like mine and Sara’s/John’s.

Including, but not limited to, this site, Mark.

Now I am off to do all my other sites. 🙁

2 Golgotha

Hey Ben, ST gets scraped by at least half a dozen sites. It typically doesn’t affect us because our pagerank is higher than the scrapers’.

It has become an epidemic though.

3 TOMAS

Thanks for the informative post, we need more posts like this that help us thwart the bad guys and keep our content safe and secure! *hint* *hint*

Also, how did you figure out that the site scraping was going on in the first place? Was it through your visitor tracking software or Technorati?

4 John Conde

I actually found it while doing a routine ranking check for my site. My site is sandboxed (or however you wish to describe it) and appears for no quality keywords. But I check regularly anyway so I can tell when the dark cloud has lifted. 😉

Anyway, one day I noticed I ranked moderately well for some decent search terms. I checked it out and it wasn’t my site but theirs. It’s bad enough I can’t get my site ranked but I don’t need them to rank well from my hard work. I have enough of an uphill battle and I don’t need my site to be seen as duplicate content to a site scraper.

5 Golgotha

Yes Tomas, I found out through tracking software (AWStats) and Technorati. In addition, I have received e-mails from people telling me so and so is scraping your site.

6 John Loch

Is the site in question replicating your merchant site content, or this one’s (as in search-this.com)?

7 ses5909

Both. I wrote the post about my Merchant account site, but Mark is explaining how he has had the same thing happen to him.

8 Dan Schulz

This is just crazy. I know it goes on, but people just need to pull their heads out of wherever they’re shoving them and realize that they can’t get away with this kind of garbage forever.

Thanks for sharing the tip. Now to see if it works. 🙂

9 Karl Groves

I hate to be the bearer of bad news, but these methods are not reliable. I decided to test whether I could scrape this site. Typical methods with PHP and file() or file_get_contents() were blocked and simply returned a string that says “Stop Site Scraping”. However, your methods do nothing against a scraper using Curl. With 8 lines of Curl, I was able to retrieve this site completely. All the scraper would need to do then is their regular processing to put your data on their site.

Naturally, I won’t reproduce the code here, but if you know any Curl, it’s just a simple GET request.

At any rate, good luck. People stealing content suck.

[…] Blog Scraping – have you been a victim? How to protect yourself against Blog Post Theft and Splogs! Top 8 Excuses for Stealing Other People’s Content Six Steps to Prevent Content Theft and Combat Copyright Infringement on Your Business Blog The 6 Steps to Stop Content Theft How to deter thieves from stealing your images and server bandwidth Blog Plagiarism Q&A Stop Site Scraping […]

[…] If you would like to stop people from HotLinking linking to your images, javascript, swf, and CSS files. Just modify your .htaccess file with the following code. Code provided by: Search-This […]

[…] If you have had enough of these people, you can stop them from HotLinking linking to your images, javascript, swf, and CSS files. Just modify your .htaccess file with the following code. Code provided by: Search-This […]

13 Brett Wraight

Check out these guys http://www.scrapestopper.com.

They are able to stop sites from being scraped and protect a site from all forms of scraping.

They offer a trial period. Awesome, and their system is so easy to use.

In the reports they track down the scrapers for you. Very cool indeed, worth a look.

14 Plagiarism - George Andrews

[…] someone else and you don’t want them using your original content as their own you can visit this web site for information that will help you stop site scraping. Filed Under: […]

