Are you writing targeted crawlers? Ever crush a site you’re crawling? Help is here!

Ever piss off the site you were developing the crawler for? If you’ve ever written a targeted crawler, you know you have or could have…

Jobster is developing targeted crawlers for the major employment websites, and it looks like they have a great patch for Squid that could come in handy for anyone developing targeted crawlers:

Jobster has multiple test and development environments, all crawling the same sites, which would potentially lead to a lot of duplicate traffic against those sites. A caching proxy is ideal for eliminating those duplicate hits, reducing the load on the target sites.
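The idea is that every dev environment points its crawler at one shared caching proxy instead of at the target site directly. Here's a minimal sketch of that setup in Python; the proxy hostname, port, and URLs are assumptions for illustration, not details from Jobster's setup:

```python
import urllib.request

# Hypothetical address of the shared Squid caching proxy -- substitute
# your own host and port (3128 is Squid's default).
SQUID_PROXY = "http://squid.internal.example:3128"

def make_cached_opener(proxy_url: str = SQUID_PROXY) -> urllib.request.OpenerDirector:
    """Build a URL opener that routes all crawler traffic through the
    caching proxy, so repeated fetches from multiple test/dev
    environments hit the target site only once."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Usage, in each test/dev environment:
# opener = make_cached_opener()
# html = opener.open("http://jobs-site.example/listings").read()
```

Because every environment shares the same cache, the first fetch of a page is the only one the target ever sees; the rest are served locally.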

Learn more about Jobster's Squid enhancements.

If you’ve never developed a targeted crawler, here are some reasons why it’s a bit difficult and why this patch is a “good thing”:

  1. Most sites that you’re targeting probably don’t really want to be crawled, so they might block you if they notice your traffic.
  2. Repeated traffic from one host stands out in the logs, so again… avoid hitting their site too much.
  3. Your target might have limited resources: bandwidth, connections, processing power. From the target’s perspective, a developer writing a crawler looks like a DoS attack. To help remember this, I always imagine the site is a 486 running Caudium.
  4. Your target might not always be available. It could be that your target is only online while Joe holds the antenna really still on the roof, making their 40-mile wifi connection work.
