During the development of one of our extensions, a member of the development team asked me how to authenticate a search engine bot. The easy answer is by user agent, but what if somebody changes their browser's user agent? I then found a nice idea on the Official Google Webmaster Central Blog: use DNS lookups to authenticate them.

It looks very simple at the beginning, and I even found some bare-bones code that could work. But since our extensions are used worldwide, we can't limit the search engines to a closed list. Verifying every visitor's IP is pointless and wastes resources, so we chose to select which user agents will be verified. A spider's user agent usually includes its domain, and once we extract the domain from the user agent we can perform the DNS verification.
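The idea can be sketched in Python as follows. This is only a minimal illustration, not the code used in our extensions: the function names are hypothetical, the domain is naively taken as the last two labels of the first URL found in the user agent (which would be wrong for suffixes such as "co.uk"), and the forward lookup covers IPv4 only. The check is a reverse lookup of the IP followed by a forward lookup of the returned host name, which must resolve back to the same IP.

```python
import re
import socket

# Matches the host part of a URL embedded in a user agent string,
# e.g. "+http://www.google.com/bot.html" -> "www.google.com".
URL_HOST_RE = re.compile(r"https?://([a-z0-9.-]+)", re.IGNORECASE)


def domain_from_user_agent(user_agent):
    """Return the last two labels of the first URL host found, or None.

    Naive assumption: the registrable domain is always two labels long.
    """
    match = URL_HOST_RE.search(user_agent)
    if match is None:
        return None
    labels = match.group(1).lower().strip(".").split(".")
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


def host_in_domain(hostname, domain):
    """True if hostname is the domain itself or one of its subdomains."""
    hostname = hostname.lower().rstrip(".")
    return hostname == domain or hostname.endswith("." + domain)


def verify_by_dns(ip, domain,
                  reverse=socket.gethostbyaddr,
                  forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then confirm with a forward lookup.

    The resolvers are parameters so they can be replaced in tests.
    """
    try:
        hostname = reverse(ip)[0]
        if not host_in_domain(hostname, domain):
            return False
        # The forward lookup must list the original IP (IPv4 only here).
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Only visitors whose user agent yields a domain would go through verify_by_dns, which keeps the number of DNS queries low.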

Keep the following facts in mind:

  • Not every user agent that includes a domain belongs to a spider. Bsalsa develops an embedded web browser component package whose user agent includes Bsalsa's domain.
  • Not all crawlers include a domain name in their user agent.
  • Finally, some authentic search engine spiders include a domain in their user agent that differs from the one returned by the reverse lookup of their IP address. Google, for example, includes "google.com" in the user agent, but the DNS lookup returns "googlebot.com". Something similar happens with some MSN bots, which include "search.msn.com" in their user agent while the reverse DNS lookup resolves to "search.live.com" or "phx.gbl".
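One way to cope with the last point is a small table that maps the domain claimed in the user agent to the domains accepted from the reverse lookup. The sketch below is an assumption about how such a table could look, built only from the examples above; the names are hypothetical, and accepting the claimed domain itself as well is a design choice, not a rule.

```python
# Domain claimed in the user agent -> extra domains accepted
# from the reverse DNS lookup (illustrative entries only).
REVERSE_DNS_ALIASES = {
    "google.com": ("googlebot.com",),
    "search.msn.com": ("search.live.com", "phx.gbl"),
}


def accepted_domains(claimed_domain):
    """Return the claimed domain plus any known reverse DNS aliases."""
    claimed = claimed_domain.lower()
    return (claimed,) + REVERSE_DNS_ALIASES.get(claimed, ())


def hostname_matches(hostname, claimed_domain):
    """True if hostname belongs to any domain accepted for claimed_domain."""
    host = hostname.lower().rstrip(".")
    return any(
        host == domain or host.endswith("." + domain)
        for domain in accepted_domains(claimed_domain)
    )
```

A table like this has to be maintained by hand, which is one reason the process cannot be fully automatic.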

In addition to all of the above, there is a valuable service from Project Honey Pot, which aims to identify spammers and the spambots they use to scrape addresses from websites. Project Honey Pot provides the HTTP Blacklist service which, in addition to spammers' IPs, identifies some search engines' IPs, giving us an additional resource for authenticating search engine bots.
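A query to the HTTP Blacklist is itself a DNS lookup. The sketch below reflects my understanding of the published http:BL API and should be checked against the official documentation: the query name is the access key, the reversed IPv4 octets and the zone "dnsbl.httpbl.org"; the answer is an address of the form 127.x.y.z, where a last octet of 0 marks a search engine and the third octet then identifies which one. The function names are hypothetical and the access key must be your own.

```python
import socket

HTTPBL_ZONE = "dnsbl.httpbl.org"


def httpbl_query_name(access_key, ipv4):
    """Build the DNS name to look up for an IPv4 address."""
    reversed_octets = ".".join(reversed(ipv4.split(".")))
    return f"{access_key}.{reversed_octets}.{HTTPBL_ZONE}"


def parse_httpbl_answer(answer):
    """Decode a 127.x.y.z answer; return None if it is not a valid reply."""
    first, second, third, visitor_type = (int(p) for p in answer.split("."))
    if first != 127:
        return None
    if visitor_type == 0:
        # Assumed layout: the third octet identifies the search engine.
        return {"search_engine": True, "engine_id": third}
    return {
        "search_engine": False,
        "days_since_last_activity": second,
        "threat_score": third,
        "type_bits": visitor_type,
    }


def lookup_httpbl(access_key, ipv4, resolve=socket.gethostbyname):
    """Query http:BL; None means the IP is not listed or the lookup failed."""
    try:
        answer = resolve(httpbl_query_name(access_key, ipv4))
    except OSError:
        return None
    return parse_httpbl_answer(answer)
```

An IP that http:BL reports as a search engine can be treated as one more piece of evidence alongside the DNS verification, not as a replacement for it.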

With all these elements, we conclude that it's possible to develop an application that authenticates most search engine bots, but human supervision will always be needed to reach a final decision during the early implementation.