Beyond robots.txt

For the most part, when you have a web site, you want search engines to find it. But there may be parts you don't want them to crawl: duplicate pages, say, or scripts that put a heavy load on the database. There are many legitimate reasons to keep crawlers out of certain areas.

That’s where a robots.txt file comes in. There’s an entire specification for robots.txt, documenting what rules the various responsible search spiders will follow.
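As a minimal sketch of what such a file looks like (the paths here are hypothetical examples, not from any particular site):

```
# Applies to all well-behaved crawlers
User-agent: *
# Keep them out of duplicate and database-heavy areas
Disallow: /print/
Disallow: /search
```

The file lives at the site root (e.g. `/robots.txt`), and compliant spiders fetch it before crawling anything else.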

The problem is that robots.txt is voluntary: not all spiders obey it. Right now, for example, there's a whole world of AI crawlers trying to suck down the contents of the entire Internet.
