For the most part, when you have a website, you want search engines to find it. But there may be parts you don’t want them to crawl: duplicate pages, say, or scripts that put a heavy load on the database. There are many legitimate reasons to keep crawlers out of certain areas.
That’s where a robots.txt file comes in. There’s an entire specification for robots.txt, documenting what rules the various responsible search spiders will follow.
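As a quick illustration, a minimal robots.txt asking all crawlers to stay out of a couple of areas might look like this (the directory names here are made up for the example):

```
User-agent: *
Disallow: /scripts/
Disallow: /duplicates/
```

Each `User-agent` line names which spiders a group of rules applies to (`*` means all of them), and each `Disallow` line names a path prefix they're asked not to fetch.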
The problem is, robots.txt is voluntary. Not all search spiders will obey it. Right now, for example, there’s a whole world of AI crawlers trying to suck down the contents of the entire Internet.
That’s a problem. In addition to this very public blog, I have a private site that I use for sharing photos with family and friends. Even if I don’t post links to it (and I don’t), someone else might, and there are all sorts of other ways the bots could find out.
Now, the right way to do this is to either require a login, or not put the site on the public internet. But I don’t want to manage a bunch of logins (much less make my friends use them), and not putting it online pretty much defeats the purpose of having a website for sharing photos.
So, I need to get creative.
My web host uses the Apache httpd webserver, so in addition to the robots.txt file, I also have an .htaccess file. This is a configuration file that (among other things) customizes how the web server responds to web site traffic. Here’s a segment I’ve added to the .htaccess file on my photo site.
# Activate the rewrite module (mod_rewrite)
RewriteEngine On
# If the user-agent header contains any of
# these strings
RewriteCond %{HTTP_USER_AGENT} bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} python [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Palo Alto" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Go-http-client" [NC,OR]
# Or the URL contains /wp- or /.
RewriteCond %{REQUEST_URI} /wp- [NC,OR]
RewriteCond %{REQUEST_URI} /\. [NC]
# And the URL doesn't start with one of these
RewriteCond %{REQUEST_URI} !^/\.well-known
RewriteCond %{REQUEST_URI} !^/error402\.htm$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/$
# Then send back an HTTP 402 status code.
RewriteRule .* - [R=402,L]
# If I'm sending a 402 status code,
# send this as the page.
ErrorDocument 402 /error402.htm
This is a fun little bit of configuration.
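To see how those conditions combine, here’s a hypothetical Python re-implementation of the same logic. This isn’t anything that runs on the server; it’s just a sketch of how mod_rewrite evaluates the rules: the OR-flagged conditions group together, and the remaining conditions are ANDed on as exceptions.

```python
import re

# User-agent substrings that trigger the block (case-insensitive),
# mirroring the four RewriteCond lines with the [OR] flag.
BLOCKED_AGENTS = re.compile(r"bot|python|Palo Alto|Go-http-client", re.IGNORECASE)

# URL patterns that also trigger the block.
BLOCKED_PATHS = re.compile(r"/wp-|/\.", re.IGNORECASE)

# Exceptions: URLs that are never blocked, mirroring the
# negated (!) RewriteCond lines.
ALLOWED_PATHS = re.compile(r"^/\.well-known|^/error402\.htm$|^/robots\.txt$|^/$")

def status_for(user_agent: str, path: str) -> int:
    """Return 402 if the request matches the block rules, else 200."""
    suspicious = BLOCKED_AGENTS.search(user_agent) or BLOCKED_PATHS.search(path)
    if suspicious and not ALLOWED_PATHS.search(path):
        return 402
    return 200
```

For instance, `status_for("Googlebot/2.1", "/photos/")` comes back 402, while the same bot asking for `/robots.txt` still gets through, which is exactly what the exception conditions are for.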
This isn’t foolproof. It’s very easy for a web spider to pretend to be Chrome, Firefox, Edge or one of a number of other legitimate browsers. But it’s another layer of protection.
Oh, and about that 402 status code. This is a cousin to the better-known 404 error. But while a 404 error means “Page Not Found,” a 402 error means “Payment Required.” If Google or OpenAI or whoever really wants a photo of my grandparents’ house, I’m willing to share. But not for free.
(Cover photo of a dog holding a “Do Not Enter” sign generated with Bing Create.)