Something Evil About Robots.txt I Didn’t Know

Quick background: A robots.txt file on your website will tell search engines and other bots that obey the robot exclusion standard what files and folders they can and can’t index, or whether they can access the website at all.

I’ve been working on the robots.txt file at work for the last few days.* Once the file listed the bots I wanted to exclude, I decided to run it through a robots.txt validator.

Boy, did I learn a few things. It turns out that you should put robot (user-agent) exclusions at the top, with directory and file exclusions below them. There were also a few minor formatting issues that I’m not sure really mattered.

There was one, however, that was a shock. Let’s say you’ve got a folder called “video”. There’s a huge difference between these two disallow statements:

Disallow: /video/
Disallow: /video

The first example with a trailing slash tells robots not to index anything in the video directory. So far so good. The second example without a trailing slash tells robots not to index anything in the video directory, or any file at the root level with video at the beginning of the filename.

Without the trailing slash, you would exclude /video.html, /videoplayer.aspx – you name it. Anything at the same level of the directory structure that begins with video. You can get into trouble in a hurry if you leave the trailing slash off of the disallow directive.
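Here is a minimal sketch of that difference using Python's standard-library robots.txt parser. The site paths are made up for illustration, and blocked_paths is a hypothetical helper name, not part of any library. Other parsers and crawlers may differ in details, so treat this as a quick sanity check rather than a guarantee of how every bot behaves.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(disallow_rule, paths):
    """Return the paths a wildcard user agent may NOT fetch under one Disallow rule."""
    parser = RobotFileParser()
    # Build a two-line robots.txt in memory instead of fetching one.
    parser.parse(["User-agent: *", f"Disallow: {disallow_rule}"])
    return [p for p in paths if not parser.can_fetch("*", p)]


# Hypothetical site paths, chosen only for illustration.
PATHS = [
    "/video/clip.html",
    "/video.html",
    "/videoplayer.aspx",
    "/videofiles/a.mp4",
    "/about.html",
]

with_slash = blocked_paths("/video/", PATHS)
without_slash = blocked_paths("/video", PATHS)

print("Disallow: /video/ blocks:", with_slash)
print("Disallow: /video  blocks:", without_slash)
```

With the trailing slash, only the file inside the video directory is blocked. Without it, everything whose path starts with /video is blocked, and only /about.html stays crawlable.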

* What prompted the work was all of the bots that kept showing up in our error files. One of the worst? The Internet Archive Bot that collects pages for the Internet Archive. It would generate hundreds of errors a day. When I looked around at bot ban lists, the IA bot showed up over and over. You’d think Internet Archive would have worked the bugs out of their bot by now.


2 Responses to Something Evil About Robots.txt I Didn’t Know

  1. Hue says:

    You’re wrong. If it worked the way you describe it, it would indeed be extremely dangerous. See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1

  2. Les Jones says:

    About 80% of the way to the bottom of that page there’s a table titled “Example path matches.”

    The examples they give jibe with what I said above:

    PATH:
    /fish
    MATCHES:
    /fish
    /fish.html
    /fish/salmon.html
    /fishheads
    /fishheads/yummy.html
    /fish.php?id=anything
    DOES NOT MATCH:
    /Fish.asp
    /catfish
    /?id=fish

    So I was wrong about one thing. When I said “Without the trailing slash, you would exclude /video.html, viewvideo.html, newvideoviewer.html – you name it. Anything at the same level of the directory structure with video in the name.” In fact, in the video example, the word video would have to be at the start of the filename.

    Other than that, it’s correct. I’ve changed that part. Thanks.

    On the flip side, leaving off the trailing slash creates a problem I didn’t realize. If you’re blocking /video, you would also block /videofiles/. So it creates problems with directories, not just files.
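The rule behind that table can be written out in a few lines. This is a simplified sketch that assumes a rule with no wildcards is a plain, case-sensitive "starts with" test on the path. rule_matches is a hypothetical helper name, and real crawlers also handle wildcards and Allow lines, which this ignores.

```python
def rule_matches(rule_path, url_path):
    """Simplified check: a wildcard-free rule matches any path that starts with it.

    The comparison is case-sensitive, so /Fish.asp does not match /fish.
    """
    return url_path.startswith(rule_path)


# Paths taken from the "Example path matches" table quoted above.
SHOULD_MATCH = [
    "/fish",
    "/fish.html",
    "/fish/salmon.html",
    "/fishheads",
    "/fishheads/yummy.html",
    "/fish.php?id=anything",
]
SHOULD_NOT_MATCH = ["/Fish.asp", "/catfish", "/?id=fish"]

matched = [p for p in SHOULD_MATCH if rule_matches("/fish", p)]
wrongly_matched = [p for p in SHOULD_NOT_MATCH if rule_matches("/fish", p)]

print("matched:", matched)
print("wrongly matched:", wrongly_matched)

# The directory problem: /video also catches /videofiles/, while /video/ does not.
print(rule_matches("/video", "/videofiles/"))
print(rule_matches("/video/", "/videofiles/"))
```

Every path in the MATCHES column passes the prefix test and none in the DOES NOT MATCH column does, which is the same behavior that makes /video catch /videofiles/.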