I have a bunch of files at www.example.com/A/B/C/NAME (A,B,C change around, NAME is static) and I basically want to add a command in robots.txt so crawlers don't follow any such links that have NAME at the end.
What's the best command to use in robots.txt for this?
-
I see you cross-posted this on Stack Overflow, but I'll put my answer here as well.
You cannot glob in the Disallow line unfortunately, so no wildcards. You would need to have a disallow line for each directory you want to exclude.
User-agent: *
Disallow: /A/B/C/NAME/
Disallow: /D/E/F/NAME/
It's unfortunate, but the standard is very simplistic and this is how it needs to be done. Also note that you must have the trailing / in your disallow line. Here is a fairly good reference for using robots.txt.
From palehorse -
To my knowledge there is no pattern matching routine supported by the robots.txt file parsers. In this case you would need to list each of those files with their own Disallow statement.
Keep in mind that listing those files in the robots.txt file will give out a list of those links to anyone who might want to see what you're trying to "hide" from the crawlers, so there may be a security issue if this is sensitive material.
If these links are in HTML served up by your server, you can also add a
rel="nofollow"
to the A tags for those links, and it will prevent most crawlers from following them.
From Justin Scott -
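For illustration of the rel="nofollow" suggestion above, such a link would look something like this (the href is just the hypothetical path from the question):

<a href="/A/B/C/NAME" rel="nofollow">link text</a>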
As previously mentioned, the robots.txt spec is pretty simple. However, one thing that I've done is create a dynamic script (PHP, Python, whatever) that's simply named "robots.txt" and have it generate the expected, simple structure using the script's more intelligent logic. You can walk subdirectories, use regular expressions, etc.
You might have to tweak your web server a bit so it executes "robots.txt" as a script rather than just serving up the file contents. Alternatively, you can have a script run via a cron job that regenerates your robots.txt once a night (or however often it needs updating).
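A minimal sketch of that idea in Python, assuming a hypothetical document root at /var/www/example.com and that the static final path segment is literally NAME (both are placeholders to adjust for your site):

import os

DOCROOT = "/var/www/example.com"  # hypothetical document root
TARGET = "NAME"                   # the static final path segment to block

def build_robots(docroot, target):
    # Start with a wildcard user-agent section, then add one Disallow
    # line per matching path, as the simple robots.txt format requires.
    lines = ["User-agent: *"]
    for dirpath, dirnames, filenames in os.walk(docroot):
        for entry in dirnames:
            if entry == target:
                rel = os.path.relpath(os.path.join(dirpath, entry), docroot)
                lines.append("Disallow: /" + rel.replace(os.sep, "/") + "/")
        for entry in filenames:
            if entry == target:
                rel = os.path.relpath(os.path.join(dirpath, entry), docroot)
                lines.append("Disallow: /" + rel.replace(os.sep, "/"))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with open(os.path.join(DOCROOT, "robots.txt"), "w") as f:
        f.write(build_robots(DOCROOT, TARGET))

Run it from a cron job as suggested above, or configure your web server to execute it on request; either way the crawlers only ever see the plain Disallow lines it emits.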
-
User-agent: googlebot
Disallow: /*NAME

User-agent: slurp
Disallow: /*NAME
palehorse: Globbing is not allowed in the file, so the * does not work. The only reason it works for User-agent is that the User-agent line is handled differently.
-
It cannot be done. There is no official standard for robots.txt; it's really just a convention that various web crawlers try to respect and correctly interpret. However, Googlebot supports wildcards, so you could have a section like this:
User-agent: Googlebot
Disallow: /*NAME
Since most web crawlers won't interpret wildcards correctly (and who knows how they might interpret them), it's probably safest to isolate this rule to Googlebot. That said, I would assume that by now every large search engine supports it as well, since whatever Google does in search becomes the de facto standard.
From lubos hasko -
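Pulling the approaches in this thread together, a robots.txt along these lines would serve the wildcard rule to Googlebot and fall back to explicitly listed paths for everyone else (the /A/B/C/ and /D/E/F/ paths are just the placeholders from the earlier example):

User-agent: Googlebot
Disallow: /*NAME

User-agent: *
Disallow: /A/B/C/NAME/
Disallow: /D/E/F/NAME/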
Best documentation I've seen for this is at robotstxt.org.
From Epsilon Prime