Introduction: the Robots Exclusion Protocol
Twenty years ago, the first specifications of the Robots Exclusion Protocol - at the time known as the Robots Exclusion Standard - were written.
They consisted of a basic description of the /robots.txt file and some suggestions on how further information could be added in meta tags. Two decades later, let's see what went right, what went wrong, and how it could be fixed.
15 little known facts about robots.txt
- Its first specification was written by Martijn Koster in 1994, after the web site he administered was involuntarily DoSed by a badly written bot.
- The first robots.txt-compliant crawler - CharlieSpider/0.3 - was written by Charles Stross, today a popular novelist, who is also the person who caused that first verified unintentional DoS attack.
- Around the same year, the very same Koster also proposed an alternative solution (a '/site.idx' file in IAFA format) to address problems beyond crawlability, more geared toward indexing. With hindsight, that format arguably laid the foundations of what became the XML Sitemaps protocol.
- Koster was already quite well known in the niche of robot builders, and cared enough about the dangers of bot misuse to write the Guidelines for Robot Writers.
- Even though it was subsequently written up as a formal RFC draft, the robots.txt syntax has never been ratified by any internationally recognized standards organization. It is a de-facto standard.
- In a document that followed the original specs, a hypothetical "DontIndex" directive was mentioned, to be used both within a meta tag (sounds familiar?) and as a robots.txt directive.
- Google actually implemented a (never officially supported) Noindex: directive, probably inspired by that document.
- The original robots.txt specifications leave several gaps, and crawler makers have had to make do. The most thorough publicly available proprietary specification so far is Google's.
- Google's implementation for deciding whether to permit access to a resource when Disallow and Allow directives collide is a little naïve: instead of following the specs, it is length-based - the longest matching rule wins. When * wildcards are used, the outcome can differ depending on the length of the file or directory names.
- Google is one of the few major search engines that does not support the Crawl-delay directive, and - leveraging its dominant position - it forces webmasters who want some control over Googlebot's crawl frequency to sign up for and use Google Webmaster Tools. Granted, Googlebot normally behaves as a low-frequency crawler on its own.
- While several extensions have become generally accepted as part of the de-facto standard (Allow, the * and $ wildcards in path matching, Sitemap, Crawl-delay), there are other interesting extensions proposed, championed and supported by Yandex: the Host and Clean-param directives.
- When robots.txt was first proposed, Unicode as we know it today simply didn't exist. Today the recommended file encoding is UTF-8.
- There are file size limits imposed by proprietary implementations. While Google's 500 KB limit is probably fair, Yandex's 32 KB seems a little too draconian - especially because, according to the Yandex specs, any robots.txt file exceeding the limit is to be treated as an "allow everything"!
- Going a little beyond what the original specifications say, Google tolerates redirects for the /robots.txt file; furthermore, it treats as "full allow" not only an HTTP 404 "Not Found" response code, but also the 401 "Unauthorized" and 403 "Forbidden" response codes (the opposite of the RFC draft's recommendation).
- The danger of badly programmed robots is also taken seriously by Google's founders, who try to protect themselves against Terminator robots.
- UPDATE - Bonus fact: two other interesting extensions proposed in the past - Visit-time and Request-rate - are both effectively supported by Seznam through its Request-rate extension.
While Request-rate's goal partially overlaps with Crawl-delay's, the Visit-time part enables webmasters to indicate a preferred time range for spiders to crawl the site.
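Pulling the extensions above together, a robots.txt could look like the following. The exact argument syntax is my reading of the vendors' documentation, so treat it as illustrative rather than authoritative:

```text
User-agent: *
Disallow: /private/
Crawl-delay: 5

# Yandex extensions: preferred mirror and URL parameters to ignore
Host: www.example.com
Clean-param: sessionid /catalog/

# Seznam's Request-rate, which also covers the old Visit-time idea
User-agent: SeznamBot
Request-rate: 10/1m             # at most 10 documents per minute
Request-rate: 1/10s 0600-0845   # slower during this time window
```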
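Google's length-based precedence mentioned above can be sketched in a few lines. This is a deliberate simplification (my own hypothetical helper, ignoring wildcard expansion), not Google's actual code:

```python
def is_allowed(path, rules):
    """Decide access for `path` given (directive, pattern) pairs.

    Google-style precedence: the longest matching pattern wins;
    on a tie, Allow beats Disallow. No matching rule means allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if path.startswith(pattern):
            longer = len(pattern) > best_len
            tie_allow = len(pattern) == best_len and directive == "Allow"
            if longer or tie_allow:
                best_len, allowed = len(pattern), (directive == "Allow")
    return allowed

rules = [("Disallow", "/folder/"), ("Allow", "/folder/page")]
print(is_allowed("/folder/page.html", rules))   # True: the Allow pattern is longer
print(is_allowed("/folder/other.html", rules))  # False: only Disallow matches
```

Note how the outcome flips purely on pattern length, which is exactly why wildcarded rules can behave differently depending on how long the file or directory names happen to be.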
Noindex: an extension that should be supported
As previously mentioned, Google already experimentally implemented support for the Noindex: directive.
I'm not sure why the feature was never pushed further; I find it potentially very helpful.
Today, to de-index a resource - or prevent it from being indexed - you have to expose it to crawlers so it can be downloaded with an HTTP request, allowing the search engine to see a noindex in the robots meta tag or in the X-Robots-Tag HTTP header, or a 404/410 HTTP status code.
Crawling is a very inefficient way to achieve such a goal, and a big waste of resources for the web server, the search engine, and the network in general.
De-indexing in particular can take a long time this way. Google permits webmasters to manually de-index individual URLs with GWT if they are blocked by robots.txt, but de-indexing thousands of URLs that way is not a viable solution.
Furthermore, any solution based on one search engine's Webmaster Tools would work for that search engine only.
Using a Noindex: directive would solve these problems more efficiently, without the crawler being involved at all.
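Had the directive been standardized, removing a whole section from the index without a single crawl could have been as simple as this (a hypothetical example, modeled on Google's experimental syntax):

```text
User-agent: *
Disallow: /old-catalog/
Noindex: /old-catalog/
```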
UPDATE: Alec Bertram proved Google unofficially honours a Nosnippet: directive, and maybe Noarchive: as well.
UPDATE: Google announced it plans to retire all code that handles unsupported and unpublished rules (such as Noindex) on September 1, 2019.
Case-sensitive: my personal extension proposal
It's not just a pet peeve of mine; there IS an unsolved problem: path directives in robots.txt are interpreted as case-sensitive, but case-insensitive web servers exist and are widespread.
In my opinion there's no point in objecting that the HTTP specifications say URLs are case-sensitive (strictly speaking, they don't): IIS servers are widely adopted - 13.7% of the market share, according to w3techs.com (July 2014 stats) -
...and Microsoft IIS web servers sit on a case-insensitive file system.
This leads to potential problems with the robots.txt file - where paths are supposed to follow the specs and be case-sensitive - as outlined by Enrico Altavilla with his /cat/ example:
in order to block access to a resource, we would have to block every upper/lower-case permutation of its path.
Altavilla disallows cats
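To see how quickly that blows up, here is a tiny script (my own illustration, not Altavilla's code) generating all the Disallow lines needed to fully cover /cat/ on a case-insensitive server:

```python
from itertools import product

def case_permutations(path):
    """Yield every upper/lower-case variant of a path segment."""
    options = [{ch.lower(), ch.upper()} for ch in path]
    for combo in product(*options):
        yield "".join(combo)

lines = ["Disallow: /%s/" % p for p in case_permutations("cat")]
print(len(lines))  # 2^3 = 8 variants for a three-letter name
```

Eight lines for a three-letter directory; a ten-letter one would need 1024. Enumerating permutations clearly does not scale.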
In my humble opinion, search engines should already go the extra mile and base case-sensitivity decisions on the web server type (as read from the Server HTTP header, when available).
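Such a heuristic could be sketched as follows (entirely hypothetical; `server_header` would come from a previous response for the same host, and real crawlers would need a more robust product list):

```python
def paths_case_insensitive(server_header):
    """Guess whether a host serves paths case-insensitively,
    based on the Server response header. Defaults to the spec's
    case-sensitive behavior when there is no usable hint."""
    if not server_header:
        return False
    return server_header.lower().startswith("microsoft-iis")

print(paths_case_insensitive("Microsoft-IIS/8.5"))  # True
print(paths_case_insensitive("nginx/1.6.0"))        # False
```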
Not that I'm inviting anyone to do such a thing, but... external links exist too: it would be easy to try to cause duplicate-content issues by creating links to IIS-hosted pages that lack a canonical tag.
My proposal is to let webmasters tell search engines how to interpret their file system:
Case-sensitive: [true | false] # default value is true
where the default value, true, means a case-sensitive file system, preserving current behavior.
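With the proposed directive, Altavilla's /cat/ example shrinks back to a single rule (hypothetical syntax, of course):

```text
User-agent: *
Case-sensitive: false
Disallow: /cat/    # would now also cover /Cat/, /CAT/, /cAt/, ...
```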
Note: reading the comments on Enrico Altavilla's post, you might see that Enrico - whom I hold in high esteem - and I had a minor disagreement on whether a search engine should have to accept that a web server can be case-insensitive (he also pointed out that Windows can be tweaked to use a case-sensitive file system). I don't know what his opinion of my extension proposal will be.
"Should" here is meant in the RFC 2119 sense: "there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course".
...I realize I'm playing devil's advocate here; I think MS should have done things differently, but Google could at least try to sort it out.
Robots.txt in Court
While in the Western world the robots.txt file has shown up in courts only occasionally, many might be surprised to discover that it's in China - a country where competition in the search world is uniquely ruthless - that the first big-money case involving a robots.txt file was brought to court.
Search engine giant Baidu filed a lawsuit on October 16, 2012 against its competitor Qihoo 360 for mining Baidu's data despite the block in its robots.txt file. The latest accounts report a compensation claim of 100 million yuan (about €11.86 million), with the two parties unable to reach a settlement.
What is in my opinion missing today for the REP to stand up in court is its ratification as an official standard - by the IETF, W3C, ISO... - which legislators would require in order to regulate it (that has apparently also been Qihoo's line of defense: that robots.txt is merely industry practice and not regulated by law).
My guess is that we will increasingly see cases like this in the future, including on a much smaller scale. Courts and laws are notoriously behind technology, but robots.txt has been around for two decades now, and some judges might catch up.
In the meantime, I strongly suggest that web bot builders get ready in advance with full support for the Robots Exclusion Protocol and all its known extensions!
Conclusions: Time to make it a real standard
It's high time the major players formed a committee, wrote an updated RFC, and submitted a serious specification to apply for standardization.