Google's announcements about robots.txt
The Good, the Bad and the Ugly
Conclusions
The news: Google's announcements about robots.txt
In 2014 we wrote an article - The robots.txt File, Twenty Years Later - which ended with the sentence: "Time to make it a real standard - It's high time the major players formed a committee, wrote an updated RFC, and submitted a serious specification to apply for standardization".
Five years later, that wish has come true!
Google has announced that they partnered with Martijn Koster - the original inventor of the protocol - and other search engines to write an official RFC (Request for Comments) draft, aiming to transform what is today a de-facto standard into a ratified, official one.
The news does not end here:
Google also announced that they have open-sourced the code of their robots.txt parser.
A detail from the robots.txt file on visual-seo.com
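Google's parser itself is a C++ library; for readers who just want to see what this kind of check looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser instead - not Google's code, and with a made-up user agent and URLs:

```python
# Minimal illustration of what a robots.txt matcher does before a crawler
# fetches a URL. This uses Python's standard library, NOT Google's
# open-sourced C++ parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ("https://example.com/private/report.pdf",
            "https://example.com/private/press-kit.html",
            "https://example.com/public/"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```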
The Good, the Bad and the Ugly
The change is mostly a good thing. Please read our old article to learn all about robots.txt and its shortcomings.
The Good
- A standardized protocol with a formal specification means that search engines no longer have to sort things out on their own where the specification lacks details: they are expected to point out where it is poor and have it extended.
The current RFC draft covers points that were originally left to interpretation: what to do with unexpected HTTP status codes, how to handle redirects, when relying on cached results is justified, what size limit to apply, how to deal with Unicode characters, the BOM, and so on (some of these details are illustrated in a sketch after this list).
- Legal claims about robots.txt infringements can now stand on slightly more solid ground (well, we did say "slightly").
- It's an RFC, and despite what some say, everyone should be able to make proposals on it. We will try to push our own proposal about the case sensitivity of directive paths (i.e. keep them case sensitive by default, but allow declaring them case insensitive) to fix the mess created by IIS and other case-insensitive web servers; a second sketch after this list shows why this matters.
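As an example of the kind of minutiae the draft pins down, here is a minimal sketch - our own simplification, not Google's parser, and the 500 KiB figure is the limit we recall from the announcement, so check the current draft before relying on it - showing a parser tolerating a UTF-8 BOM and capping the amount of robots.txt it accepts:

```python
# Hedged sketch: two small details a compliant parser has to get right.
# MAX_ROBOTS_BYTES is illustrative; the draft talks about a limit of
# roughly 500 kibibytes.
MAX_ROBOTS_BYTES = 500 * 1024

def prepare_robots_txt(raw: bytes) -> list[str]:
    # Only the first MAX_ROBOTS_BYTES need to be parsed.
    raw = raw[:MAX_ROBOTS_BYTES]
    # A leading UTF-8 byte-order mark must not break the first directive.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    # Undecodable bytes should not make the whole file unusable.
    text = raw.decode("utf-8", errors="replace")
    return text.splitlines()

print(prepare_robots_txt(b"\xef\xbb\xbfUser-agent: *\nDisallow: /tmp/\n"))
# ['User-agent: *', 'Disallow: /tmp/']
```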
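And to show why the case-sensitivity proposal matters: on IIS and other case-insensitive servers, /photos/ and /Photos/ are the same resource, yet only one spelling is blocked by a case-sensitive Disallow. The helper below (path_matches is a made-up function, not part of any spec) sketches the opt-in behaviour we would like to see:

```python
# Toy prefix matching; real matchers also handle the * and $ wildcards.
def path_matches(rule_path: str, url_path: str, case_sensitive: bool = True) -> bool:
    if not case_sensitive:
        rule_path, url_path = rule_path.lower(), url_path.lower()
    return url_path.startswith(rule_path)

rule = "/photos/"  # e.g. Disallow: /photos/
for path in ("/photos/summer.jpg", "/Photos/summer.jpg"):
    print(path,
          "| default (case sensitive):", path_matches(rule, path),
          "| proposed opt-in (case insensitive):", path_matches(rule, path, case_sensitive=False))
```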
The Bad
- While praiseworthy, Google's move also seems designed to impose their own interpretation of the gaps in the original robots.txt specification.
For example, the way redirects are to be handled is clearly Google's way of dealing with them and, in my humble opinion, very subjective. My guess is that these parts were written to match Google's implementation exactly, rather than for standardization's sake.
- The new RFC states that any HTTP 4xx status code implies that a crawler is free to access any resource on the web server. This is exactly today's Googlebot behaviour, which we already contested in the past: it necessarily also covers the HTTP 401 "Unauthorized" and HTTP 403 "Forbidden" response codes, which are clearly meant to say the opposite! (Our reading of this rule is sketched after this list.)
- There is almost nothing in the current RFC describing the robots.txt extensions that came after the initial specification.
The only ones covered are the Allow directive and the $ and * wildcards. For the rest, there is a reference to the "Sitemap" directive, explaining that the format is extensible so that implementers can decide to support non-standard directives.
Of course, you will not find any reference to the "Crawl-Delay" directive, which Google notoriously does not support.
- Google is planning to drop the Noindex directive which, albeit non-standard and undocumented, was genuinely useful, saving tonnes of HTTP requests when de-indexing pages (a small illustration also follows this list). As we already said, the Noindex directive should be pushed into the new standard.
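To spell out the status-code rule we are contesting, here is a hedged sketch of our reading of the draft (access_policy is just an illustrative helper, not anyone's API):

```python
# Our reading of the draft's fetch-time behaviour:
#  - any 4xx answer, including 401 and 403, is treated as "no robots.txt
#    exists", i.e. the whole site may be crawled;
#  - 5xx answers mean robots.txt is unreachable, i.e. assume full disallow;
#  - 2xx answers are parsed normally.
def access_policy(status_code: int) -> str:
    if 400 <= status_code < 500:   # includes 401 Unauthorized and 403 Forbidden
        return "allow everything"
    if 500 <= status_code < 600:   # robots.txt unreachable
        return "disallow everything"
    return "parse the body normally"

for code in (200, 401, 403, 404, 503):
    print(code, "->", access_policy(code))
```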
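And for readers who never used it, this is roughly what the doomed directive looked like (illustration only; it was never documented):

```python
# The unofficial directive Google is dropping: one robots.txt line
# could de-index a whole section of a site.
robots_with_noindex = """\
User-agent: *
Noindex: /old-campaign/
"""
# Without it, every page under /old-campaign/ has to stay crawlable and
# be re-fetched so that the crawler can see a <meta name="robots"
# content="noindex"> tag or an X-Robots-Tag header on each URL.
```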
The Ugly
- The specification imposes Google's way of dealing with conflicting Allow/Disallow directives: the "longest match" rule. It has a serious shortcoming when the * wildcard is in play, or when parts of the paths being compared are percent-encoded: in those cases the length in characters becomes, in practice, arbitrary (see the sketch below).
Using a rule based on the order of the directives, as other search engines do, would have resolved every ambiguity.
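To make the point concrete, here is a toy version of the longest-match tie-break (matches and verdict are made-up helpers, not Google's matcher): a * counts as a single character of the pattern no matter how much of the URL it actually matches, and the very same path has different lengths depending on how it is percent-encoded.

```python
import fnmatch

def matches(pattern: str, path: str) -> bool:
    # Crude stand-in for robots.txt matching: '*' matches any run of
    # characters, and a pattern without '$' is an implicit prefix match.
    return fnmatch.fnmatchcase(path, pattern + "*")

def verdict(path: str, allow: str, disallow: str) -> str:
    candidates = []
    if matches(allow, path):
        candidates.append((len(allow), "allow"))
    if matches(disallow, path):
        candidates.append((len(disallow), "disallow"))
    # Longest pattern wins; this toy prefers Allow on a tie.
    return max(candidates, key=lambda c: (c[0], c[1] == "allow"))[1] if candidates else "allow"

# '/*.html' matches 22 characters of the URL but only counts as 7,
# so the Disallow rule wins:
print(verdict("/downloads/manual.html", allow="/*.html", disallow="/downloads/"))
# The same path, encoded two ways, has two different "lengths":
print(len("/café/"), len("/caf%C3%A9/"))  # 6 vs 11
```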
Conclusions
Google declared that the RFC draft was written in collaboration with other search engines. Three days after the announcement, we have found no mention of such a collaboration on other search engines' blogs. The fact that the draft reflects Google's implementation so closely makes us wonder about the actual level of involvement of Bing, Yandex, Baidu, and the rest of the crowd.
An RFC is a Request for Comments: all stakeholders and interested parties should participate to make it shine.
The die has been cast; now let's make it roll!