Introduction: URLs Past and Present
Once upon a time, a Uniform Resource Locator was just a way to point to a resource on the web.
English was the language of the IT world, web server paths mapped old-style file systems, and ASCII characters were more than enough to represent URLs.
When the Web crossed US borders, people had to figure out how to write URLs in their own languages.
In the majority of European countries, people could make do with a convention still in use today: "flattening" their language by giving up all accents and diacritics. That works reasonably well for languages written in the Latin alphabet, like Italian or Spanish.
Some languages – like Russian, based on the Cyrillic alphabet – had to resort to transliteration into the Latin alphabet, also called Romanization.
Other languages were even less fortunate: they simply cannot be represented with Latin characters at all.
Their users had to give up writing in their own language, or find a way to fit it into URLs.
How could non-ASCII characters fit into the original URL specification?
Extending HTTP and the related protocols to support extended character sets (or, later, Unicode) was out of the question: it would not have guaranteed a smooth transition.
People had to resort to some kind of encoding, and Percent Encoding is the one that succeeded.
Percent encoding kind of worked:
it complied with the URL specs, so it traveled safely over the HTTP protocol. There was little else to say in its favor, though, and the cons started:
- It was not human readable, and readability was the whole reason for writing URLs in one's own language in the first place!
  Sure, browsers were able to accept Unicode URL paths and automatically encode them before performing the HTTP request (i.e. before requesting a web page), but the URL shown in the browser address bar was unreadable.
- Search engines started using the URL path as a ranking factor. User-friendly URLs started to also be SEO-friendly URLs, but that, at least initially, worked for Western languages only.
For example, the following two URLs offer quite a different user experience:
Unicode path, Human-Readable and URL-Encoded
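To see the mechanism at work, here is a minimal sketch in Python, using the standard urllib.parse module to mimic what browsers do behind the scenes; the Cyrillic path is a made-up example:

```python
from urllib.parse import quote, unquote

human_readable = "/экскурсии/москва"   # what the user sees and types
encoded = quote(human_readable)        # what actually travels over HTTP

print(encoded)
# /%D1%8D%D0%BA%D1%81%D0%BA%D1%83%D1%80%D1%81%D0%B8%D0%B8/%D0%BC%D0%BE%D1%81%D0%BA%D0%B2%D0%B0

print(unquote(encoded) == human_readable)   # True: the round trip is lossless
```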
To complicate matters further, Internationalized Domain Names arrived, and now even the domain name part of a URL can contain Unicode characters.
How to fit them into the URL RFC specs? Percent Encoding was of no use here, as the percent symbol is not allowed by the DNS specification, so a new encoding – Punycode – was invented to express Unicode names as valid ASCII domain names.
Again, it was up to the browser to handle the conversion transparently.
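A minimal sketch of that conversion, again in Python, whose built-in idna codec applies the Punycode/IDNA mapping; the domain here is purely illustrative:

```python
domain = "bücher.example"   # illustrative domain, not a real site

# Encoding produces the ASCII-compatible "xn--" form sent to DNS...
ascii_form = domain.encode("idna")
print(ascii_form)                  # b'xn--bcher-kva.example'

# ...and decoding restores the Unicode form browsers should display.
print(ascii_form.decode("idna"))   # bücher.example
```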
As far as those working in Search Engine Optimization are concerned, there are three types of tools involved:
- Search Engines have to recognize Unicode paths, extract keyword-related information from them, show them in a human-readable format when listing them in the SERPs, and let users search for them.
- Browsers have to accept a Unicode URL, encode it correctly before communicating with web servers, and show it in the address bar in a human-readable format.
- SEO Tools have to handle Unicode URLs and show them in a human-readable format, so that SEO professionals working with non-Western languages can tell which URLs they are working on.
I will consider two categories of SEO tools: backlink intelligence providers and off-line SEO spiders.
As a test case, I used a web site I have no affiliation with, chosen among many for the sole reason that it both has an IDN domain and uses percent-encoded paths. Its URLs are made entirely of Unicode characters.
экскурсии.рф is a tourist excursions site
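As a reference for the tests that follow, here is how the two faces of such a URL relate; the path is hypothetical, but the host conversion is exactly what the tools below either perform or fail to display:

```python
from urllib.parse import quote

host = "экскурсии.рф"
path = "/москва"   # hypothetical path, for illustration only

# The human-readable URL vs. the form actually seen on the wire:
unicode_url = "http://" + host + path
wire_url = "http://" + host.encode("idna").decode("ascii") + quote(path)

print(unicode_url)   # http://экскурсии.рф/москва
print(wire_url)      # http://xn--h1aaea4aeco0g.xn--p1ai/%D0%BC%D0%BE%D1%81%D0%BA%D0%B2%D0%B0
```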
Search Engines and Unicode URLs
Besides the obvious Google and Bing, I added Yandex to the contenders.
I know there are several other search engines out there – Baidu the most prominent, but also Seznam, Naver... I chose Yandex for two reasons:
it operates in Russia, home of the most used IDN top-level domain (.рф), and it is expanding, so I believe we in Western countries will increasingly deal with it in the near future.
Unicode URLs and Search Engines
As expected, all Search Engines show URLs decoded, in a human-readable format.
Unfortunately, Bing failed to retrieve the site's pages when searching for the domain name with the site: operator. It only worked with the encoded version, something normal users would never type.
For a moment I thought it worked when searching for the domain name without the site: operator, but then I realized it only appeared to, because the domain name was present in every page title.
Winners: Google, Yandex
Losers: Bing
Disclaimer: this comparison only takes into account support of Unicode URLs, nothing about the relevance, number and quality of results retrieved, or any other services the three search engines provide.
Browsers and Unicode URLs
Nowadays all major browsers handle percent-encoded paths smoothly and display them in a form users can read.
Unfortunately, IDN support is still not universal. While both are able to encode them automatically, the two most widespread browsers – Chrome and IE – disappointingly fail to show IDN domains in a human-readable format.
Worth noting, Yandex.Browser goes a step further in usability:
not only does it correctly represent both IDN domains and percent-encoded paths, but as soon as a page title is available, it uses it to replace the path in the address bar. I find it a good choice: paths are often not user-friendly, and long titles do not fit in the limited width of the tabs.
Winners: Yandex.Browser (with a special mention), Firefox, Safari
Losers: Chrome, IE
Disclaimer: this comparison only takes into account support of Unicode URLs, nothing about the support of other standards, the rendering speed, or any other feature the five browsers provide.
SEO Tools: backlink intelligence providers
The contenders here are the usual suspects: Open Site Explorer, MajesticSEO and aHrefs. They all offer free accounts, which were sufficient to evaluate their support for Unicode URLs.
All three of them accepted an IDN domain and were able to retrieve backlinks pointing to it, but the way results are presented differs.
Open Site Explorer was the first I tried, and it seems to be the worst:
Open Site Explorer and Unicode URLs
Not only does it show the URLs fully encoded in both the domain and path parts, but... does it mistake the encoding for a redirect to a URL it cannot represent?
MajesticSEO is also unable to show the site's URLs in a decoded, human-readable form:
MajesticSEO and Unicode URLs
aHrefs fully supports both IDN domains and percent-encoded paths: a clear winner for those who have to assess the link profile of non-Western sites.
aHrefs and Unicode URLs
Winner: aHrefs
Losers: MajesticSEO, Open Site Explorer
Disclaimer: this comparison only takes into account support of Unicode URLs, nothing about the number and quality of back links retrieved or any other services the three tools provide.
SEO Tools: off-line SEO Spiders
There are many kinds of SEO tools, but in this instance I will only take into account off-line SEO spiders, the kind SEO professionals use to perform SEO audits.
Again, I’m using freely available tools: Xenu's Link Sleuth, Screaming Frog SEO Spider and Visual SEO Studio (disclaimer: for those who still don't know it, I'm the author of the latter).
The first tested tool, Xenu rel. 1.3.8, cannot even crawl an IDN web site:
Xenu with an IDN site in input
It accepts the input value, but...
Xenu output for an IDN address
You have to use the puny-coded URL version in order to crawl it:
Xenu with a punycode domain in input
The crawl result is not that bad, considering the program dates back to 2010. IDN names are of course not handled (the .рф domain itself only went live in May 2010), but paths are shown URL-decoded, so they are human readable:
Xenu output for a punycoded site address
The second player is Screaming Frog, rel. 2.20.
It accepts an IDN domain name in input:
Screaming Frog with an IDN site in input
and encodes it before crawling. Weirdly, its developers didn't take the extra step of showing it decoded:
Screaming Frog output for an IDN site
The other player, Visual SEO Studio 0.8.7, is the clear winner.
Visual SEO Studio fully supports IDN sites and Unicode paths
No surprise here: Visual SEO Studio was conceived, planned and engineered to be fully international. Not only can it crawl and show IDN names and display percent-encoded paths in human-readable form, its support for Unicode URLs is thorough: they are shown in all views and reporting tools, and the percent-encoded versions are also available in the page properties and in every grid view, in a dedicated – hidden by default – column.
Visual SEO Studio introduced human-readable URL paths with rel. 0.8.5 on 6 April 2013, and Internationalized Domain Names support with rel. 0.8.6 on 1 May 2013.
Back then, Screaming Frog was not able to crawl IDN sites, and showed unreadable percent-encoded paths. Today it has caught up on encoded paths, and while it still does not show user-friendly IDN domains, it can crawl them without problems. In practice it has already overtaken Xenu; my guess is it will not take long before it shows IDN sites in a human-readable format too.
This is good news for users in non-Western countries, who can choose the best tool for their needs.
Winner: Visual SEO Studio
Losers: Xenu, Screaming Frog (a little better than Xenu)
Disclaimer: this comparison only takes into account support of Unicode URLs, nothing about the quality or features offered by the three SEO Spiders.
Conclusions: Unicode URLs, Present and Future
I will never tire of repeating it: URLs are for humans too, and Unicode in URLs is here to stay.
Web users do not care about technical issues, nor about search engine indexing issues. They expect everything to work out of the box: to use their own language and have the tools deal with it. URLs included.
Search Engines and Browsers are evolving to satisfy these expectations. SEO tools are a little behind, but they are catching up.
I hope this article will be a spur to accelerate toward complete support of Unicode URLs by all the tools involved, making the web even more accessible.
I think today's so-called "best practices" for SEO-friendly URLs will have to be revised soon.
My guess is that we will stop avoiding accents and diacritics in Latin-language URLs, and will increasingly get used to working with Unicode URLs.
I still expect avoiding blank spaces to remain a best practice, though: URLs are often copied as plain text into e-mails, where we rely on the e-mail client to turn them into links; a space in a non-percent-encoded URL would prevent the program from generating the link correctly, as the sketch below shows.
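Here is a quick illustration in Python; the file name is made up, and the auto-linking behavior described in the comments is the typical one, not a specification:

```python
from urllib.parse import quote

path = "/my summer photos.html"   # made-up path containing spaces

# Pasted as-is into an e-mail, a typical auto-linker stops at the first
# space, producing a broken "http://example.com/my" link:
print("http://example.com" + path)

# Percent-encoded, the URL survives as one unbroken token:
print("http://example.com" + quote(path))   # .../my%20summer%20photos.html
```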
And you, have you ever had to deal with Unicode in URLs? Were you satisfied with the tools of the trade?