Introduction: sitemaps and generators
The best sitemaps are those generated in real time by the CMS in use: they can't be out of date and can produce a correct lastmod value. When an integrated generator is not available or doesn't fit the current need (and there are occasions when it really doesn't), you use an external XML Sitemap Generator.
Note: in this instance I will only talk about XML sitemaps as defined by the standard protocol, leaving aside extensions like image or video sitemaps.
How Sitemap Generators work
External sitemap generators work by fully crawling the site and gathering all its public URLs. For a sitemap generator to be reliable, several conditions have to be met:
- the generator's spider has to fully respect the robots exclusion protocol (robots.txt, robots meta tag, nofollow attributes in links).
- the generator has to understand all the directives the site uses to resolve duplicate content issues (e.g. canonical URLs, URL normalization, etc.)
- the generator has to produce a correct UTF-8 XML document (surprisingly, some tools fail here because they build it with a plain text handling library and do not escape XML entities correctly, for example not encoding ampersands as &amp;)
- the generator has to encode the URLs correctly and produce XML that complies with the sitemap syntax (many tools don't do this either. Although UTF-8 of course supports Unicode characters, all non-ASCII characters in the URL path should be percent-encoded; the same holds for IDN host names, where the punycode version of the name has to be used).
These points are all up to the generator, of course.
And yes, tools help a lot. You can sometimes even try to build a sitemap "by hand", but when it comes to dealing with proper encoding and parameter management, the approach would be far from flawless.
Two more conditions depend on the site rather than on the tool:
- the site needs to have resolved all its duplicate content issues, or the generator would have no way to distinguish redundant URLs and would add them all to the sitemap. This is up to the webmaster.
- a well-conceived link structure helps the generator list URLs in a better order and makes it easier to assign the (optional) priority value.
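The two encoding conditions above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular generator works: the function names encode_url and loc_element are hypothetical, URLs are assumed to be absolute and without user credentials, and any "%" already in the URL is assumed to belong to an existing percent-encoded sequence.

```python
from urllib.parse import urlsplit, urlunsplit, quote
from xml.sax.saxutils import escape


def encode_url(url: str) -> str:
    """Return an ASCII-only form of an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    # IDN host names are converted to their punycode (ASCII) form.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port:
        host = f"{host}:{parts.port}"
    # Non-ASCII characters are percent-encoded as UTF-8 bytes. "%" stays in
    # the safe set so that already encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?~")
    return urlunsplit((parts.scheme, host, path, query, ""))


def loc_element(url: str) -> str:
    """Build a <loc> element, escaping XML special characters such as '&'."""
    # escape() handles &, < and >; quotes are added to cover all five entities.
    xml_safe = escape(encode_url(url), {"'": "&apos;", '"': "&quot;"})
    return f"<loc>{xml_safe}</loc>"


print(loc_element("http://bücher.example/straße?q=a&b"))
```

Note that the two steps are separate: percent-encoding and punycode make the URL valid as a URL, while entity escaping makes it valid as XML content. Skipping either one produces a broken sitemap.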
Visual SEO Studio sitemap generator
When I decided to add an XML Sitemap Editor to Visual SEO Studio, I wanted to address all the points mentioned above. I also wanted to give the user full control over what to export and what not.
What it does:
- it's visual: it lets you cherry-pick the pages to add from the whole set of crawled pages, or add them in bulk, or by directory...
- it correctly encodes and generates valid/well-formed UTF-8 XML sitemap documents, and correctly encodes all Unicode characters in the URLs
- the SEO suite gives you all the needed reports and tools to detect and fix any duplicate content issues
- the spider complies with the robots exclusion protocol
- the spider normalizes URLs, avoiding false dupes
- it automatically skips non-indexable URLs
- it lets you order the URLs in crawl order (breadth-first) or alphabetical order
- it lets you optionally set the priority based on link depth (easy, and it makes sense when the site has a clean link structure)
- it keeps the sitemaps compact by not writing out default values and by avoiding extra whitespace (so it's better to use an XML reader to read them)
- it adds a comment at the top of the XML file, aimed at human beings, stating how many URLs are listed and when the sitemap file was generated
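To make the last few points concrete, here is a minimal Python sketch of a compact sitemap writer. It is not Visual SEO Studio's actual code: the names depth_priority and build_sitemap, and the rule of lowering priority by 0.1 per link level, are assumptions made for illustration. URLs are assumed to be already percent-encoded and listed in the desired order.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def depth_priority(depth: int) -> float:
    """Hypothetical rule: 1.0 for the home page, 0.1 less per level, floor 0.1."""
    return max(0.1, round(1.0 - 0.1 * depth, 1))


def build_sitemap(pages, generated=None):
    """pages: iterable of (encoded URL, link depth) pairs, already ordered."""
    pages = list(pages)
    generated = generated or datetime.now(timezone.utc)
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        # Human-readable comment: how many URLs, and when the file was built.
        f"<!-- {len(pages)} URLs, generated {generated:%Y-%m-%d %H:%M} UTC -->",
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, depth in pages:
        entry = f"<url><loc>{escape(url)}</loc>"
        priority = depth_priority(depth)
        if priority != 0.5:  # 0.5 is the protocol default, so it stays implicit
            entry += f"<priority>{priority:.1f}</priority>"
        out.append(entry + "</url>")
    out.append("</urlset>")
    # Joining without whitespace between elements keeps the file compact.
    return "".join(out)


print(build_sitemap([("http://www.example.com/", 0),
                     ("http://www.example.com/about", 1)]))
```

The output is a single line of XML, which is why an XML reader is the more comfortable way to inspect it.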
What it doesn't do (yet):
- still doesn't enforce the limits of 50,000 URLs and 50MB uncompressed per file (as imposed by the protocol specs; the size limit was 10MB in earlier versions of the protocol)
- still doesn't enforce the limit of fewer than 2,048 characters per URL (as imposed by the protocol specs)
- still doesn't support sitemap index files (but you very rarely need them, if ever)
- still doesn't support gzip sitemaps (but you can gzip it on your own)
- doesn't add a lastmod value (see reason below)
- doesn't add alternate/hreflang extended info (see reason below)
- doesn't add a changefreq value (because it is never correct without a crystal ball, and is largely discarded by search engines)
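Some of these gaps are easy to cover on your own before or after the export. The following Python sketch shows one possible way to apply the URL count and URL length limits and to gzip the result. The names split_urls and gzip_sitemap are hypothetical, and the sketch simply drops over-long URLs, which is only one of several reasonable policies.

```python
import gzip

MAX_URLS = 50_000        # per-file URL limit from the sitemaps protocol
MAX_URL_LENGTH = 2_048   # a <loc> value must be shorter than this


def split_urls(urls, max_urls=MAX_URLS):
    """Drop over-long URLs and split the rest into chunks of at most max_urls."""
    valid = [u for u in urls if len(u) < MAX_URL_LENGTH]
    return [valid[i:i + max_urls] for i in range(0, len(valid), max_urls)]


def gzip_sitemap(xml_text: str) -> bytes:
    """Return the sitemap document as gzip-compressed UTF-8 bytes."""
    return gzip.compress(xml_text.encode("utf-8"))


urls = [f"http://www.example.com/page-{i}" for i in range(120_000)]
print([len(chunk) for chunk in split_urls(urls)])
```

Each chunk would become one sitemap file; more than one chunk is the case where a sitemap index file is needed.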
Many generators also add the optional lastmod value; they usually add a false value using the current date.
Visual SEO Studio doesn't add a lastmod for a few reasons:
- consistently seeing a new lastmod date in the sitemaps for pages that have not changed could lead a search engine to lose trust in the information
- while the tool could gather the correct dates from the HTTP Last-Modified headers if present, the information could become out of date, as the sitemaps are not generated in real time
- if the information is present in the HTTP header, there's no use adding it to the sitemaps: the search engine spider would see it anyway
These last two considerations also apply to the alternate/hreflang info.
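Going back to lastmod: for those who generate their sitemaps in real time and do want the value, the conversion from an HTTP Last-Modified header to the W3C date format used by sitemaps is short. This is a minimal Python sketch; the function name lastmod_from_header is hypothetical, and it keeps only the date part, which is one of the forms the format allows.

```python
from email.utils import parsedate_to_datetime


def lastmod_from_header(last_modified: str) -> str:
    """Convert an HTTP Last-Modified value to a YYYY-MM-DD lastmod value."""
    return parsedate_to_datetime(last_modified).date().isoformat()


print(lastmod_from_header("Wed, 21 Oct 2015 07:28:00 GMT"))
```

The value is only as trustworthy as the header it comes from, which is exactly the point made above.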
I was surprised to discover that the Visual SEO Studio sitemap generator was appreciated not only by small site owners, but also by administrators of large e-commerce sites who were not satisfied with the limited power of the real-time sitemaps their platform offered.
It's not uncommon to find help requests on SEO forums involving badly formed sitemaps. I think developers of sitemap generators should put more effort into making their output syntactically correct.
I hope this article helps toward that goal.