Manual: Crawl XML Sitemap
The feature "Crawl XML Sitemap" of Visual SEO Studio, documented in detail.
Crawl XML Sitemap
This powerful function permits you to audit XML Sitemaps by crawling all their listed URLs.
Sitemaps can be crawled recursively, and are presented nested within the intuitive user interface.
Not only can you crawl normal or index Sitemaps: the program goes a step further and even lets you crawl all the XML Sitemaps listed within a robots.txt file via the Sitemap: directive.
To learn more about the feature, please read the Crawl XML Sitemaps and robots.txt page.
XML Sitemap or robots.txt URL
Insert here the address of the XML Sitemap you want to audit, or of the robots.txt file.
The URLs listed in the XML Sitemaps will be downloaded and shown nested below the Sitemap node.
If you insert the URL of an Index Sitemap, there will be two levels of nesting, with the Index Sitemap at the top: all the XML Sitemaps listed in the Index Sitemap will be downloaded first, and then each Sitemap's URLs will be downloaded.
Analogously, if you insert the URL of a robots.txt file which uses the Sitemap: directive, there will be three nesting levels.
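As an illustration, here is a hypothetical robots.txt pointing the spider at a Sitemap Index via the Sitemap: directive (all URLs are invented for the example):

```
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap-index.xml
```

The referenced Index Sitemap then lists ordinary Sitemaps, whose own URLs form the deepest level, giving the three nesting levels described above:

```
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
```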
If you do not specify a protocol, http:// will be assumed.
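The scheme-defaulting behaviour described above can be sketched as follows (the helper name and approach are illustrative, not the program's actual code):

```python
from urllib.parse import urlparse

def normalize_url(raw: str) -> str:
    """Assume 'http://' when the user omits the protocol.
    Illustrative sketch of the documented default behaviour."""
    if not urlparse(raw).scheme:
        return "http://" + raw
    return raw
```

For example, `www.example.com/sitemap.xml` would become `http://www.example.com/sitemap.xml`, while an address already starting with `https://` is left untouched.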
Session Name (optional)
You can give your crawl session an optional descriptive name for your own convenience. You can also add or change it at a later time.
Clicking on the link will expand the window to let you access further crawl parameters.
Use HTTP Authentication
Access to websites under development could be restricted via HTTP authentication.
Clicking on the button opens a window where you can configure the access credentials used to audit an XML Sitemap of a website restricted via HTTP authentication.
Maximum Download Size per URL (KB)
The maximum size tolerated for downloaded web pages. Pages exceeding that size will be truncated.
A truncated page may impair the crawl process: links in the HTML content after the truncation point would not be found and followed. It is not that rare to find sites whose pages are - due to some error in the web server configuration - so bloated with useless stuff before the actual content (e.g. tons of scripts and CSS in the HTML
head section of some badly configured WP sites, or a huge ViewState at the beginning of the HTML
body section in badly conceived old ASP.NET WebForms pages) that no links at all can be found before the truncation point. In that case only the Home Page would be visited and the crawl session would end. This is exactly one of the cases covered in our troubleshooting FAQ.
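To see why a bloated page can defeat the crawler, here is a minimal sketch (naive regex link extraction, invented page content - not the program's actual parser):

```python
import re

MAX_DOWNLOAD_KB = 512  # default page size limit described in this manual

def links_in(html: str, limit_kb: int = MAX_DOWNLOAD_KB) -> list[str]:
    """Cut the page at the size limit, then extract href values.
    Any link sitting after the truncation point is simply never seen."""
    truncated = html[: limit_kb * 1024]
    return re.findall(r'href="([^"]+)"', truncated)

# Invented example: ~600 KB of inline script bloat before the first link.
bloated = ("<head><script>" + "x" * (600 * 1024) + "</script></head>"
           '<body><a href="/page2">next</a></body>')
```

With the default 512 KB limit, `links_in(bloated)` finds no links at all; raising the limit (e.g. to 1024 KB) makes the link visible again.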
In such cases you might want to increase the parameter limit.
We recommend keeping the default limit (512 KB) and changing it only when really needed.
There are reasons for the default limit to exist:
- Increasing the limit also increases the program's memory consumption during the crawl process.
Visual SEO Studio uses memory check-points during the crawl to prevent crashes due to limitations in the available computer RAM: every few thousand pages visited, available memory is checked to see whether it is enough for the next task; if available memory is not enough, the crawl session ends. We are proud to say that Visual SEO Studio is extremely robust against memory shortages.
Increasing the page size limit also increases the minimum amount of free memory the program requires, raising the probability that the crawl process stops before all pages are visited, even if the pages' actual size is not really that big.
- While Google can be very tolerant, and is able to download pages of up to 15-16 MB in size without truncation, this doesn't mean that big HTML pages are good for search engines:
Excessively big pages take long to render in the browser and make for a bad user experience. Search engines tend to demote them in ranking.
We recommend keeping the default truncation limit so as to detect size problems as early as possible.
When you need to increase the limit in order to complete a site exploration, we recommend also auditing page size using the Performance suggestions feature. When excessive page size is a common trait among all website pages, it is normally caused by a bad configuration on the server side or in the main template. Fix it there, and you fix it everywhere.
- A limit, however high or low, should exist to avoid so-called spider-traps based on infinite-size downloads, conceived by malicious sites to crash web bots by exhausting their RAM work memory.
That being said, we should also add that users of the free Community Edition are extremely unlikely to ever face memory-related problems, since they can only download up to 500 pages/images per crawl session. They can freely increase the limit without worrying about the issue.
On the contrary, users of higher editions who need to crawl big websites with hundreds of thousands of URLs should be more aware of the impact increasing the limit has on memory consumption. Better to fix the size issues first, before crawling the entire website.
Note: the limit only concerns web pages. For image files - when their download is enabled by the proper option - the spider applies a limit of 10 MB (which is really high; never keep web images that heavy!), and for XML Sitemaps the limit is the 50 MB limit of the Sitemap protocol.
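The per-resource caps above can be summarized in a small sketch (the function and resource names are illustrative, not the program's actual API):

```python
DEFAULT_PAGE_LIMIT_KB = 512

def max_download_bytes(resource: str,
                       page_limit_kb: int = DEFAULT_PAGE_LIMIT_KB) -> int:
    """Download caps per resource type, as described in the note above."""
    limits_kb = {
        "page": page_limit_kb,   # user-configurable, 512 KB by default
        "image": 10 * 1024,      # fixed 10 MB cap for image files
        "sitemap": 50 * 1024,    # 50 MB, from the Sitemap protocol limit
    }
    return limits_kb[resource] * 1024
```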
Maximum number of concurrent connections
SEO spiders try to speed up website visits by using multiple concurrent HTTP connections, i.e. requesting more web pages at the same time.
Visual SEO Studio does the same, even if its adaptive crawl engine can decide to push less hard if it detects that the web server would get overloaded.
This control lets you tell the spider how much harder it can push if the web server keeps responding fast.
Both the Visual SEO Studio edition and whether the website is among your Verified Websites influence how fast the spider is allowed to crawl:
For verified sites you can set up to 32 concurrent connections. For non-verified sites, the maximum is 5.
The free Community Edition can only use 2 concurrent connections at most.
Warning: increasing the number of connections could slow down or hang the server if it cannot keep up with the requests; do it at your own risk (that's why you can force more connections on verified sites only).
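The idea of capping concurrent connections can be sketched with a bounded thread pool (the fetch function is injected here for illustration; this is not the program's actual crawl engine):

```python
import concurrent.futures

def crawl(urls, fetch, max_connections=5):
    """Bounded-concurrency sketch: the pool never runs more than
    max_connections fetches at once, mirroring how the spider caps its
    concurrent HTTP connections (5 for non-verified sites, up to 32 for
    verified ones, 2 in the Community Edition)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_connections) as pool:
        # map preserves input order even though fetches run in parallel
        return list(pool.map(fetch, urls))
```

A real fetch function would issue the HTTP request; the pool size is the knob this control exposes.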