Manual: Non-crawled items
The feature "Non-crawled items" of Visual SEO Studio, documented in detail.
Non-crawled items
Non-crawled items are pages (or images, or other resources) the spider has found an indication of - for example a link pointing to them - but could not visit because some directive forbade it. A typical case is a directive in the robots.txt file preventing their exploration.
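For illustration only (not part of Visual SEO Studio), here is a minimal Python sketch of how a robots.txt Disallow directive makes a URL non-crawlable, using the standard urllib.robotparser module; the robots.txt content and example URLs are hypothetical:

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt content with a Disallow directive for a private area.
    robots_txt = "User-agent: *\nDisallow: /private/\n"

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A URL under /private/ is reported as non-crawlable; if linked from a
    # crawled page, it would show up among the non-crawled items.
    print(parser.can_fetch("*", "https://example.com/private/report.html"))  # False
    print(parser.can_fetch("*", "https://example.com/public/page.html"))     # True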
Toolbar
Items found
The number of non-crawled items found.
Context menu
Right-clicking on a row will pop up a contextual menu:
Non-crawled Items context menu
Context menu command items are:
- Go to Referrer URL
  Selects in the main view the node related to the "Referrer" URL, i.e. the address where the spider found the link to the resource.
- Show blocking robots.txt
  Available when the Crawl Status states that the resource was not crawled because of a block in the robots.txt file.
  Once clicked, the robots.txt file will be selected in the main view, the Content side pane will be activated, and the line containing the blocking directive in the robots.txt file will be highlighted.
- Find referrer link to the URL
  Selects the right pane DOM view and highlights there the HTML node where the spider found the link to the resource.
Column headers
Icon
Shows the icon indicating that the resource has not been explored.
Prog. #
Indicates the progressive number during the crawler exploration.
Thanks to this progressive number you can get an idea of how a search engine spider would explore your website, a piece of information you should take into account when dealing with Crawl Budget issues, typical of large websites.
For example, you may realize the spider takes exploration paths towards content areas you consider less important than the ones you deem more strategic; in such a case you should intervene on the website link structure.
Note: the crawl progressive number is an approximation:
Visual SEO Studio uses an exploration pattern called Breadth-first, which has been shown to be the most efficient at finding important content in the absence of external signals; the actual exploration order can change slightly because of the parallelization used for speed during the crawl process. Using a single crawl thread would make it strictly repeatable.
Search engine exploration patterns are, for their part, highly asynchronous, and exploration priority is weighted - in Google's case - by the resource's PageRank, which can be inflated by external links.
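As a rough illustration of the Breadth-first order (a simplified Python sketch over a hypothetical in-memory link graph, not the actual crawler code):

    from collections import deque

    # Hypothetical link graph: each page maps to the pages it links to.
    links = {
        "/": ["/products", "/blog"],
        "/products": ["/products/widget"],
        "/blog": ["/blog/post-1"],
        "/products/widget": [],
        "/blog/post-1": [],
    }

    def breadth_first_order(start):
        """Visit pages level by level, assigning a progressive crawl number."""
        queue = deque([start])
        seen = {start}
        order = []
        while queue:
            page = queue.popleft()
            order.append(page)
            for target in links[page]:
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return order

    for number, page in enumerate(breadth_first_order("/"), start=1):
        print(number, page)  # 1 /, 2 /products, 3 /blog, 4 /products/widget, ...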
Crawl Status
Indicates the Crawl Status: whether the resource has been requested and, if not, the reason why it wasn't.
URL
Uniform Resource Locator, the resource address.
For better search engine optimization it is preferable to have "friendly" URLs (i.e. URLs that anticipate the page content) that are not too long.
Authority Name
The combination of protocol, host name and, if different from the default value, port number.
An important piece of information you can see from the Authority Name, for example, is whether the URL is protected by the secure HTTPS protocol.
It can also be handy to have the authority name shown when exploring URL lists or sites with several sub-domains.
Path (encoded)
The resource path, with URL encoding when required.
Due to a limit of the HTTP protocol, a URL "running on the wire" can only contain ASCII characters (i.e. Western characters with no diacritics). URL encoding replaces special characters (diacritics, spaces, non-Western alphabet letters, ...) with their escape sequences.
Many URLs are composed only of ASCII characters, and since they do not need encoding, the encoded and decoded versions of their path look the same; but let's have a look at an example URL written in Cyrillic:
Path: /о-компании
(a typical URL path for a company page; it translates from Russian as /about-company)
Since the HTTP protocol cannot convey non-ASCII characters, in order to permit these human-readable URL paths the browser transparently encodes the characters before sending them on the wire to request the resource from the web server, transforming the example path into:
Path (encoded): /%D0%BE-%D0%BA%D0%BE%D0%BC%D0%BF%D0%B0%D0%BD%D0%B8%D0%B8
The encoding used is called percent-encoding.
Visual SEO Studio by default shows URLs and Paths in their decoded, human-readable form, but users might want to see the encoded version to investigate URL issues.
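If you want to reproduce the transformation outside the tool, the percent-encoding of the example path above can be verified with Python's standard urllib.parse module:

    from urllib.parse import quote, unquote

    decoded_path = "/о-компании"        # human-readable (decoded) path
    encoded_path = quote(decoded_path)  # percent-encoded form sent "on the wire"

    print(encoded_path)           # /%D0%BE-%D0%BA%D0%BE%D0%BC%D0%BF%D0%B0%D0%BD%D0%B8%D0%B8
    print(unquote(encoded_path))  # /о-компании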
Path (decoded)
The resource path (URL decoded, thus in human-readable form).
Blocking directive in robots.txt
When the Crawl Status states that the resource was not crawled because of a block in the robots.txt file, the cell reports the blocking directive of the robots.txt file.
The directive has an active link: once clicked, the robots.txt file will be selected in the main view, the Content side pane will be activated, and the line containing the directive in the robots.txt file will be highlighted.
Referrer URL (decoded)
The complete URL of the resource where the link to the present resource was followed (URL decoded).
The crawl paths taken by a bot during a website exploration help to understand the website link structure.
The Referrer URL is not necessarily the only URL linking to the resource, just the one the Visual SEO Studio spider followed to discover it.
You can locate all links to the resource with the context menu entry Find all links to the URL.
Referrer Path
The path of the URL of the resource where the link to the present resource was followed.
Referrer Path (decoded)
The path of the URL of the resource where the link to the present resource was followed (URL decoded).
Depth
The depth of the page in the site link structure, also known as "link depth", i.e. the number of clicks needed to reach it starting from the Home Page.
Knowing a page's depth from the main URL is important because search engines give more or less importance to a page relative to its distance from the main URL: the closer, the more important it is.
Note: this is a simplification; in the case of Google, for example, the Home Page is usually the page with the highest PageRank (a Google measure to assess the importance of a page; other search engines use similar models), so the pages linked directly from the Home Page are the ones receiving the most PageRank.
Furthermore, the greater the distance, the less likely the page is to be reached and explored by search engine spiders, because of the normally limited Crawl Budget (simplifying: the number of pages a search engine can explore within a certain time slot when visiting a website).
Thus, place the pages you want to give more weight to closer to the Home Page.
Link Depth is also important from a user perspective: it would be hard for users to find a piece of content starting from the Home Page if it takes many clicks to reach it.
A common usability rule wants each page to be reachable in three clicks or fewer. This is not always possible for very large websites; nevertheless, you should choose a link structure that minimizes each page's link depth.
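As a minimal sketch (hypothetical link graph, not the tool's internals), link depth is the minimum number of clicks from the Home Page, which a breadth-first visit computes:

    from collections import deque

    # Hypothetical link graph; depth is the minimum number of clicks from "/".
    links = {
        "/": ["/products", "/blog"],
        "/products": ["/products/widget"],
        "/blog": ["/blog/post-1"],
        "/products/widget": ["/products/widget/specs"],
        "/blog/post-1": [],
        "/products/widget/specs": [],
    }

    def link_depths(home="/"):
        """Return the minimum click distance of every reachable page from the Home Page."""
        depths = {home: 0}
        queue = deque([home])
        while queue:
            page = queue.popleft()
            for target in links[page]:
                if target not in depths:
                    depths[target] = depths[page] + 1
                    queue.append(target)
        return depths

    print(link_depths())  # e.g. {'/': 0, '/products': 1, '/blog': 1, ..., '/products/widget/specs': 3}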