Manual: Crawl Session
The feature "Crawl Session" of Visual SEO Studio, documented in detail.
Crawl Session
A Crawl Session consists of the data set resulting from the spider visiting a web site.
This panel details all the main properties of the crawl session, and the crawl options used.
If the crawl session is loaded in memory, you can also see a visual summary with graphs in several tab sheets.
If the crawl session is currently running, the main property values are updated in real-time.
Main properties
ID
Each crawl session is uniquely identified by an auto-assigned, progressive ID number.
Start URL
The address from where the spider started visiting the website. When starting a new exploration you will typically enter the website Home Page, usually the "root" address. For explorations of lists of URLs the field is not populated.
Session name
You can give your sessions an optional descriptive name. The name can be assigned when choosing the crawl parameters, or at a later time.
Items
The total number of elements managed in the crawl session. This includes pages, images and other resources, along with HTTP redirects and non-crawled items.
Pages
The number of web pages visited during the crawl session. Only web pages are counted; HTTP requests to images or other resources are not.
HTTP requests
The total number of HTTP calls performed during the crawl process.
Crawl start
Date and time when the crawl session was launched.
Duration
The time it took to complete the crawl session. When the crawl session is still in progress, this field is not populated.
Completion reason
The reason why the crawl session completed. Normally you expect it to complete because all the links found were visited, but it could have been stopped by the user, or it could have stopped for other reasons.
Crawl type
The type of exploration. At first you will normally perform only explorations of type "Link Search", i.e. you will use the spider to visit a website starting from the root address and follow all the links found. With more advanced uses you might also audit XML Sitemaps, or explore lists of URLs.
Shown subset
This field is only visible when the main view holds the result of a filtering, for example when you have opened in Tabular View the result set of a site analysis report, or the outcome of a Custom Filter.
It reports the number of shown pages, their percentage over the total number of pages within the session, and the name of the filter applied.
Session summary
Show/Hide crawl options
Only available when the crawl session is not loaded in memory, the Show crawl options link toggles the visibility of a grid detailing all the crawl options set for the crawl process.
When the crawl session is loaded in memory, the crawl options grid is available in a dedicated tab below the main crawl session fields.
Status Codes
A pie chart breaking down the HTTP response codes received from the web server for the whole crawl session.
Response codes can be summarized in five standard classes:
- 1xx Informational response – the request was received and its processing is ongoing (it is very unlikely you will ever see a 1xx response code)
- 2xx Success – the request was received, understood, accepted and served successfully (it is the response code you normally want to see)
- 3xx Redirection – the requested resource is no longer at the address used
- 4xx Client Error – the request has a syntax error or cannot be honored
- 5xx Server Error – the web server was unable to honor an apparently valid request
Some very common responses are for example 200 (OK, the standard response for HTTP requests successfully served) and 301 (Moved Permanently, used when a page URL has changed and you want neither to "break" external links pointing to the old URL, nor to lose the page indexation on search engines, and want to preserve its PageRank).
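As a side note, the class of a response code is simply its hundreds digit. A minimal Python sketch (the labels just mirror the list above, and the function name is only illustrative):

    CLASSES = {
        1: "Informational response",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }

    def status_class(code: int) -> str:
        # The class of an HTTP status code is its hundreds digit.
        return f"{code // 100}xx {CLASSES.get(code // 100, 'Unknown')}"

    print(status_class(200))  # 2xx Success
    print(status_class(301))  # 3xx Redirection
    print(status_class(404))  # 4xx Client Error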
Redirects work as follows: when an old URL is requested, the web server answers the client (a browser, or a search engine spider) with an HTTP 3xx code to report that the address has changed, adding the new address in the HTTP headers. The client then has to request the resource at the new address with a new HTTP call and, in case of a permanent redirect, it may remember the redirection for the future in order to avoid making a double call when a link to the old address is clicked again.
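To make the exchange concrete, here is a minimal sketch in Python (standard library only) of what a client sees when requesting an old URL; the host name and paths are hypothetical placeholders, and the sketch assumes the new address stays on the same host:

    import http.client
    from urllib.parse import urlsplit

    conn = http.client.HTTPSConnection("www.example.com")
    conn.request("GET", "/old-page")        # first call: the old address
    resp = conn.getresponse()
    resp.read()                             # drain the body so the connection can be reused

    print(resp.status)                      # e.g. 301 Moved Permanently
    new_url = resp.getheader("Location")    # the new address, announced in the HTTP headers
    print(new_url)

    # The client now has to make a second HTTP call to the new address
    # (browsers and spiders do this automatically - the "double call" mentioned above).
    if new_url:
        conn.request("GET", urlsplit(new_url).path or "/")
        print(conn.getresponse().status)    # hopefully 200 at the final address

    conn.close()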
Redirects can be implemented on the server side using several methods, depending on the technology used and the platform the web server runs on: for example by configuring the .htaccess file on Apache web servers with generic or specific rules; with dedicated plugins in a WordPress installation; or, in case of websites built with ASP.NET technology, with rules expressed in the web.config file, directives set in the single page, or in the logic of the CMS engine used.
Having redirects is not an error per se, but if they are detected - as normally happens - during a normal site crawl following internal links, it is a sign that those internal links were not updated after the URL change. It is recommended to update the internal links with the new URLs, in order not to slow down the user navigation experience and not to waste the crawl budget allotted by the search engine.
Particular attention should be given to the 4xx response codes, which Visual SEO Studio rightly reports as errors.
The 4xx codes you will stumble upon are usually 404 (Resource not found) and the nearly identical 410 (Resource no longer existing). Their presence is a symptom of a broken link that should be corrected, because users and search engines cannot reach the link destination page.
5xx response codes are errors that occurred on the web server while it was trying to build the resource to return to the browser or the spider.
They could be a temporary issue, but they should normally not be ignored: it is better to report them to the developer and investigate on the server side. 5xx errors are a very bad user experience, they make visitors abandon the website, and if repeated over time they can potentially cause de-indexation by the search engines.
For a more in-depth description of HTTP response codes you can consult the following page on Wikipedia: HTTP status codes
Link Depth
A histogram illustrating the "link depth" distribution over all the website pages (assuming the crawl process started from the Home Page).
Link depth, also known as "crawl depth", is the depth of the page in the site link structure, i.e. the number of clicks needed to reach it starting from the Home Page.
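As an illustration, link depth can be seen as the result of a breadth-first visit of the internal links starting from the Home Page. The following Python sketch computes it for a small invented link graph (the page names and links are made up, and the Home Page is assigned depth 0 here; this is not necessarily how Visual SEO Studio computes it internally):

    from collections import deque

    # Invented internal-link graph: each page lists the pages it links to.
    links = {
        "home":    ["about", "blog"],
        "about":   ["contact"],
        "blog":    ["post-1", "post-2"],
        "post-1":  ["post-2"],
        "post-2":  [],
        "contact": [],
    }

    def link_depths(graph, start="home"):
        # Breadth-first visit: depth = minimum number of clicks from the start page.
        depths = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for target in graph.get(page, []):
                if target not in depths:   # first time reached = shortest path
                    depths[target] = depths[page] + 1
                    queue.append(target)
        return depths

    print(link_depths(links))
    # {'home': 0, 'about': 1, 'blog': 1, 'contact': 2, 'post-1': 2, 'post-2': 2}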
Knowing a page's depth from the main URL is important because search engines give a page more or less importance relative to its distance from the main URL: the closer it is, the more important it is considered.
Note: this is a simplification; in the case of Google, for example, the Home Page is usually the page with the greatest PageRank (a Google measure to assess the importance of a page; other search engines use similar models), so the pages linked directly from the Home Page are the ones receiving the most PageRank.
Furthermore, the greater the distance, the less likely the page is to be reached and explored by the search engine spiders, because of the normally limited Crawl Budget (simplifying: the number of pages a search engine can explore within a certain time slot when visiting a website).
Thus, place the pages you want to give more weight closer to the Home Page.
Link Depth is also important from a user perspective: it would be hard for users to find a piece of content starting from the Home Page if it takes many clicks to reach it.
A common usability rule wants each page to be reachable in three clicks or less. This is not always possible for very large websites; nevertheless you should choose a link structure that minimizes each page's link depth.
Link Depth Mean
The mean value of the link depth of all website pages.
Link Depth Median
The median value of the link depth of all website pages.
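To see why both values are reported, here is a short Python sketch using the standard statistics module on an invented depth distribution; when the two values diverge noticeably, a few very deep pages are usually skewing the mean:

    import statistics

    # Invented link-depth distribution: most pages are shallow, one is very deep.
    depths = [1, 1, 1, 2, 2, 2, 2, 3, 3, 9]

    print(statistics.mean(depths))    # 2.6 - pulled up by the single deep page
    print(statistics.median(depths))  # 2.0 - the "typical" page depth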
Download Time
A histogram illustrating the distribution of the download time (in ms) over all the website pages.
High values for all pages may indicate performance problems of the web server hosting the website. High values for single pages likely indicate content that is too heavy.
Consider the page download time along with the page size: a high download time with a large page size indicates a page that is too heavy, while a high download time with a small page size indicates performance problems on the server side.
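This rule of thumb can be sketched in a few lines of Python; the thresholds below are purely illustrative assumptions, not values used by Visual SEO Studio:

    # Illustrative thresholds only - pick values that make sense for your site.
    SLOW_MS = 1500          # download time considered "high"
    HEAVY_BYTES = 500_000   # page size considered "high"

    def diagnose(download_ms: int, size_bytes: int) -> str:
        # Apply the heuristic described above to a single page.
        if download_ms < SLOW_MS:
            return "no obvious issue"
        if size_bytes >= HEAVY_BYTES:
            return "page too heavy: reduce its content and resources"
        return "slow despite a small size: investigate server-side performance"

    print(diagnose(2300, 850_000))  # page too heavy: reduce its content and resources
    print(diagnose(2300, 40_000))   # slow despite a small size: investigate server-side performance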
Note: You can access this and much more detailed information about website performance by using the "Performance suggestions" tool.
Download Time Mean
The mean value of the download time of all website pages.
Download Time Median
The median value of the download time of all website pages.
Crawl Options
A grid detailing all crawl options set for the crawl process that produced the current session.