Manual: Session Progress
The feature "Session Progress" of Visual SEO Studio, documented in detail.
Session Progress
A Crawl Session consists of the data set resulting from the spider visiting a website.
This panel shows an instantaneous snapshot, updated in real time, of the website exploration progress.
In the upper part you can see the main measures of the running crawl session.
In the lower part are graphs - also updated in real time as the crawl proceeds - giving an overall view of HTTP response codes, link depth and download time.
Main properties
Queued URLs
The number of URLs still in the exploration queue.
As the spider finds links, it reads their URLs; any URL not already met is enqueued in the list of URLs to be visited.
The exploration process ends when no new URLs are left in the queue, when the maximum number of pages is reached (for example, in the free Community Edition the maximum is 500 pages and/or images), or when stopped by the user.
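The queue mechanism can be sketched in a few lines of Python (a simplified illustration, not Visual SEO Studio's actual implementation; fetch_links is a hypothetical callback returning the URLs linked from a page):

    from collections import deque

    def crawl(start_url, fetch_links, max_pages=500):
        # Simplified sketch of the exploration queue described above.
        queue = deque([start_url])   # "Queued URLs"
        seen = {start_url}           # URLs already met are enqueued only once
        visited = 0                  # "Visited Pages"
        while queue and visited < max_pages:
            url = queue.popleft()
            visited += 1
            for link in fetch_links(url):
                if link not in seen:     # enqueue only URLs not already met
                    seen.add(link)
                    queue.append(link)
        return visited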
HTTP Requests
The number of HTTP requests performed so far.
Visited Pages
The number of web pages visited so far. Only web pages are counted; HTTP requests for images or other resources are not.
Max pages/images Nr
The maximum number of pages and/or images visitable by the spider, as set in the crawl options.
Elapsed Time
The time spent so far exploring the website.
Simulated G-time
The minimum time a web search crawler would have spent so far to visit the same number of pages.
This value - reported in the G. Crawl-Delay field - can be customized via the Simulated Crawl Rate option panel, which you can access from the program main menu at the entry -> Preferences....
This piece of information is important when investigating crawl budget issues and apparent slowness in being indexed, because it shows how long a search engine web crawler would take to visit the website.
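For example, assuming the value is read as visited pages multiplied by the simulated delay (consistent with "the minimum time a web search crawler would have spent"), with purely hypothetical numbers:

    # Hypothetical figures, only to illustrate how to read the value:
    visited_pages = 500       # pages visited so far
    g_crawl_delay = 0.5       # simulated delay in seconds between page requests
    simulated_g_time = visited_pages * g_crawl_delay
    print(simulated_g_time)   # 250.0 seconds, i.e. a little over 4 minutes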
Crawl-Delay
The instantaneous courtesy delay - i.e. the delay between each HTTP request - applied by the spider.
It is normally zero seconds, except when the explored website has a Crawl-Delay directive in its robots.txt file and the website is not in the Verified Sites list (the Crawl-Delay directive is respected up to 2 seconds), or when explicitly set by the user to slow down the visit to a website.
Keep in mind that when the value is not zero, parallel HTTP requests cannot be performed (but parsing and processing the results can).
When different from zero seconds, the value is highlighted in red.
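For reference, a Crawl-Delay directive in a robots.txt file looks like this (hypothetical value):

    User-agent: *
    Crawl-delay: 2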
G. Crawl-Delay
The courtesy delay that would be applied by a search engine spider between each page request.
It can be customized via the Simulated Crawl Rate option panel, which you can access from the program main menu at the entry -> Preferences....
You should pay attention to it when investigating crawl budget issues and apparent slowness in being indexed, because it helps you understand how long a search engine web crawler would take to visit the website.
Session summary
Status Codes
A pie chart breaking down the HTTP response codes received from the web server for the whole crawl session.
Response codes can be summarized in five standard classes:
- 1xx Informative response – the request was received and is still being processed (it is very unlikely you will ever see a 1xx response code)
- 2xx Success – the request was received, understood, accepted and served (it is the response code you normally want to see)
- 3xx Redirection – the requested resource is no longer at the address used
- 4xx Client Error – the request has a syntax error or cannot be honored
- 5xx Server Error – the web server was unable to honor an apparently valid request
Some very common responses are, for example, 200 (OK - the standard response for HTTP requests successfully served) and 301 (Moved Permanently - used when a page URL has changed and you don't want to "break" external links pointing to the old URL, lose the page's indexation on search engines, or give up its PageRank).
Redirects work as follows: when an old URL is requested, the web server answers the client (a browser, or a search engine spider) with a 3xx HTTP code to report that the address has changed, adding the new address in the HTTP header. The client then has to request the resource at the new address with a new HTTP call; in case of a permanent redirect, it may remember the redirection for the future, in order to avoid making a double call when the link to the old address is followed again.
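As an illustration, a permanent redirect exchange looks roughly like this (hypothetical URLs):

    GET /old-page.html HTTP/1.1
    Host: www.example.com

    HTTP/1.1 301 Moved Permanently
    Location: https://www.example.com/new-page.html

The client then issues a second request, to /new-page.html, to obtain the actual content.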
Redirects can be implemented on the server side using several methods, depending on the technology used and the platform the web server is running on: for example, by configuring the .htaccess file on Apache web servers with generic or specific rules; with dedicated plugins in a WordPress installation; or, in the case of websites built with ASP.NET technology, with rules expressed in the web.config file, directives set in the individual page, or the logic of the CMS engine used.
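For example, on an Apache web server a specific rule and a generic rule in the .htaccess file could look like this (hypothetical paths, a sketch rather than a ready-made configuration):

    # Specific rule: redirect one old URL to its new address
    Redirect 301 /old-page.html /new-page.html

    # Generic rule: redirect a whole renamed folder (requires mod_rewrite)
    RewriteEngine On
    RewriteRule ^old-blog/(.*)$ /blog/$1 [R=301,L]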
Having redirects is not an error per se, but if they are detected - as normally happens - during a regular site crawl navigating internal links, it is a sign that those internal links were not updated after the URLs changed. It is recommended to update the internal links with the new URLs, so as not to slow down the user navigation experience and not to waste the crawl budget allotted by the search engine.
Particular attention should be given to the 4xx response codes, which Visual SEO Studio rightly reports as errors.
The 4xx codes you will stumble upon are usually 404 (Resource not found) and the nearly identical 410 (Resource no longer existing). Their presence is a symptom of a broken link that should be corrected, because neither users nor search engines can reach the link destination page.
5xx response codes are errors that occurred on the web server while it was trying to build the resource to return to the browser or the spider.
They could be a temporary issue, but they should normally not be ignored; it is better to report them to the developer and investigate on the server side. 5xx errors are a very bad user experience, make visitors abandon the website, and can potentially cause de-indexation by the search engines if repeated over time.
For a more in-depth description of HTTP response codes you can consult the following page on Wikipedia: HTTP status codes
Link Depth
A histogram illustrating the "link depth" distribution over all the website pages (assuming the crawl process starteded from the Home Page).
Link depth, also known as "crawl depth", is the depth of the page in the site link structure, i.e. the number of clicks needed to reach it starting from the Home Page.
Knowing a page's depth from the main URL is important because search engines give a page more or less importance relative to its distance from the main URL: the closer it is, the more important.
Note: this is a simplification; in the case of Google, for example, the Home Page is usually the page with the greatest PageRank (a Google measure to assess the importance of a page; other search engines use similar models), so the pages linked directly from the Home Page are the ones receiving the most PageRank.
Furthermore, the greater the distance, the less likely the page is to be reached and explored by the search engine spiders, because of the normally limited Crawl Budget (simplifying: the number of pages a search engine can explore within a certain time slot when visiting a website).
Thus, place the pages you want to give more weight closer to the Home Page.
Link Depth is also important from a user perspective: it would be hard for users to find a piece of content starting from the Home Page if it takes many clicks to reach it.
A common usability rule wants each page to be reachable in three clicks or fewer. This is not always possible for very large websites; nevertheless, you should choose a link structure that minimizes each page's link depth.
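Link depth can be pictured as the breadth-first distance from the Home Page in the graph of internal links; a minimal Python sketch, assuming the site is represented as a hypothetical adjacency map of internal links:

    from collections import deque

    def link_depths(home_url, links):
        # links: hypothetical dict mapping each URL to the URLs it links to.
        # Depth 0 is the Home Page, depth 1 its directly linked pages, etc.
        depths = {home_url: 0}
        queue = deque([home_url])
        while queue:
            url = queue.popleft()
            for target in links.get(url, []):
                if target not in depths:       # keep the shortest click path
                    depths[target] = depths[url] + 1
                    queue.append(target)
        return depths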
Link Depth Mean
The mean value of the link depth of all website pages.
Link Depth Median
The median value of the link depth of all website pages.
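Reading mean and median together tells you whether a few very deep pages are skewing the picture (the same holds for the Download Time figures below); a quick illustration with hypothetical depths:

    import statistics

    depths = [1, 1, 2, 2, 2, 3, 9]      # hypothetical link depths
    print(statistics.mean(depths))      # about 2.86 - pulled up by the single deep page
    print(statistics.median(depths))    # 2 - the typical page is still 2 clicks away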
Download Time
A histogram illustrating the distribution of the download time (in ms) over all the website pages.
High values for all pages may indicate performance problems of the web server hosting the website; high values for single pages likely indicate content that is too heavy.
Consider a page's download time along with its page size: a high download time with a high page size indicates a page that is too heavy, while a high download time with a low page size indicates performance problems on the server side.
Note: you can access this and much more detailed information about website performance by using the "Performance suggestions" tool.
Download Time Mean
The mean value of the download time of all website pages.
Download Time Median
The median value of the download time of all website pages.
Crawl Options
A grid detailing all crawl options set for the crawl process that produced the current session.