Manual: Visual Real-Time Crawler
The feature "Visual Real-Time Crawler" of Visual SEO Studio, documented in detail.
Visual Real-Time Crawler
This is the view you have during a crawl process.
The view is essentially the same as the Crawl View, applied to data gathered in real time during the crawl process rather than to data already crawled and stored.
It visually shows a site's crawl paths: the natural exploration paths a search engine spider takes to visit a website starting from the home page.
In the upper-right corner of the main tab you can see a symbol resembling the "Recording..." indicator of video cameras, with a related message. In the lower part there is a progress bar showing the percentage progress of the website exploration, and buttons to Pause, Resume and Stop the crawl process.
The view consists of a main tree view plus auxiliary sheets in the right and bottom panes. Each auxiliary sheet has a dedicated help page, accessible by clicking on the sheet.
Two auxiliary panels are specific to this view: Progress on the right side, and Output at the bottom.
What follows is a detailed description of all view controls and fields. You can also learn more by reading the Visual Real-Time Crawler page.
Head info and tools
All views share a view switcher at the top of the window, to quickly switch from one data representation to another.
Clicking one of its buttons opens the corresponding window, or selects it if it is already open.
- Manage Sessions
- Crawl View
- Folder View
- Tabular View
Note: the View switcher is enabled only once the crawl process is completed or stopped.
The address from which the spider started visiting the website. At the start of a new exploration you will typically enter the website Home Page, usually the "root" address. For explorations of lists of URLs the field is not populated.
You can give your sessions an optional descriptive name. The name can be assigned when choosing the crawl parameters, or at a later time.
The tree view nodes can be expanded or collapsed at will: individually, by clicking on the +/- symbol beside the tree node, or as a group.
For the latter case, three controls are at your disposal:
- Clicking on the expand button expands all nodes. All nodes expanded is the default state.
- Clicking on the collapse button collapses all nodes, leaving only the root nodes visible.
- Using the Expand up to level numeric up/down control, the tree view expands only up to the desired level.
Show Crawl Progressive
Enables/disables the visibility of the Crawl progressive column.
Crawl progressive can be really helpful in some cases, but in daily usage you usually want it hidden to focus more on things such as status codes, titles and descriptions, so by default it is hidden.
Clicking on this link will expand a Legend detailing the meaning of icons and colors used.
For a more comprehensive explanation, read Understanding colors used for URLs.
Right-clicking on an item will pop up a contextual menu:
Visual Real-Time Crawler context menu
Context menu command items are:
Go to Referrer URL
Selects in the main view the node related to the "Referrer" URL, i.e. the address where the spider found the link to the resource.
Copies the URL of the selected resource to the clipboard.
Opens the URL of the selected resource in the default browser.
Opens the Take a Screenshot dialog window, where you can preview the page, choose the desired resolution, and take a full-height screenshot of the web page.
Opens the Screenshot History view to show all screenshots taken for the selected resource over time.
Find pages linking to the URL
Opens in a new Tabular View all pages linking to the resource.
Note: available only once the crawl process is completed or stopped.
Find all links to the URL
Opens the Links Inspector to locate all links pointing to the resource.
Note: available only once the crawl process is completed or stopped.
Find referrer link to the URL
Selects the DOM view in the right pane and highlights there the HTML node where the spider found the link to the resource.
This column holds a tree view of the website link structure, the Crawl View described above.
Each node text is the resource path (URL decoded, thus in human-readable form).
For root nodes, the text is the website authority name (the combination of protocol, host name and, if different from the default value, port number).
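As an illustration only, the node texts described above could be derived from a URL roughly as follows. This is a minimal sketch, not the program's actual code: the function names authority_name and node_text and the DEFAULT_PORTS table are hypothetical, and only the Python standard library is used.

```python
from urllib.parse import urlsplit, unquote

# Default ports per protocol; an assumption limited to the two web protocols.
DEFAULT_PORTS = {"http": 80, "https": 443}


def authority_name(url: str) -> str:
    """Protocol + host name, plus the port number only when it is not the default."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return f"{scheme}://{host}"
    return f"{scheme}://{host}:{port}"


def node_text(url: str) -> str:
    """URL-decoded resource path, in human-readable form."""
    return unquote(urlsplit(url).path) or "/"


print(authority_name("https://www.example.com:443/shop"))   # https://www.example.com
print(authority_name("http://www.example.com:8080/shop"))   # http://www.example.com:8080
print(node_text("https://www.example.com/caf%C3%A9/"))      # /café/
```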
Indicates the progressive number during the crawler exploration.
Thanks to this progressive number you can get an idea of how a search engine spider would explore your website, a piece of information you should take into account when dealing with Crawl Budget issues, typical of large websites.
For example, you may realize the spider takes exploration paths towards content areas you consider less important than the ones you consider more strategic; in that case you should adjust the website link structure.
Note: the crawl progressive number is an approximation:
Visual SEO Studio uses an exploration pattern called Breadth-first, which has been shown to be the most efficient at finding important content in the absence of external signals; the actual exploration order can change slightly because of the parallelization used for speed during the crawl process. Using a single crawl thread would make it strictly repeatable.
Search engine exploration patterns are, for their part, highly asynchronous, and exploration priority is weighted by (in Google's case) the resource's PageRank, which could be inflated by external links.
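To make the idea of a breadth-first crawl progressive concrete, here is a minimal single-threaded sketch. It is not Visual SEO Studio's implementation: the link graph is a hypothetical in-memory dictionary standing in for real pages, and the function name breadth_first_order is an assumption made for illustration.

```python
from collections import deque


def breadth_first_order(link_graph, start_url):
    """Return (progressive, referrer) dictionaries for a breadth-first visit.

    link_graph maps each URL to the list of URLs it links to.
    progressive maps each URL to its 1-based exploration number.
    referrer maps each URL to the page where its link was first found.
    """
    progressive = {start_url: 1}
    referrer = {start_url: None}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()  # FIFO queue: pages closer to the start come first
        for target in link_graph.get(url, []):
            if target not in progressive:  # skip URLs already discovered
                progressive[target] = len(progressive) + 1
                referrer[target] = url
                queue.append(target)
    return progressive, referrer


# Hypothetical site structure, used only for this example.
site = {
    "/": ["/blog", "/shop"],
    "/blog": ["/blog/post-1", "/"],
    "/shop": ["/shop/item-1", "/blog"],
}
order, found_on = breadth_first_order(site, "/")
print(order)     # {'/': 1, '/blog': 2, '/shop': 3, '/blog/post-1': 4, '/shop/item-1': 5}
print(found_on)  # {'/': None, '/blog': '/', '/shop': '/', '/blog/post-1': '/blog', '/shop/item-1': '/shop'}
```

With a single queue and a single worker the numbering is strictly repeatable; with several parallel workers, pages at the same depth may complete in a slightly different order, which is why the progressive number is only an approximation.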
States whether the size of the resource exceeded the download limit.
The maximum amount you can download for a resource can be customized before crawling a website by using the option "Maximum Download Size per URL (KB)". Note that a limit is necessary to avoid so-called "spider traps".
Possible values are:
- Blank, when the spider has downloaded the resource completely
- when the spider could NOT download the resource completely
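The effect of a per-URL download limit can be sketched as follows. This is only an illustration of the concept under stated assumptions: the function name read_with_limit is hypothetical, and the stream is assumed to return the requested number of bytes unless it reaches the end of the data (true for io.BytesIO, used here in place of a real network response).

```python
import io


def read_with_limit(stream, max_kb):
    """Read at most max_kb kilobytes; report whether the content was cut off."""
    limit = max_kb * 1024
    data = stream.read(limit + 1)   # one extra byte reveals whether more content exists
    truncated = len(data) > limit
    return data[:limit], truncated


body, truncated = read_with_limit(io.BytesIO(b"x" * 3000), max_kb=2)
print(len(body), truncated)   # 2048 True

body, truncated = read_with_limit(io.BytesIO(b"x" * 1000), max_kb=2)
print(len(body), truncated)   # 1000 False
```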
HTTP Status Code
The HTTP response code received from the web server upon requesting the resource.
Response codes can be summarized in five standard classes:
- 1xx Informative response – request was received and its processing is going on (it is very unlikely you will ever see a 1xx response code)
- 2xx Success – request was received successfully, understood, accepted and served (it is the response code you normally want to see).
- 3xx Redirection – the requested resource is no longer at the address used
- 4xx Client Error – request has a syntax error or cannot be honored
- 5xx Server Error – the web server was unable to honor an apparently valid request
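The five classes listed above depend only on the first digit of the code. A minimal sketch of that mapping (the function name status_class is hypothetical):

```python
def status_class(code: int) -> str:
    """Map an HTTP response code to one of the five standard classes."""
    classes = {
        1: "Informative response",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
# 200 Success
# 301 Redirection
# 404 Client Error
# 503 Server Error
```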
Some very common responses are, for example, 200 (OK, the standard response for HTTP requests successfully served) and 301 (Moved Permanently, used when a page URL has changed and you don't want to "break" external links to the old URL, lose the page's indexation on search engines, or lose its PageRank).
Redirects work as follows: when an old URL is requested, the web server answers the client (a browser, or a search engine spider) with an HTTP 3xx code to report that the address has changed, adding the new address in the HTTP header. The browser then has to request the resource at the new address with a new HTTP call and, in the case of a permanent redirect, can remember the redirection in order to avoid making a double call when the link to the old address is clicked again.
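The request-and-follow sequence just described can be sketched without any network access. In this illustration the "server" is a hypothetical dictionary mapping each URL to a (status code, new address) pair, where the new address stands in for the value a real server would send in the Location header; the function name follow_redirects and the max_hops parameter are assumptions.

```python
def follow_redirects(responses, url, max_hops=5):
    """Follow 3xx answers until a final response; return (status, chain of URLs)."""
    chain = [url]
    for _ in range(max_hops):
        status, location = responses.get(url, (404, None))
        if status // 100 != 3 or location is None:
            return status, chain   # final answer: no further call is needed
        url = location             # a new HTTP call goes to the new address
        chain.append(url)
    raise RuntimeError("too many redirects")


# Simulated server answers, invented for this example.
server = {
    "/old-page": (301, "/new-page"),
    "/new-page": (200, None),
}
print(follow_redirects(server, "/old-page"))   # (200, ['/old-page', '/new-page'])
```

Each extra entry in the chain is one more round trip for the visitor and for the spider.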
Redirects can be implemented on the server side using several methods, depending on the technology used and the platform the web server is running on. For example, by configuring the .htaccess file on Apache web servers with generic or specific rules; with dedicated plugins in a WordPress installation; or, for websites built with ASP.NET, with rules expressed in the web.config file, directives set in the single page, or the logic of the CMS engine in use.
Having redirects is not an error per se, but if they are detected (as normally happens) during a normal site crawl navigating internal links, it is a sign that those internal links were not updated after the URLs changed. It is recommended to update the internal links with the new URLs in order not to slow down the user navigation experience and not to waste the crawl budget allotted by the search engine.
Particular attention should be given to the 4xx response codes, which Visual SEO Studio rightly reports as errors.
The 4xx codes you will stumble upon are usually 404 (Resource not found) and the nearly identical 410 (Resource no longer existing). Their presence is a symptom of a broken link that should be corrected, because users and search engines cannot reach the link destination page.
5xx response codes are errors that occurred on the web server while it was trying to build the resource to return to the browser or the spider.
They could be a temporary issue, but they should normally not be ignored: it is better to report them to the developer and investigate on the server side. 5xx errors are a very bad user experience, make visitors abandon the website, and can potentially cause de-indexation by search engines if repeated over time.
For a more in-depth description of HTTP response codes you can consult the following page on Wikipedia: HTTP status codes
The HTML page title, as read from the title HTML tag.
This is one of the page elements with the greatest SEO relevance for a good positioning in search engines. The title should describe the page content effectively and briefly. It should not be duplicated (no other pages should have the same title) and should not be excessively long, to avoid its truncation in the SERP (use the tool "SERP preview" to verify the title is shown in its entirety).
In the past it was common to place the main keyword among its first words; today synonyms are also correctly interpreted by search engines to categorize a page. Keep in mind that today's search engines are much better than in the past at understanding a page's semantic content, so ensure your titles are really aligned with the page contents.
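The two checks mentioned above (duplicated titles and excessively long titles) can be sketched as follows. This is an illustration only: the function name audit_titles is hypothetical, and the max_length of 60 characters is an arbitrary assumption, since actual SERP truncation depends on the rendered width rather than on a character count.

```python
from collections import defaultdict


def audit_titles(pages, max_length=60):
    """pages maps URL -> title. Return (duplicates, too_long).

    duplicates maps a normalized title to the sorted URLs sharing it.
    too_long is the sorted list of URLs whose title exceeds max_length characters.
    """
    by_title = defaultdict(list)
    for url, title in pages.items():
        by_title[title.strip().lower()].append(url)
    duplicates = {t: sorted(urls) for t, urls in by_title.items() if len(urls) > 1}
    too_long = sorted(url for url, title in pages.items()
                      if len(title.strip()) > max_length)
    return duplicates, too_long


# Hypothetical crawl data, invented for this example.
pages = {
    "/": "Acme Tools",
    "/home": "Acme tools ",
    "/guide": "A" * 75,
}
duplicates, too_long = audit_titles(pages)
print(duplicates)   # {'acme tools': ['/', '/home']}
print(too_long)     # ['/guide']
```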
The description snippet suggested to be shown in the SERP.
It is specified within the HTML head section using the meta tag with attributes name="description" and content (the latter holding the description text).
It should be attractive in order to increase the CTR (Click-Through Rate), the probability for a user to click on the link in SERP to visit the page.
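As a rough illustration of where the title and the description come from, the sketch below reads both from a page's HTML using only the Python standard library. It assumes the standard HTML convention of a meta tag with name="description" and a content attribute; the class name HeadInfoParser and the function name read_head_info are hypothetical, and this is not how Visual SEO Studio itself parses pages.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the text of the title tag and the first meta description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.description is None:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_head_info(html):
    """Return (title, description); description is None when the tag is absent."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description


page = ('<html><head><title> My page </title>'
        '<meta name="description" content="A short summary."></head>'
        '<body></body></html>')
print(read_head_info(page))   # ('My page', 'A short summary.')
```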