Manual: Crawl View

The feature "Crawl View" of Visual SEO Studio, documented in detail.

Crawl View

The Crawl View visually shows a site's crawl paths, the natural exploration paths a search engine spider takes when visiting a website starting from the home page.
The view is essentially the same one you can see during the crawl process, but applied to data already crawled and stored.

The view consists of a main tree view plus auxiliary sheets in the right and bottom panes. Auxiliary sheets have dedicated help pages, accessible by simply clicking on them.

What follows is a detailed description of all the view's controls and fields. You can also learn more by reading the Crawl View, see your site Crawl Paths page.

Head info and tools

View switcher

All views share a view switcher at the top of the window, to quickly switch from one data representation to another.
Clicking on its buttons will open the corresponding window, or select it if already open.

  • Manage Sessions
  • Crawl View
  • Folder View
  • Tabular View

Start URL

The address from which the spider started visiting the website. At the start of a new exploration you will typically enter the website Home Page, usually the "root" address. For explorations of a list of URLs the field is not populated.

Session name

You can give your sessions an optional descriptive name. The name can be assigned when choosing the crawl parameters, or at a later time.

Find Pages...

Clicking on the Find Pages... button pops up a dialog window that lets you search for pages starting from a text string.

Expand/Collapse nodes

The tree view nodes can be expanded or collapsed at will: each node individually, by clicking on the +/- symbol beside it, or in groups.
For the latter case, three controls are at your disposal:

  • Clicking on the Expand All button expands all nodes. All nodes expanded is the default state.
  • Clicking on the Collapse All button collapses all nodes, leaving only the root nodes visible.
  • Using the Expand up to level numeric up/down control, the tree view expands only up to the desired level.

Show Crawl Progressive

Enables/disables the visibility of the Crawl progressive column.
The Crawl progressive number can be really helpful in some cases, but in daily usage you usually want it hidden, to focus more on things such as status codes, titles and descriptions; for this reason it is hidden by default.

Legend

Clicking on this link will expand a Legend detailing the meaning of icons and colors used.
For a more comprehensive explanation, read Understanding colors used for URLs.

Context menu

Right-clicking on an item will pop up a contextual menu:

Crawl View context menu

Context menu command items are:

  • Go to Referrer URL
    Selects in the main view the node related to the "Referrer" URL, i.e. the address where the spider found the link to the resource.
  • Copy URL
Copies the URL of the selected resource to the clipboard.
  • Browse URL
Opens the URL of the selected resource in the default browser.
  • Take Screenshot...
Opens the Take a Screenshot dialog window, which lets you preview the page, choose the desired resolution, and take a full-height screenshot of the web page.
  • Screenshot History
    Opens the Screenshot History view to show all screenshots taken for the selected resource over time.
  • Find pages linking to the URL
    Opens in a new Tabular View all pages linking to the resource.
  • Find all links to the URL
    Opens the Links Inspector to locate all links pointing to the resource.
  • Find referrer link to the URL
Selects the DOM view in the right pane and highlights there the HTML node where the spider found the link to the resource.

Column headers

Path

This column holds a tree view of the website link structure, the Crawl View described above.
Each node's text is the resource path (URL-decoded, thus in human-readable form).
For root nodes, the text is the website authority name (the combination of protocol, host name and, if different from the default value, port number).

Crawl progressive

Indicates the progressive order number assigned to the resource during the crawler's exploration.

Thanks to this progressive number you can get an idea of how a search engine spider would explore your website, a piece of information you should take into account when dealing with Crawl Budget issues, typical of large websites.
For example, you may realize the spider takes exploration paths towards content areas you deem less important than the ones you consider more strategic; in such a case you should intervene on the website link structure.

Note: the crawl progressive number is an approximation.
Visual SEO Studio uses an exploration pattern called breadth-first, which has been shown to be the most efficient at finding important content in the absence of external signals; the actual exploration order can change slightly because of the parallelization used for speed during the crawl process. Using a single crawl thread you could make it strictly repeatable.
Search engines' exploration patterns are, for their part, highly asynchronous, and exploration priority is weighted - in Google's case - by the resources' PageRank, which can be inflated by external links.
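
To illustrate what breadth-first order means in practice, here is a minimal single-threaded Python sketch (not Visual SEO Studio's actual code) computing the visit order over a small, purely hypothetical link graph; as noted above, the real crawler parallelizes requests, so the actual progressive numbers can differ slightly.

  from collections import deque

  def crawl_order(link_graph, start_url):
      # Breadth-first visit order: the start URL first, then everything it
      # links to, then everything those pages link to, and so on.
      queue = deque([start_url])
      seen = {start_url}
      order = []
      while queue:
          url = queue.popleft()
          order.append(url)
          for linked_url in link_graph.get(url, []):
              if linked_url not in seen:
                  seen.add(linked_url)
                  queue.append(linked_url)
      return order

  # Hypothetical site structure, starting from the home page:
  site = {
      "/": ["/products", "/blog"],
      "/products": ["/products/a", "/products/b"],
      "/blog": ["/blog/post-1"],
  }
  print(crawl_order(site, "/"))
  # ['/', '/products', '/blog', '/products/a', '/products/b', '/blog/post-1']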

Truncated

States whether the size of the resource exceeded the download limit.

The maximum amount downloaded for a resource can be customized before crawling a website using the option "Maximum Download Size per URL (KB)". Note that a limit is necessary to avoid so-called "spider traps".

Possible values are:

  • Blank, when the spider has downloaded the resource completely
  • when the spider could NOT download the resource completely
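
As an illustration of the kind of per-URL cap this option imposes, the following Python sketch (not the program's actual implementation) downloads at most a fixed number of bytes and reports whether the body was cut short; the 512 KB figure and the URL are placeholders.

  import urllib.request

  MAX_BYTES = 512 * 1024  # hypothetical cap: 512 KB per URL

  def fetch_limited(url, limit=MAX_BYTES):
      # Read at most `limit` bytes; if anything is left afterwards,
      # the resource was larger than the cap and got truncated.
      with urllib.request.urlopen(url) as response:
          body = response.read(limit)
          truncated = response.read(1) != b""
      return body, truncated

  # Example usage:
  # content, was_truncated = fetch_limited("https://www.example.com/")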

HTTP Status Code

The HTTP response code received from the web server upon requesting the resource.

Response codes can be summarized in five standard classes:

  • 1xx Informational response – the request was received and its processing is ongoing (it is very unlikely you will ever see a 1xx response code)
  • 2xx Success – the request was received, understood, accepted and served (it is the response class you normally want to see)
  • 3xx Redirection – the requested resource is no longer at the address used
  • 4xx Client Error – the request has a syntax error or cannot be honored
  • 5xx Server Error – the web server was unable to honor an apparently valid request

Some very common responses are, for example, 200 (OK, the standard response for HTTP requests successfully served) and 301 (Moved Permanently, used when a page URL has changed and you don't want to "break" external links to the old URL, lose the page's indexation on search engines, or give up its PageRank).

Redirects work as follows: when an old URL is requested, the web server answers the client (a browser, or a search engine spider) with a 3xx HTTP code to report that the address has changed, adding the new address in the HTTP headers. The browser then has to request the resource at the new address with a second HTTP call; in case of a permanent redirect, it may remember the redirection for the future, in order to avoid making a double call when the link to the old address is clicked again.
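
To make the exchange concrete, here is a small Python sketch using the third-party requests library (the URLs are hypothetical) that follows a redirect chain and prints every intermediate 3xx answer together with the final address reached.

  import requests  # third-party HTTP client, assumed installed

  # Hypothetical old address the server answers with a 301 redirect.
  response = requests.get("https://www.example.com/old-page")  # GET follows redirects by default

  for hop in response.history:  # each intermediate 3xx response
      print(hop.status_code, hop.headers.get("Location"))
  print(response.status_code, response.url)  # final status and address reached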

Redirects can be implemented on the server side using several methods, depending on the technology used and the platform the web server is running on: for example, by configuring the .htaccess file on Apache web servers with generic or specific rules; with dedicated plugins in a WordPress installation; or, in the case of ASP.NET websites, with rules expressed in the web.config file, directives set in the single page, or in the logic of the CMS engine used.
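
Purely to illustrate the server-side principle (independently of Apache, WordPress or ASP.NET), this minimal Python sketch answers a hypothetical old path with a 301 status and the new address in the Location header.

  from http.server import BaseHTTPRequestHandler, HTTPServer

  REDIRECTS = {"/old-page": "/new-page"}  # hypothetical old-to-new URL mapping

  class RedirectHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path in REDIRECTS:
              self.send_response(301)  # Moved Permanently
              self.send_header("Location", REDIRECTS[self.path])
              self.end_headers()
          else:
              self.send_response(200)
              self.send_header("Content-Type", "text/html; charset=utf-8")
              self.end_headers()
              self.wfile.write(b"<html><head><title>New page</title></head><body>OK</body></html>")

  if __name__ == "__main__":
      HTTPServer(("127.0.0.1", 8000), RedirectHandler).serve_forever()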

Having redirects is not an error per se, but if they are detected - as normally happens - during a normal site crawl navigating internal links, it is a sign that those internal links were not updated after the URLs changed. It is recommended to update the internal links with the new URLs so as not to slow down the user navigation experience and not to waste the crawl budget allotted by the search engine.

Particular attention should be given to the 4xx response codes, which Visual SEO Studio rightly reports as errors.
The 4xx codes you will stumble upon are usually 404 (Resource not found) and the nearly identical 410 (Resource no longer existing). Their presence is a symptom of a broken link that should be corrected, because users and search engines cannot reach the link's destination page.

5xx response codes are errors that occurred on the web server while it was trying to build the resource to return to the browser or the spider.
They could be a temporary issue, but they should normally not be ignored: it is better to report them to the developer and investigate on the server side. 5xx errors are a very bad user experience, make visitors abandon the website, and can potentially cause de-indexation by search engines if repeated over time.

For a more in-depth description of HTTP response codes you can consult the following page on Wikipedia: HTTP status codes

Title

The HTML page title, as read from the title HTML tag.

This is one of the page elements with the greatest SEO relevance for good positioning in search engines. The title should describe the page content efficiently and briefly. It should not be duplicated (no other page should have the same title) and should not be excessively long, to avoid its truncation in the SERP (use the "SERP preview" tool to verify the title is shown in its entirety).

In the past it was common to place the main keyword among the title's first words; today synonyms are also correctly interpreted by search engines to categorize a page. Keep in mind that today's search engines are much better than in the past at understanding a page's semantic content, so make sure your titles are really aligned with the page contents.
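
As a rough aid, the following Python sketch extracts the title from an HTML fragment with the standard html.parser module and prints its length; the markup and the often-quoted 60-character guideline are illustrative only, and the "SERP preview" tool remains the proper way to check truncation.

  from html.parser import HTMLParser

  class TitleParser(HTMLParser):
      def __init__(self):
          super().__init__()
          self.in_title = False
          self.title = ""

      def handle_starttag(self, tag, attrs):
          if tag == "title":
              self.in_title = True

      def handle_endtag(self, tag):
          if tag == "title":
              self.in_title = False

      def handle_data(self, data):
          if self.in_title:
              self.title += data

  parser = TitleParser()
  parser.feed("<html><head><title>Example page title</title></head></html>")
  # Flag titles much longer than roughly 60 characters as candidates for truncation.
  print(parser.title, len(parser.title))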

Meta Description

The description snippet suggested to be shown in the SERP.

It is specified within the HTML head section using the meta tag with attributes name="description" and content="...".
It should be attractive in order to increase the CTR (Click-Through Rate), i.e. the probability that a user clicks on the link in the SERP to visit the page.
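
Analogously to the title example above, this Python sketch extracts the meta description from a hypothetical HTML fragment and prints its length, purely for illustration.

  from html.parser import HTMLParser

  class MetaDescriptionParser(HTMLParser):
      def __init__(self):
          super().__init__()
          self.description = None

      def handle_starttag(self, tag, attrs):
          if tag == "meta":
              attributes = dict(attrs)
              if (attributes.get("name") or "").lower() == "description":
                  self.description = attributes.get("content") or ""

  parser = MetaDescriptionParser()
  parser.feed('<head><meta name="description" content="A short, enticing summary of the page."></head>')
  print(parser.description, len(parser.description or ""))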