Manual: Crawl a Site
The feature "Crawl a Site" of Visual SEO Studio, documented in detail.
Crawl a Site
Having the Visual SEO Studio spider visit a website is straightforward:
It just takes entering a Start URL and clicking the button.
Start URL
Insert here the address from which you want the spider to start visiting the website.
Most of the time you will use the website Home Page URL, which usually is the "root" address (e.g. https://www.example.com/), but you could also decide to start from another page.
If you do not specify a protocol (http:// or https://), then http:// will be assumed.
Session Name
You can give your crawl session an optional descriptive name for your own convenience. You can also add it or change it at a later time.
Show/Hide options
Most of the time the Start URL will be the only parameter you need to specify. Sometimes, however, a website will require special treatment.
Clicking on the Show options link will expand the window to let you access a rich set of additional crawl parameters.
Crawl Settings
This first tab sheet lets you set the most general crawl options. Default values are designed to cover the most common needs, point out road-blocking website errors as soon as possible, and keep crawl performance at its best.
Maximum Crawl Depth
The Maximum Crawl Depth is how deep in the website link structure you want the spider to go.
For some websites with many levels of paginated content you may want to increase this value.
The reason this parameter exists, instead of assuming an infinite depth, is the existence of so-called "spider traps":
Some are intentional, some are not. Take the classic example of the "infinite calendar" you can find on many blog sites: each day of the calendar is a link to a virtual page, and there are links to go to the next month... forever! A web crawler would never stop visiting such a site without employing some limitation like a maximum crawl depth or a maximum number of visitable pages.
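To make the depth limit concrete, here is a minimal, illustrative Python sketch of a depth-limited crawler (it is not the Visual SEO Studio engine; the start URL and the limits are placeholder values):

    # Illustrative sketch: a breadth-first crawl that stops at a maximum depth
    # and a maximum number of pages, so "infinite calendar" traps cannot run forever.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(start_url, max_depth=8, max_pages=500):
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue and len(seen) <= max_pages:
            url, depth = queue.popleft()
            if depth >= max_depth:      # the "Maximum Crawl Depth" guard
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        return seen

    print(len(crawl("https://www.example.com/")))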
Maximum number of pages/images
The maximum number of contents you want the spider to download. The default is the maximum number permitted by the Visual SEO Studio edition you are using. For example, the free Community Edition permits a maximum of 500 pages and images.
Only pages and images count; other files like robots.txt, XML Sitemap files or other assets are not taken into account. HTTP redirections do not count either.
Maximum Download Size per URL (KB)
The maximum size tolerated for web pages to download. Pages exceeding that size will be truncated.
A truncated page may impair the crawl process: links in the HTML content after the truncation point would not be found and followed. It is not that rare to find sites whose pages are - due to some error in the web server configuration - so bloated with useless stuff before the actual content (e.g. tonnes of scripts and CSS in the HTML head section of some badly configured WP sites, or a huge ViewState at the beginning of the HTML body section in badly conceived old ASP.NET WebForms pages) that no links at all can be found before the truncation point. Only the Home Page would be visited and the crawl session would end. This is exactly one of the cases covered in our troubleshooting FAQ.
In such cases you might want to increase the parameter limit.
We recommend keeping the default limit (512 KB) and changing it only when really needed.
There are reasons for the default limit to exist:
- Increasing the limit also increases the program memory consumption during the crawl process.
Visual SEO Studio uses memory check-points during the crawl to prevent crashes due to limitations in the available computer RAM: every few thousand pages visited, available memory is checked to see whether it can accommodate the next task; if available memory is not enough, the crawl session ends. We are proud to say that Visual SEO Studio is extremely robust against memory shortages.
Increasing the page size limit would also increase the minimum amount of free memory the program requires, increasing the probability of the crawl process stopping before all pages are visited, even when the actual page sizes are not really that big.
- While Google can be very tolerant and is able to download without truncation pages even up to 15-16 MB in size, it doesn't mean that big HTML pages are good for search engines:
Excessively big pages take long to render in the browser, and are a bad user experience. Search engines tend to demote them in ranking.
We recommend keeping the default truncation limit so as to detect size problems as soon as possible.
When you need to increase the limit in order to complete a site exploration, we recommend also auditing the page size by using the Performance suggestions feature. When excessive page size is a common trait among all website pages, it is normally caused by a bad configuration on the server side or in the main template. You fix it there, and you fix it everywhere.
- A limit, however high or low, should exist to avoid so-called spider-traps based on infinite-size downloads, conceived by malicious sites to crash web bots by exhausting their RAM work memory.
That being said, we should also add that users of the free Community Edition are extremely unlikely to ever face memory-related problems, since they can only download up to 500 pages/images per crawl session. They can freely increase the limit without having to worry about the issue.
On the contrary, users of higher editions who need to crawl big websites with hundreds of thousands of URLs should be more aware of the impact that increasing the limit has on memory consumption. Better to fix the size issues before crawling the entire website.
Note: the limit only concerns web pages. For image files - when their download is enabled by the proper option - the spider applies a limit of 10 MB (which is really high, never keep web images that heavy!) and for XML Sitemaps the limit is the 50 MB limit of the Sitemap protocol.
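As an illustration of what a per-URL download cap means in practice, here is a minimal Python sketch (it assumes the third-party requests library; the 512 KB value mirrors the default described above, and the URL is a placeholder):

    # Illustrative sketch: download a page but keep at most max_bytes of its body.
    import requests

    MAX_BYTES = 512 * 1024  # 512 KB, the default truncation limit

    def fetch_truncated(url, max_bytes=MAX_BYTES):
        body = b""
        with requests.get(url, stream=True, timeout=10) as resp:
            for chunk in resp.iter_content(chunk_size=16 * 1024):
                body += chunk
                if len(body) >= max_bytes:
                    body = body[:max_bytes]  # links after this point are lost
                    break
        return resp.status_code, body

    status, html = fetch_truncated("https://www.example.com/")
    print(status, len(html))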
Maximum Number of Redirects to Follow
The maximum number of chained HTTP redirections (HTTP 30x response codes) that the spider will follow. Default and maximum possible value is 5, which is the limit used by Google's crawler, "googlebot".
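The behavior can be sketched as follows (an illustrative Python example using the requests library; the limit of 5 mirrors the default, and the URL is a placeholder):

    # Illustrative sketch: follow HTTP 30x redirects manually, up to a maximum chain length.
    import requests
    from urllib.parse import urljoin

    def follow_redirects(url, max_redirects=5):
        chain = [url]
        for _ in range(max_redirects):
            resp = requests.get(url, allow_redirects=False, timeout=10)
            if not resp.is_redirect:          # not a 30x response with a Location header
                return resp, chain
            url = urljoin(url, resp.headers["Location"])
            chain.append(url)
        raise RuntimeError("Too many chained redirects: " + " -> ".join(chain))

    response, chain = follow_redirects("https://www.example.com/old-page")
    print(response.status_code, chain)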
Cross HTTP/HTTPS boundaries
When the spider visits the website HTTPS version, and finds links toward the HTTP version - or vice versa - should it follow those links?
Default value is "true", because you normally do want to understand if there are old links pointing to an old HTTP version. If after a migration from HTTP to HTTPS the correct 301 redirects have been put in place, you will see redirections and will have to update the old links; if not, you will find what a search engine could consider duplicate content, and will have to set the proper 301 redirects in addition to fixing the internal links.
Crawl sub-domains
This option, which defaults to "true", tells the spider whether to follow links pointing to internal pages located in a different sub-domain.
For example, if the spider, while visiting a page in www.example.com, finds a link pointing to a page in blog.example.com, should it follow it or not?
The choice only affects internal links, i.e. links pointing to pages on the same site, but a different sub-domain. Visual SEO Studio is already discarding external links from the normal crawl process, and is able to discriminate whether a sub-domain is part of a website or belongs to a different entity (e.g. it can understand that blog.example.com and forum.example.com belong to the same website, while site1.blogger.com and site2.blogger.com do not).
Unchecking this option is a common solution when you want to audit just a section of the website defined within a specific sub-domain (or within the main "naked domain", e.g. example.com). Keep in mind that resources within the same sub-domain but not linked by any page visited by the spider cannot be found.
Crawl also outside of Start Folder
When the Start URL is not the root address but a resource within a subfolder, unchecking this option tells the spider not to follow internal links pointing outside of the subfolder.
This is a common solution when you want to audit just a section of the website defined within a specific directory. Keep in mind that resources within the same subfolder but not linked by any page visited by the spider cannot be found.
This option is enabled only when the Crawl sub-domains option is not selected; that's why it appears slightly indented below it.
Crawl external links
Tells whether the spider should also visit external URLs found in internal links. Default value is "true".
Only the linked external pages will be visited; the spider will not go deeper. The main purpose of the option is enabling you to find broken external links.
Redirected URLs will be followed. External pages will not be taken into account in most analysis reports (Custom Filters is the exception).
Crawl images
Tells whether the spider should also visit internal image resources found via the IMG tag. Default value is "true".
Unless the Save images flag is checked, the program will download the image file, check its byte size, and its width and height in pixels, but will not store it locally, so you will not be able to preview the images in the Content right pane.
Recognized image formats are: JPEG, GIF, PNG, BMP, TIFF, WebP and SVG.
Save images
Only enabled if the Crawl images option is selected (that's why it appears slightly indented below it), this option tells the program to actually store the downloaded images locally in addition to examining them. By doing this, you will also be able to preview the images in the Content right pane.
Images exceeding the maximum file size of 10 MB will not be stored.
Default value is "false", because for large sites with many images, also storing the image data could dramatically increase the disk space required (users of the free Community Edition are less likely to be bothered by disk space consumption, since they can only crawl up to 500 pages/images per crawl session).
Recognized image formats are: JPEG, GIF, PNG, BMP, TIFF, WebP and SVG.
Use this Accept-Language HTTP header
Normally search engine crawlers do not add to their HTTP requests the Accept-Language HTTP header, an optional header used by browsers to tell the server the user's preferred language; it usually is the language used by the browser user interface, but it can be customized.
There are websites that wrongly assume every visitor sends one, and return an HTTP 500 error to search engine spiders, which in turn are unable to index even their Home Page. Or they are badly configured to redirect visitors based on the user's language using a permanent 301 redirect (they should use a temporary redirect, like 302 or 307). In order to audit these websites, until their problems are fixed, you can configure the HTTP header with your language of choice.
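In HTTP terms, configuring this option simply adds the header to every page request; a minimal Python sketch (using the requests library; the URL and the language value are placeholders):

    # Illustrative sketch: send an explicit Accept-Language header with the request.
    import requests

    headers = {"Accept-Language": "en-US,en;q=0.8"}  # your language of choice
    resp = requests.get("https://www.example.com/", headers=headers, timeout=10)
    print(resp.status_code)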
Crawl Speed
This tab sheet holds all options that can slow down or speed up the visit of a web site by Visual SEO Studio spider.
Visual SEO Studio spider is extremely good at visiting websites going as fast as possible without ever overloading the web server.
Its engine is adaptive: it continually monitors the server response time and backs off if the server needs more time. There is no point in trying to go faster if the web server cannot keep up with the pace: you could slow it down or even crash it. Imagine having to audit a production e-commerce website, visited by hundreds of users at the same time who want to make their purchases... and seeing them leave (and not buy) because the SEO guy was crawling the site, overloading it and making the visit a terrible user experience! That will never happen with Visual SEO Studio.
There are cases though where you might need to alter the default spider behavior. This is the place to do it.
Force a Courtesy-Delay (secs)
The crawler is adaptive and never overloads a site, yet you could occasionally find websites that identify the bot as a potential resource waste and after a while return error codes (e.g. HTTP 403 or similar) to block it.
In such cases you can use this option to keep a lower footprint by setting a courtesy-delay between each HTTP request (note that no parallel HTTP calls will be done then, only the processing phase will be concurrent).
For non-verified sites, if (and only if) a Crawl-Delay is set in the robots.txt file, it will be respected up to 2 secs.
For verified sites, you can override completely the robots.txt Crawl-Delay directive by forcing it to zero or any other value.
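What a courtesy delay amounts to can be sketched like this (an illustrative Python example; the 2-second delay and the URLs are placeholders):

    # Illustrative sketch: serialize requests and wait a courtesy delay between them.
    import time
    import requests

    COURTESY_DELAY = 2.0  # seconds between consecutive HTTP requests

    def polite_fetch(urls, delay=COURTESY_DELAY):
        for url in urls:
            resp = requests.get(url, timeout=10)
            yield url, resp.status_code
            time.sleep(delay)  # no parallel calls: one request, then wait

    for url, status in polite_fetch(["https://www.example.com/",
                                     "https://www.example.com/about"]):
        print(status, url)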
Maximum number of concurrent connections
SEO spiders try to speed up website visits by using multiple concurrent HTTP connections, i.e. requesting more web pages at the same time.
Visual SEO Studio does the same, even if its adaptive crawl engine can decide to push less when it detects that the web server would get overloaded.
This control lets you tell the spider how much harder it may push if the web server keeps responding fast.
The Visual SEO Studio edition and whether the website is among the Verified Websites can influence the ability of the spider to crawl faster:
For verified sites you can set up to 32 concurrent connections. For non-verified sites, the maximum limit is 5.
The free Community Edition can only use 2 concurrent connections at most.
Warning: increasing the number of threads could slow down or hang the server if it cannot keep up with the requests; do it at your own risk (that's why you can force more on verified sites only).
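The idea of a cap on concurrent connections can be sketched with a simple thread pool (an illustrative Python example, not the program's actual engine; the limit and the URLs are placeholders):

    # Illustrative sketch: fetch several URLs in parallel, but never with more
    # than max_connections simultaneous requests.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch(url):
        return url, requests.get(url, timeout=10).status_code

    def crawl_batch(urls, max_connections=5):
        with ThreadPoolExecutor(max_workers=max_connections) as pool:
            return list(pool.map(fetch, urls))

    for url, status in crawl_batch(["https://www.example.com/",
                                    "https://www.example.com/contact"]):
        print(status, url)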
Advanced Settings
This tab sheet holds all advanced options that cannot be grouped into a more specific category.
Custom 'Disallow' paths (e.g. /not-here/)
If you want to exclude multiple directories of the website from the exploration, or pages with some particular querystring parameters or path pattern, you can add custom Disallow rules here.
The accepted syntax is the one you'd use for robots.txt files - just the path, the "Disallow:" part is implicit.
You can also use the '*' (designates 0 or more instances of any valid character) and '$' (designates the end of the URL) wildcards.
You can insert multiple lines.
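For example, you could exclude a directory, every URL with a given querystring parameter, and every PDF file with rules such as /not-here/, /*?sort=* and /*.pdf$. As a rough illustration of how such patterns can be matched against URL paths, here is a Python sketch that translates them into regular expressions (one possible interpretation of the wildcards, not necessarily how the program implements them):

    # Illustrative sketch: translate robots.txt-style Disallow patterns
    # ('*' and '$' wildcards) into regular expressions and test URL paths.
    import re

    def pattern_to_regex(pattern):
        regex = re.escape(pattern).replace(r"\*", ".*")
        if regex.endswith(r"\$"):
            regex = regex[:-2] + "$"   # '$' anchors the end of the URL
        return re.compile("^" + regex)

    rules = ["/not-here/", "/*?sort=*", "/*.pdf$"]
    compiled = [pattern_to_regex(rule) for rule in rules]

    for path in ["/not-here/page.html", "/shop/?sort=price",
                 "/docs/manual.pdf", "/docs/manual.pdf?x=1"]:
        disallowed = any(rx.match(path) for rx in compiled)
        print(path, "->", "disallowed" if disallowed else "allowed")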
Use HTTP compression if available
The HTTP protocol permits clients - browsers or bots - to specify that they are able to accept compressed content served by the web server. It makes transmission faster. This is the default behavior for the Visual SEO Studio spider as well, supporting both GZip and Deflate compression methods.
In some extremely rare cases a buggy web server could send badly compressed content; in that instance you can still audit the website by deselecting the option.
Note: as we said, the cases when you need to deselect this option are extremely rare. Even if an old web server did not support HTTP compression, it would normally respond with an HTTP 406 ("Not Acceptable") response code, and the Visual SEO Studio spider would from there on automatically continue the exploration without compression.
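In HTTP terms, the option controls whether the request advertises compressed encodings to the server; a minimal Python sketch with the requests library (which decompresses gzip/deflate bodies transparently; the URL is a placeholder):

    # Illustrative sketch: request a page with and without advertising HTTP compression.
    import requests

    url = "https://www.example.com/"

    # Compression enabled: the client declares it accepts gzip or deflate bodies.
    compressed = requests.get(url, headers={"Accept-Encoding": "gzip, deflate"}, timeout=10)

    # Compression disabled: ask the server for an uncompressed (identity) body.
    uncompressed = requests.get(url, headers={"Accept-Encoding": "identity"}, timeout=10)

    print(compressed.headers.get("Content-Encoding"),
          uncompressed.headers.get("Content-Encoding"))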
Ignore upper/lower case differences in URLs
This option is not selected by default. URL paths are case-sensitive; this means two URLs whose paths differ only in the casing of one or more characters point to two distinct resources. It is a web standard, search engines comply with it, and you should almost always leave the option unchecked.
So why is the option there?
Some web servers - most notably MS IIS - treat URL paths as case-insensitive. Internal links with wrong casing could lead search engines to see duplicate content. It still is a good thing that Visual SEO Studio does the same: it can detect the duplicate content, report it and help you fix the issue.
Sometimes though you are not the one in charge of fixing the problem. In your role as SEO you might have reported the issue to the dev team, and while still waiting for the fix you might want to temporarily ignore the casing issue to investigate other problems. This is likely the only case when you may want to check the option.
Accept session Cookies
This option is selected by default. This is a different behavior from search engines bots - which usually do not accept cookies - and many other SEO spider products.
We recommend keeping the option selected, and unchecking it only if you need to investigate unexpected website behaviors.
Why does the Visual SEO Studio spider accept session cookies by default if search engine bots don't?
There is an extremely good reason to do it, and that's why we recommend to keep the option selected.
Let's first refresh what session cookies are all about:
Web servers use "server sessions", a memory space allotted for each visitor to manage their current state (for example their cart items in case of an e-commerce website). Since the Web is based on HTTP - a "stateless" transport protocol - in order to distinguish HTTP calls coming from different users, the first time a visitor (typically with a browser) requests a page, the server assigns them a unique identifier inside a session cookie. From that moment on the browser will make new page requests adding the session cookie, so that the web server can understand the visitor is not a new one and refer to their stored state.
Search engine bots do not use session cookies, thus each search bot request of a web page causes the web server to allocate an amount of memory as if it were a new visitor. This amount of memory on the web server computer is not released until the server session "expires", typically after a period of about 20 minutes during which no new requests with the assigned session cookie arrive.
This is not always such a big issue because search engine bots tend not to hammer the web server with continuous requests.
SEO spiders are a different beast though. They want to crawl a website fast, and can make tens or hundreds of concurrent web requests for extended periods of time. Modern web servers can handle many concurrent requests at a time, but only for short peaks.
Let's do some basic maths: suppose an SEO spider - with no cookie support - visits a website at a rate of 50 pages per second. In 20 minutes that makes 60,000 web requests, each of them allocating a new session memory space on the web server.
Some Apache web servers used to allot at least 7 MB of RAM for each new session. Multiplied by 60,000, that is about 410 GB of server memory dedicated to a single SEO spider which is already consuming bandwidth and slowing down the e-commerce site without buying anything. That memory consumption would likely crash many web servers. That's the reason why e-commerce administrators hate SEO spiders visiting their sites, especially if not authorized!
Visual SEO Studio is different. By supporting session cookies, its memory footprint on the web server is that of a single user. Add to that an adaptive crawl engine that never overloads the server or slows it down, and you can understand why Visual SEO Studio is the best SEO tool to audit live, high-traffic websites.
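The difference can be sketched with the requests library: a cookie-aware session sends back the session cookie it received, so the server sees a single returning visitor, while cookie-less requests look like a new visitor every time (an illustrative Python example; the URL is a placeholder):

    # Illustrative sketch: cookie-less requests vs. a cookie-aware session.
    import requests

    url = "https://www.example.com/"

    # Cookie-less: every request looks like a brand new visitor, so the server
    # may allocate a new session (and its memory) for each one.
    for _ in range(3):
        requests.get(url, timeout=10)

    # Cookie-aware: the Set-Cookie received with the first response is sent back
    # on every following request, so the server keeps a single session.
    with requests.Session() as session:
        for _ in range(3):
            session.get(url, timeout=10)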
Add 'Referrer' HTTP header
The HTTP "Referer" header (note the historical typo made by who originally wrote the HTTP protocol, with a single 'r') tells the web server where the requested URL is coming from. Typically a browser decorates it with the source page when a user clicks on a link in it.
Visual SEO Studio adds it as a courtesy to the webmaster: in case of 404 (page not found), a webmaster by checking the server logs could understand where the broken link was found.
This is not necessary for the program functioning, and could be deselected safely.
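In practice the header is simply added to each request, valued with the URL of the page where the link was found (an illustrative Python sketch with the requests library; both URLs are placeholders):

    # Illustrative sketch: tell the server which page linked to the requested URL.
    import requests

    linking_page = "https://www.example.com/blog/post-1"
    linked_url = "https://www.example.com/old-page"   # possibly a broken link

    resp = requests.get(linked_url, headers={"Referer": linking_page}, timeout=10)
    if resp.status_code == 404:
        # A webmaster reading the server logs can now see where the broken link lives.
        print("Broken link to", linked_url, "found on", linking_page)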
Download content of pages with HTTP error codes
When auditing page contents from an SEO perspective, you normally worry only about pages returning a 200 OK status code, because only such pages can be indexed by search engines.
Nevertheless there are various reasons to analyze the others anyway: checking for the Analytics tracking code, checking whether the page is user-friendly, and so on...
We suggest disabling the option only in those rare cases when you need to explore huge sites with many HTTP errors and you need to save as much disk space as possible.
For robots.txt file, treat all HTTP 4xx status codes as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
Google made the despicable choice to treat some status codes such as 401 "Unauthorized" and 403 "Forbidden" as "allow everything" as well, even if semantically they would mean the contrary!
In order to be able to reproduce Google behavior we added this option, which by default is not selected.
For robots.txt file, treat a redirection to / as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately it is not a rare setting to redirect a missing file to the root address (i.e. to the Home Page), with the generic rule applying to a missing robots.txt as well. It is a disputable practice (Google for example treats generic redirections to the Home Page as "soft 404s"), but common enough that Google made the choice to tolerate this specific case, interpreting it like a 404 (besides, this way the webmaster's intention is respected).
In order to be able to reproduce Google behavior we added this option, which by default is not selected.
For robots.txt file, treat a redirection to [other domain]/robots.txt as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately it is a common scenario, in cases like an HTTP->HTTPS migration or a domain name change, to redirect everything from the old to the new version, robots.txt included.
To permit auditing a site after an HTTP to HTTPS migration when the given Start URL uses the http:// protocol (or the protocol is not specified and http:// is assumed), we added this option, which is selected by default.
Verified websites
This tab sheet permits setting crawl options that would break the code of conduct for polite bots (identify yourself, respect the robots exclusion protocol, do not hog web server resources...).
Since it would be unethical to use them on someone else's website without consent, they are available only for websites you can demonstrate you administer. Then you will be able to set options such as overriding the directives set in the robots.txt file, spoofing the user-agent, and using more parallel threads to download the site resources.
To enable these options you just have to verify your site; it only takes a few clicks.
You can demonstrate to be a site administrator by using "Google Search Console" credentials, or a "Bing webmaster tools" API key, or using the native Visual SEO Studio verification.
For local development web servers running on localhost or 127.0.0.1 you are automatically considered an administrator and the options will always be enabled.
You can learn more about Verified Sites at the page Managing the Verified Sites list.
Verified websites list...
This button opens the Verified Websites list, where you can manage the websites you have verified.
You can learn more about Verified Sites at the page Managing the Verified Sites list.
Ignore robots.txt 'Disallow' directives
Selecting this option will make the spider ignore the Disallow: directives read in the robots.txt file that would normally prevent visiting some website paths.
Ignore 'nofollow' meta directives
Selecting this option will make the spider ignore the nofollow directive read in the robots meta tag (or in the bot-specific meta tag, or in the equivalent X-Robots-Tag HTTP header) that would normally prevent following the links found in the page containing the directive.
Ignore rel="nofollow" attributes
Selecting this option will make the spider ignore the nofollow value read in the rel attribute - if present - of a link, which would normally prevent following that link.
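To make the three kinds of directive concrete, here is a small Python sketch that checks for them in a fetched page (illustrative only; the HTML snippet and the header value are made up):

    # Illustrative sketch: detect the three kinds of "nofollow" signals on a page.
    from html.parser import HTMLParser

    class NofollowChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.page_nofollow = False   # from <meta name="robots" content="... nofollow ...">
            self.nofollow_links = []     # from <a rel="nofollow" href="...">

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                if "nofollow" in (attrs.get("content") or "").lower():
                    self.page_nofollow = True
            if tag == "a" and "nofollow" in (attrs.get("rel") or "").lower():
                self.nofollow_links.append(attrs.get("href"))

    html = '<meta name="robots" content="index, nofollow"><a rel="nofollow" href="/promo">x</a>'
    x_robots_tag = "nofollow"   # value of an X-Robots-Tag HTTP response header, if present

    checker = NofollowChecker()
    checker.feed(html)
    print("meta nofollow:", checker.page_nofollow)
    print("header nofollow:", "nofollow" in x_robots_tag.lower())
    print("nofollow links:", checker.nofollow_links)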
Use this User-Agent
This drop-down combo box permits you to choose the User-Agent the spider will use to identify itself when visiting a website's pages.
Available options are the user-agents of the most famous search engines - desktop, mobile and image bot versions - and the ones of the most popular web browsers.
When you are exploring a website not listed within the Verified Sites list, the default Visual SEO Studio native user-agent Pigafetta will be used.
HTTP Authentication
When developing a new website, it is quite common to need to publish it online to a restricted audience - e.g. the paying client or other stakeholders - to give them a preview of the work.
In such a case, you normally don't want search engines to index it. There are several methods to prevent it; the recommended way is to restrict access using HTTP authentication.
This tab sheet permits you to specify how the spider will authenticate itself when crawling an HTTP-authentication protected website.
Available options are:
- None
This is the default value, used to explore websites not restricted by HTTP authentication.
Attempting to visit an HTTP-authentication protected page would result in an HTTP 403 status code server response.
- Use currently logged-in user's credentials
Using this option will make the spider use the network credentials currently in use on the computer it is running on.
The program will negotiate with the web server to determine the authentication scheme. If both client and server support Kerberos, it will be used; otherwise, NTLM will be used.
- Specify credentials
The program will use the credentials provided in the fields below, enabled when the option is selected.
You can also specify the authentication schemes to use.
User Name
The network user name to use (e.g. "name" or "domain\name").
Password
The password of the network user.
Basic
Uses the Basic authentication scheme (warning: password in clear, unsafe without SSL).
Basic authentication sends the password across the wire in plain text. That's okay for a secure connection, such as one using HTTPS, and for situations where you don't need much security.
Digest
Uses the Digest authentication scheme.
Digest authentication hashes the password along with other data from the server before sending a response over the wire. It's a significant step up from Basic in terms of security.
Kerberos / NTLM
The program will negotiate with the web server to determine the authentication scheme. If both client and server support Kerberos, it will be used; otherwise, NTLM will be used.
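As a rough illustration of what Basic and Digest authentication look like from the client side, here is a Python sketch using the requests library (the URL and credentials are placeholders):

    # Illustrative sketch: authenticate requests with HTTP Basic or Digest credentials.
    import requests
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    url = "https://staging.example.com/"       # an HTTP-authentication protected preview site
    credentials = ("domain\\name", "secret")   # user name and password placeholders

    basic_resp = requests.get(url, auth=HTTPBasicAuth(*credentials), timeout=10)
    digest_resp = requests.get(url, auth=HTTPDigestAuth(*credentials), timeout=10)

    print(basic_resp.status_code, digest_resp.status_code)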