Manual: Crawl a list of URLs
The feature "Crawl a list of URLs" of Visual SEO Studio, documented in detail.
Crawl a list of URLs
This feature lets you crawl URL lists from various domains in order to audit a site's backlink profile.
You can import backlink URLs from all major backlink intelligence providers; the program recognizes their proprietary CSV formats.
You can import URLs from multiple sources; the original lists will be merged and duplicates discarded.
To learn more about the feature, please read the Crawl URL Lists: off-site analysis page.
Add from Clipboard
URLs can be imported from the clipboard if you have copied text containing a list of URLs.
Clicking on the button will import the copied URLs. Text rows not recognized as URLs will be skipped.
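As an illustration only (not the program's actual implementation), here is a minimal Python sketch of this kind of filtering, assuming rows are separated by line breaks and only absolute http(s) URLs count as valid:

```python
from urllib.parse import urlparse

def extract_urls(clipboard_text):
    """Keep only the rows that parse as absolute http(s) URLs; skip everything else."""
    urls = []
    for row in clipboard_text.splitlines():
        candidate = row.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls

sample = "https://example.com/page-1\nnot a url\nhttp://example.org/page-2"
print(extract_urls(sample))  # ['https://example.com/page-1', 'http://example.org/page-2']
```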
Add from CSV file
You can import URLs from CSV files exported from Google Search Console, Bing Webmaster Tools, Yandex.Webmaster, and from all major backlink intelligence providers.
Clicking on the button will expand it to let you choose the desired CSV format, and then open a window to select the CSV file and preview the URLs to import.
If you are not sure about which CSV format you have, don't worry: you will be able to try the different import schemes and change the one to use before importing the URLs.
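Conceptually, an import scheme simply tells the program which column of the provider's CSV contains the URLs. The hypothetical Python sketch below assumes a column named "Source URL"; real provider formats differ, which is exactly why the preview lets you switch schemes before importing:

```python
import csv

def read_backlink_urls(csv_path, url_column="Source URL"):
    """Read one URL per row from the column named by the chosen import scheme."""
    urls = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = (row.get(url_column) or "").strip()
            if value:
                urls.append(value)
    return urls
```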
Explore also linked URLs on this domain
When crawling a list to audit a backlink profile, it's highly recommended to specify which domain the analysis will refer to.
This way the spider will also crawl the destination URLs when it finds links pointing to that domain.
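In essence the spider compares the host of each discovered link against the domain you specified. A rough sketch of such a check (a simple suffix match, not necessarily how the program implements it):

```python
from urllib.parse import urlparse

def points_to_domain(link_url, audited_domain):
    """True if the link's host is the audited domain or one of its sub-domains."""
    host = urlparse(link_url).netloc.lower()
    audited = audited_domain.lower()
    return host == audited or host.endswith("." + audited)

print(points_to_domain("https://blog.example.com/post", "example.com"))  # True
print(points_to_domain("https://other.org/page", "example.com"))         # False
```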
Number of URLs
The total number of unique URLs imported into the list.
This number could be less than the number in the original list, because duplicates are removed.
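For example, merging two lists while discarding duplicates, which is why the final count can be lower than the sum of the originals, could be sketched as:

```python
def merge_unique(*url_lists):
    """Merge several URL lists, discarding duplicates while preserving first-seen order."""
    seen = set()
    merged = []
    for url_list in url_lists:
        for url in url_list:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

list_a = ["https://example.com/a", "https://example.com/b"]
list_b = ["https://example.com/b", "https://example.com/c"]
print(len(merge_unique(list_a, list_b)))  # 3, not 4
```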
Session Name (optional)
You can give your crawl session an optional descriptive name for your own convenience. You will also be able to add it or change it at a later time.
URL list
This tab sheet lists all imported URLs to be crawled.
URL
Column holding the imported URLs of the pages to crawl.
Domains
This tab sheet lists all distinct domain names, extracted from the imported URLs.
Note: the list can be exported to Excel/CSV.
Name
The domain name.
Count
The number of URLs in the list belonging to the domain.
Sub Domains
This tab sheet lists all distinct sub-domain names, extracted from the imported URLs.
Note: the list can be exported to Excel/CSV.
Name
The sub-domain name.
Count
The number of URLs in the list belonging to the sub-domain.
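Both the Domains and Sub Domains tabs essentially group the imported URLs by host name. A simplified Python sketch of that tally (it naively takes the last two host labels as the domain, ignoring public suffixes such as .co.uk):

```python
from collections import Counter
from urllib.parse import urlparse

def count_hosts(urls):
    """Tally URLs per full host (sub-domain) and per naive registrable domain."""
    sub_domains = Counter()
    domains = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        sub_domains[host] += 1
        domains[".".join(host.split(".")[-2:])] += 1
    return domains, sub_domains

urls = ["https://blog.example.com/a", "https://www.example.com/b", "https://other.org/c"]
print(count_hosts(urls))
```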
Show/Hide options
Clicking on the Show options link will expand the window to let you access further crawl parameters.
Download content of pages with HTTP error codes
When auditing page content from an SEO perspective, you normally only worry about pages returning a 200 OK status code, because only such pages can be indexed by search engines.
Nevertheless there are various reasons to wish to analyze the other pages anyway: checking for the Analytics tracking code, checking whether the error page is user-friendly, and so on.
We suggest disabling the option only in those rare cases when you need to explore huge sites with many HTTP errors and you need to save as much disk space as possible.
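For comparison, outside the program you would have to explicitly keep the body of a non-200 response to run such checks. A minimal sketch using the third-party requests library, with a purely illustrative tracking-snippet test:

```python
import requests

# requests does not discard the body of error responses, so it can still be inspected
response = requests.get("https://example.com/missing-page", timeout=10)
print(response.status_code)              # e.g. 404
has_tracking = "gtag(" in response.text  # crude, illustrative check for an analytics snippet
print(has_tracking)
```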
Ignore robots.txt 'Disallow' directives (verified sites only)
Selecting this option will make the spider ignore the Disallow: directives read in the robots.txt file that would normally prevent visiting some website paths.
For robots.txt file, treat all HTTP 4xx status codes as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
Google made the despicable choice to treat some status codes such as 401 "Unauthorized" and 403 "Forbidden" as "allow everything" as well, even though semantically they mean the opposite!
In order to be able to reproduce Google's behavior we added this option, which is not selected by default.
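The two interpretations can be summarized schematically as follows (a sketch, not the program's code; the option flag name is invented for illustration):

```python
def robots_txt_policy(status_code, treat_all_4xx_as_allow=False):
    """Map the HTTP status code of the robots.txt request to a crawl policy."""
    if status_code == 200:
        return "use rules in file"
    if status_code in (404, 410):
        return "full allow"                  # missing file: allow everything
    if 400 <= status_code < 500 and treat_all_4xx_as_allow:
        return "full allow"                  # Google-like tolerance for 401, 403, ...
    return "full disallow"                   # conservative default per the original spec

print(robots_txt_policy(403))                               # full disallow
print(robots_txt_policy(403, treat_all_4xx_as_allow=True))  # full allow
```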
For robots.txt file, treat a redirection to / as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately it is not a rare setup to redirect missing files to the root address (i.e. to the Home Page), a generic rule that then applies to a missing robots.txt as well. It is a disputable practice (Google for example treats generic redirections to the Home Page as "soft 404s"), but common enough that Google chose to tolerate this specific case and interpret it like a 404 (besides, this way the webmaster's intention is respected).
In order to be able to reproduce Google's behavior we added this option, which is not selected by default.
For robots.txt file, treat a redirection to [other domain]/robots.txt as "full allow"
According to the original robots.txt specifications, a missing file (404 or 410) should be interpreted as "allow everything" and all other status codes should be interpreted as "disallow everything".
A redirection should thus be interpreted as "disallow everything"; unfortunately it is a common scenario, in cases like an HTTP->HTTPS migration or a domain name change, to redirect everything from the old version to the new, robots.txt included.
To permit auditing a site after an HTTP to HTTPS migration when the given Start URL uses the http:// protocol (or the protocol is not specified and http:// is assumed), we added this option, selected by default.
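The two redirect tolerances described above can be pictured as checks on where the robots.txt request ends up. A simplified sketch (invented parameter names, not the program's code):

```python
from urllib.parse import urlparse

def redirected_robots_policy(original_url, final_url,
                             allow_redirect_to_root=False,
                             allow_redirect_to_other_robots=True):
    """Decide the crawl policy when the robots.txt request was redirected elsewhere."""
    final = urlparse(final_url)
    if final.path in ("", "/") and allow_redirect_to_root:
        return "full allow"        # redirect to the Home Page, tolerated like a 404
    if (final.path == "/robots.txt" and final_url != original_url
            and allow_redirect_to_other_robots):
        return "full allow"        # e.g. HTTP->HTTPS migration redirecting robots.txt too
    return "full disallow"         # original spec: any other redirect disallows everything

print(redirected_robots_policy("http://example.com/robots.txt",
                               "https://example.com/robots.txt"))  # full allow
```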
Maximum number of concurrent connections:
SEO spiders try to speed up website visits by using multiple concurrent HTTP connections, i.e. requesting several web pages at the same time.
Visual SEO Studio does the same, even though its adaptive crawl engine can decide to push less if it detects the web server would get overloaded.
This control lets you tell the spider how much harder it can push if the web server keeps responding quickly.
Maximum limit is 5. The free Community Edition can only use 2 concurrent connections at most.
Warning: increasing the number of threads could slow down or hang the server if it cannot keep up with the requests; do it at your own risk (that's why you can force more on verified sites only).
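As a generic illustration of what a concurrency cap means (a sketch with Python's standard library, unrelated to the program's own adaptive crawl engine):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_CONNECTIONS = 2  # e.g. the Community Edition limit

def fetch_status(url):
    """Request one URL and return its HTTP status code."""
    with urlopen(url, timeout=10) as response:
        return url, response.status

urls = ["https://example.com/", "https://example.org/", "https://example.net/"]
# at most MAX_CONNECTIONS requests are in flight at the same time
with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
    for url, status in pool.map(fetch_status, urls):
        print(url, status)
```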