FAQs: Crawl issues
Questions, answers and troubleshooting tips about possible crawl issues.
- The program crawls this site very slowly, with 2s delay between every call. Can I speed it up?
- The program does not crawl a site.
- The program only crawls the first page of a site.
- After crawling a site for a while the crawler gets HTTP 403 responses.
- My site has XXX pages, but the program only finds YYY
- My site has only XXX pages, but the program says it has many more
- Community Edition: I'm exceeding the crawl page limit. How can I crawl more pages?
- Can I use a proxy?
- A site is blocking Visual SEO Studio spider via robots.txt, how to bypass it?
The program crawls this site very slowly, with 2s delay between every call. Can I speed it up?
Check the "Crawl-Delay" value reported in the right-hand Crawl Progress panel. If it is greater than zero, it is highlighted in red. In this case it is probably 2s.
The robots.txt file of the site you are crawling sets a Crawl-Delay directive to prevent agents from overloading the server's resources; Visual SEO Studio respects it up to two seconds (a limit that today is considered highly conservative).
For verified sites, where you have demonstrated administrative permissions and can basically do whatever you want, you can override it by setting a courtesy delay of 0.0s in the "Crawl speed" tab of the crawl options.
See also the page Managing the Verified Sites list.
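The capping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual code: the function name `effective_crawl_delay` and the constant `MAX_HONORED_DELAY` are made up for the example, and the parsing relies on the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumption for this sketch: the two-second cap mentioned in the text.
MAX_HONORED_DELAY = 2.0


def effective_crawl_delay(robots_txt: str, user_agent: str) -> float:
    """Return the delay (in seconds) a polite crawler would wait between calls."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None when no directive applies
    if declared is None:
        return 0.0
    # Honor the directive, but never wait longer than the cap.
    return min(float(declared), MAX_HONORED_DELAY)


sample_robots = """User-agent: *
Crawl-delay: 10
"""

print(effective_crawl_delay(sample_robots, "Pigafetta"))
```

With a declared delay of 10 seconds, the sketch waits only the capped two seconds; with no directive, it waits zero.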
The program does not crawl a site.
There are several possible explanations, each with its own solution:
- No resources were crawled, not even the site robots.txt file
Most likely there has been a network error while attempting to crawl the site. It could be a DNS error, a firewall or a proxy issue. Check the details in the bottom Output panel. You can configure your proxy settings from the Tools -> Preferences menu item.
Check if you can browse the site with your preferred browser. If the browser can't, it could be an issue with the web site. It could be a temporary issue, or something more serious to address with your site administrators.
- Only the robots.txt file has been crawled, but it reports an error
Visual SEO Studio fully respects the Robots Exclusion Protocol; to do so, it first has to download the site robots.txt file to learn which limitations the site administrators have asked crawlers to comply with. If it cannot read the file, it conservatively treats that as a "do not crawl" (note that a missing robots.txt file is not considered a block). Check the details in the bottom Output panel.
- Only the robots.txt file has been crawled, with no errors
Visual SEO Studio fully respects the Robots Exclusion Protocol; to do so, it first has to download the site robots.txt file to learn which limitations the site administrators have asked crawlers to comply with. In this case the site administrators may have denied access to the spider with a Disallow directive in the robots.txt file (note: the default Visual SEO Studio user-agent is "Pigafetta"). Check the details in the bottom Output panel.
If you are the site owner and you have verified the site, you can ignore or override robots.txt directives, change the user-agent, and basically do whatever you want on your own property.
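To check by hand whether a robots.txt file blocks the "Pigafetta" user-agent, a short Python sketch can help. It uses the standard library's `urllib.robotparser`; the helper name `is_crawl_allowed`, the sample rules and the `example.com` URLs are assumptions made for the illustration, and the library's matching may differ in details from the program's own parser.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Tell whether the given robots.txt content lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: block Pigafetta everywhere, allow every other agent.
blocking_rules = """User-agent: Pigafetta
Disallow: /

User-agent: *
Disallow:
"""

print(is_crawl_allowed(blocking_rules, "Pigafetta", "https://example.com/"))
print(is_crawl_allowed(blocking_rules, "OtherBot", "https://example.com/"))
```

If the first call reports that crawling is not allowed while the second reports that it is, the block targets the spider specifically rather than all crawlers.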
The program only crawls the first page of a site.
The most likely reason is that the web page content has been truncated before any link definition in the HTML.
This could happen in web pages full of on-page CSS and scripts; WordPress plugins are common offenders. To find out whether that is your case, select the page node and look at the right-hand Properties panel. Is the "Truncated" property true? Then look at the Content panel: you will likely see that the page HTML head is huge and that the truncation occurs before the body definition.
To work around the issue, raise the "Maximum download size per URL (KB)" option and crawl the website again.
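The effect of truncation on link discovery can be reproduced with a small Python sketch. It is an illustration only: the page content is fabricated, `links_within_limit` is a hypothetical helper, and the real crawler's parsing is certainly more sophisticated than the standard library's `html.parser` used here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_within_limit(html: str, max_kb: int) -> list[str]:
    """Parse only the first max_kb kilobytes of the page, as a size-limited crawler would."""
    raw = html.encode("utf-8")[: max_kb * 1024]
    collector = LinkCollector()
    collector.feed(raw.decode("utf-8", errors="ignore"))
    collector.close()
    return collector.links


# A fabricated page whose head carries roughly 234 KB of inline CSS.
bloated_head = "<style>" + "a{color:red}" * 20000 + "</style>"
page = f"<html><head>{bloated_head}</head><body><a href='/about'>About</a></body></html>"

print(links_within_limit(page, 100))  # cut inside the head
print(links_within_limit(page, 512))  # whole page fits
```

With the lower limit the cut falls inside the oversized head, so the only link on the page is never seen; raising the limit makes it visible again.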
After crawling a site for a while the crawler gets HTTP 403 responses.
This is quite uncommon, as Visual SEO Studio is probably the most polite SEO spider on earth: not only does it fully respect the Robots Exclusion Protocol, it also has an adaptive engine that continually monitors the web server response time to avoid overloading it. Nevertheless, site administrators could have set up restrictive policies that flag it as a potential waste of resources and block it after a while.
To work around the issue, set a proper "courtesy delay" between HTTP requests (note that HTTP calls are never made in parallel; only the processing phase is concurrent).
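As a rough illustration of what a courtesy delay means, the following Python sketch issues requests strictly one after another and pauses between them. The names `crawl_sequentially`, `fetch` and `courtesy_delay` are assumptions made for the example; the fetch function is a stand-in that performs no real network call.

```python
import time
from typing import Callable, Iterable


def crawl_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    courtesy_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Fetch URLs one at a time, pausing courtesy_delay seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(courtesy_delay)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call, so the sketch runs offline."""
    return f"content of {url}"


pages = crawl_sequentially(["/", "/about", "/contact"], fake_fetch, courtesy_delay=0.1)
print(len(pages))
```

The `sleep` parameter is injectable only so that the pacing can be checked without actually waiting.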
My site has XXX pages, but the program only finds YYY
Crawl options can significantly vary what the spider can discover.
The most likely cause is the maximum crawl depth setting: deeply paginated content would not be discovered. Try setting the crawl depth option to the maximum allowed (you might wonder why there is a maximum crawl depth at all; without it, the program would have no defense against unintentional "spider traps", such as an infinite calendar).
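The role of a depth limit is easy to see on a toy example. The Python sketch below runs a breadth-first crawl over an invented in-memory site whose calendar links to the next month forever; `links_of` and `crawl` are hypothetical names, and the real spider's traversal strategy is not documented here.

```python
from collections import deque


def links_of(url: str) -> list[str]:
    """Invented site: a home page plus a calendar where each month links to the next."""
    if url == "/":
        return ["/about", "/calendar/1"]
    if url.startswith("/calendar/"):
        month = int(url.rsplit("/", 1)[1])
        return [f"/calendar/{month + 1}"]  # never ends: a spider trap
    return []


def crawl(start: str, max_depth: int) -> list[str]:
    """Breadth-first crawl that does not follow links found at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # the depth limit is what stops the infinite calendar
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", max_depth=3))
```

The same limit that stops the trap also hides legitimate pages that sit deeper than the configured depth, which is why raising it can reveal more URLs.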
Other causes may impede the discovery and/or crawling of site URLs. For example, if some pages exceed the maximum download size, their content is truncated; any link defined in the truncated part is not seen and consequently not crawled.
Other pages may only be linked from within pages blocked by robots.txt; or from private pages. The spider needs to find links in order to follow them, and can only see them in pages it is allowed to visit.
My site has only XXX pages, but the program says it has many more
Most likely the web site has some internal duplication issues. A typical case where the number of URLs can be four times the expected number is when the site responds to both http:// and https:// URL versions, and to both www. and non-www. URL versions. This could be caused by an HTTP to HTTPS migration where no 301 redirects were put in place, internal links were not all fixed, and no canonical URL was specified.
So, even if you perceive your site as having only XXX pages, search engines (and thus also Visual SEO Studio) consider different URLs as different pages, and the actual number of pages as seen by a search engine is much bigger. Google may of course recognize the internal duplication, but do not assume it will pick the version you prefer. Search on Google for your site pages using the site: operator to see which URLs it is picking (most other search engines also recognize the site: operator).
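To make the "four times" figure concrete, here is a small Python sketch that lists the scheme and host variants under which a single page may answer. The function name `host_and_scheme_variants` and the `example.com` URL are assumptions made for the illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def host_and_scheme_variants(url: str) -> list[str]:
    """List the http/https and www/non-www versions of the same URL."""
    parts = urlsplit(url)
    host = parts.netloc
    bare_host = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("http", "https"):
        for candidate in (bare_host, "www." + bare_host):
            variants.append(
                urlunsplit((scheme, candidate, parts.path, parts.query, parts.fragment))
            )
    return variants


for variant in host_and_scheme_variants("https://www.example.com/page"):
    print(variant)
```

If all four variants return the page content instead of redirecting to a single one, a crawler counts four URLs for what you consider one page.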
Another reason could be that the same pages are reachable through different URLs. A typical case is the "faceted navigation" you can find in many e-commerce sites, where a product page can appear under multiple categories or under multiple search filters, and the search filters are part of the URL. The search engine sees them as internal duplicate content.
The first step toward a solution is using a "canonical URL" to tag each product page with the preferred URL. Once a page is "canonicalized", only the version with the canonical URL is indexed by the search engine, even if more versions are crawled. Visual SEO Studio, like the search engine spiders, will see and visit all page versions, but will not report them as a duplicate content issue in the HTML Suggestions report, and will mark the non-canonical versions of the pages in light green in its Views to help you recognize them.
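For readers who want to inspect a page by hand, the following Python sketch extracts the canonical URL declared with a `<link rel="canonical">` element and compares it with the URL the page was fetched from. The class and function names and the sample markup are assumptions; treating a page without the tag as its own canonical version is a simplification chosen for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_url(html: str) -> Optional[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


def is_canonical_version(page_url: str, html: str) -> bool:
    """Simplification: a page without the tag counts as its own canonical version."""
    declared = canonical_url(html)
    return declared is None or declared == page_url


sample_html = (
    '<html><head><link rel="canonical" '
    'href="https://example.com/shoes/red-sneaker"></head><body></body></html>'
)
print(is_canonical_version("https://example.com/sale/red-sneaker", sample_html))
```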
Faceted navigation can also be an issue in terms of "crawl budget" consumption: the search engine spider repeatedly visits the same logical content, wasting time and resources instead of prioritizing visits to the pages you care more about. To fix that, make sure the spider finds unique crawl paths by using clear internal navigation and a clean link structure, and by blocking undesired crawl paths with Robots Exclusion Protocol rules (e.g. robots.txt Disallow directives, nofollow attributes, the nofollow robots meta tag...).
For web sites hosted on IIS web servers (or other web servers with a case-insensitive file system), differences in character casing in the URL are ignored by the web server (contrary to the official URL specifications), so internal links with the wrong casing lead to the discovery of a "new" duplicate page instead of a broken link.
In such cases, the cure is to locate all occurrences with the proper report in HTML Suggestions in Visual SEO Studio, fix the wrong internal links, and use a proper canonical URL tag.
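Outside the program, casing duplicates in an exported URL list can be spotted with a few lines of Python. This is a sketch under assumptions: `casing_duplicates` is a hypothetical helper, and it only compares the path in a case-insensitive way, leaving the query string untouched.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def casing_duplicates(urls: list[str]) -> list[list[str]]:
    """Group URLs that differ only by letter casing in the host or path."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        key = (parts.scheme.lower(), parts.netloc.lower(), parts.path.lower(), parts.query)
        groups[key].append(url)
    # Keep only the groups that actually contain more than one spelling.
    return [group for group in groups.values() if len(group) > 1]


crawled = [
    "https://example.com/About",
    "https://example.com/about",
    "https://example.com/contact",
]
print(casing_duplicates(crawled))
```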
Other internal content duplication issues can be caused by useless URL query parameters. This again can happen in faceted navigation, but also in several other cases.
The easiest way to deal with them is tagging each page with the proper canonical URL tag.
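To decide which URL should be declared as canonical, it can help to see what a URL looks like once the parameters that do not change the content are removed. The Python sketch below does that with the standard library; the `IGNORED_PARAMS` set is purely hypothetical and must be adapted to the parameters your own site treats as irrelevant.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters assumed not to change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def strip_ignored_params(url: str) -> str:
    """Return the URL without the query parameters listed in IGNORED_PARAMS."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_ignored_params("https://example.com/shoes?color=red&utm_source=newsletter"))
```

URLs that collapse to the same string after this step are candidates for sharing one canonical URL.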
Community Edition: I'm exceeding the crawl page limit. How can I crawl more pages?
The free Community Edition of Visual SEO Studio limits the combined number of crawled pages and images to 500.
If you want to crawl more pages, you can uncheck the "Crawl images" option (selected by default):
The crawl options for images
Of course, you will lose the ability to view and inspect images.
If that is not enough, and you also want to be able to run complete image audits, purchase a Professional licence.
Can I use a proxy?
There are many legitimate reasons to use a proxy. For example, you might wish to test access to a site with an IP address from another country, or your company IT policies might require using a proxy.
Visual SEO Studio lets you configure a proxy from the main menu -> Preferences -> Proxy settings.
The default is to use the same proxy configured for the operating system (in Windows: "Internet Options"), but you can specify no proxy or a custom proxy.
For a custom proxy you can configure a local network proxy or even an external proxy.
Be aware that a wrong setting could impair the ability of the software to operate; in case of malfunctions, please revert to the default option.
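The three choices (system proxy, no proxy, custom proxy) have a close analogue in Python's standard `urllib.request`, which may help when you want to reproduce a crawl problem outside the program. This is only an analogy, not a description of how Visual SEO Studio is implemented; the mode names, the helper functions and the proxy address are all assumptions.

```python
import urllib.request
from typing import Optional


def proxy_mapping(mode: str, proxy_url: Optional[str] = None) -> Optional[dict]:
    """Translate a proxy mode into the mapping expected by ProxyHandler."""
    if mode == "system":
        return None  # ProxyHandler(None) falls back to the system settings
    if mode == "none":
        return {}  # an empty mapping disables proxies altogether
    if mode == "custom":
        if not proxy_url:
            raise ValueError("a custom proxy needs a URL")
        return {"http": proxy_url, "https": proxy_url}
    raise ValueError(f"unknown proxy mode: {mode}")


def build_opener_for(mode: str, proxy_url: Optional[str] = None):
    """Build an opener for the given mode; nothing is sent over the network here."""
    handler = urllib.request.ProxyHandler(proxy_mapping(mode, proxy_url))
    return urllib.request.build_opener(handler)


# Hypothetical local proxy address, for illustration only.
opener = build_opener_for("custom", "http://127.0.0.1:8888")
print(type(opener).__name__)
```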
A site is blocking Visual SEO Studio spider via robots.txt, how to bypass it?
In some rare cases a site owner might have added a Disallow: rule to the robots.txt file to block the product's spider, "Pigafetta" (it is very rare, because Pigafetta is an extremely well-behaved Internet citizen, but it can happen).
If it is a website owned by you or by a client of yours, you could:
- Have the Disallow: directive removed from the robots.txt file.
- Or add the website to the Verified Sites list (always recommended for your sites).
Once a site is verified, you can crawl it by bypassing robots.txt directives or by changing the program's user-agent.
Crawl options to bypass a block in robots.txt
If the site is not under your control, and the site owner doesn't want you to crawl their assets, well... we have to respect their wishes.
Visual SEO Studio adheres to ethical rules that are simple: your home, you can do whatever you want; someone else's home, you follow the rules of the house.
Some users do bypass this behaviour by routing the program through a proxy tool such as HTTP Fiddler, which lets them spoof the user-agent of outgoing HTTP calls; we do not provide support for this procedure.