Pigafetta, Visual SEO Studio crawler

Introducing Pigafetta, the well-behaved web bot powering Visual SEO Studio

What is the Pigafetta bot, and what is it for?

Pigafetta is the web crawler powering Visual SEO Studio, an SEO software tool.
It examines site structure and HTML pages to evaluate their SEO robustness.

Pigafetta is a good web citizen:

  • Uses an explicit agent name, "Pigafetta"
  • Always checks the robots.txt file before entering a domain or sub-domain
  • Adheres to the Robots Exclusion Protocol, obeying the robots.txt syntax and its extensions
  • Interprets the Robots Exclusion Protocol conservatively (i.e. always choosing the most restrictive interpretation)
  • Obeys the Robots Meta Tag
  • Obeys the X-Robots-Tag HTTP header
  • Obeys the Crawl-delay parameter (max 2.0)
  • Never exceeds a web server's throughput: it adapts to it, and always waits for a server response to complete before issuing a new request
  • Sets the "From" HTTP header: "pigafetta-bot(at)visual-seo.com"
  • Includes a reference page in the user-agent string ("http://visual-seo.com/Pigafetta-Bot", this page)
  • Sets the "Referer" HTTP header, to help visited site owners locate broken links
  • Accepts session cookies by default, in order not to waste server resources by spawning new sessions
  • Obeys the rel="nofollow" link attribute
  • Checks the type attribute of links, to avoid requesting content it cannot process
  • Downloads (X)HTML content only
  • Does not spoof other user agents (unless required and authorized by a verified site administrator)

What is Pigafetta's user-agent string?

Pigafetta identifies itself with the name "Pigafetta"; an example of the user-agent string it uses (the version number may change) is:

Mozilla/5.0 (compatible; Pigafetta/0.5; +http://visual-seo.com/Pigafetta-Bot)

robots.txt - How to prevent Pigafetta from crawling my site?

Pigafetta adheres to the standard robots.txt syntax; in particular it supports the Disallow directive. If you want to prevent Pigafetta from visiting your site, simply add this record to your robots.txt file:

User-agent: Pigafetta
Disallow: /

Pigafetta will obey the first record with a User-Agent equal to "Pigafetta". If there is no such record it will obey the first entry with a User-Agent of "*".
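
For example, in the following (purely illustrative) robots.txt, Pigafetta would obey its dedicated record, staying out of /admin/ but crawling the rest of the site, while other bots matching the "*" record would be excluded entirely:

User-agent: Pigafetta
Disallow: /admin/

User-agent: *
Disallow: /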

Pigafetta will not retrieve any document whose URL path begins with a disallowed value, e.g.:

User-agent: *
Disallow: /private

would cause all URLs starting with "/private" to be disallowed. For example, all of the following would not be retrieved:

/private/
/private_www/
/private.html

Pigafetta always checks the robots.txt file before entering a site or sub-domain; if the robots.txt file is missing (HTTP 404), it interprets that the standard way: access allowed.
Any HTTP status other than "200 OK" or "404 Not Found" will prevent Pigafetta from browsing the site for the current session, as it cannot determine its access permissions.

Robots Meta Tag

Pigafetta recognizes the noindex and nofollow directives in robots meta tags. If you place the following in the head of your page:

<meta name="robots" content="noindex, nofollow" />

or

<meta name="Pigafetta" content="noindex, nofollow" />

Pigafetta will mark the page not to be indexed, and will not follow its links.

X-Robots-Tag HTTP header

The bot also recognizes the noindex and nofollow directives when set via HTTP headers. If your web server returns one of the following values among the HTTP response headers:

X-Robots-Tag: noindex, nofollow

or

X-Robots-Tag: Pigafetta: noindex, nofollow

Pigafetta will mark the page not to be indexed, and will not follow its links.

Crawl-Delay

There are more than enough badly behaved web crawlers out there, and one of the worst things a bot can do is bombard a web site with hundreds of requests a second. It amounts to a DoS attack, and can also quickly consume the bandwidth allowance a small site owner is granted by their hosting provider.

Pigafetta is different. It strictly obeys the Crawl-delay directive, if present (Crawl-delay is an extension to the standard robots.txt syntax, allowing a site owner to state the minimum number of seconds they consider fair for a bot to wait between subsequent HTTP requests).
Simply add the following entry to your robots.txt file:

Crawl-delay: 2

and Pigafetta will respect it.
You can also express tenths of a second, using a dot as the decimal separator; the maximum value is 2.0 (higher values are interpreted as 2 seconds).

If your site's response time is greater than the crawl delay, Pigafetta will wait even longer, giving your site time to respond without overloading it.
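
For reference, a complete record targeting Pigafetta specifically could look like the following (the 0.5 value is just an illustration, asking for half a second between requests):

User-agent: Pigafetta
Crawl-delay: 0.5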

Allow directive

The Allow directive is an extension to the original robots.txt syntax. It makes it possible to re-enable crawling of a subset of a path blocked by a Disallow directive.
Unfortunately there are two commonly adopted ways to resolve the precedence between the two directives: by their order within the robots.txt content, or by the length of their path values. Pigafetta adopts the same strategy used by Googlebot: the directive with the longer path value wins.
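
As an illustration (the paths here are purely hypothetical), in the following record the Allow directive has the longer path value, so it wins: everything under /private/ stays blocked, except URLs starting with /private/public-docs/:

User-agent: Pigafetta
Disallow: /private/
Allow: /private/public-docs/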

"*" and "$" wildcards

Pigafetta fully supports wildcard pattern matching for path values, another extension to the original robots.txt syntax:

  • "*" designates 0 or more instances of any valid character;
  • "$" designates the end of the URL.

Other behaviour worth noting

  • Pigafetta supports the rel="canonical" link relation declared in the HTTP Link header (see the example after this list).
  • Pigafetta can optionally (not the default behaviour) accept HTTP 30x redirections for robots.txt files, as that may better respect the site owner's wishes.
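
At the HTTP level, a rel="canonical" declaration is carried by the Link response header, as in the following example (the URL is of course just a placeholder):

Link: <http://www.example.com/canonical-page>; rel="canonical"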

Further considerations

We never ran Pigafetta in the wild before first making sure that all these features were in place and working correctly.

We run a good number of unit tests to avoid bugs and regressions. However, unlike the bot itself, we are all human beings and can make programming mistakes, so if you notice the bot misbehaving in any way (crawling a page or directory it shouldn't, or crawling at too high a rate), or you just have general inquiries, please feel free to contact us at the address declared by the spider. We will appreciate your feedback.

Since Pigafetta runs from within Visual SEO Studio, a desktop SEO application that can be downloaded from anywhere, we cannot provide a fixed range of IP addresses from which Pigafetta requests would originate.
There is also the possibility that more than one user decides to crawl the same website at the same time, resulting in a higher combined crawl frequency, but this should be a fairly remote occurrence.

Future improvements

  • Support for noarchive Meta and HTTP directives, in both the generic and bot-specific forms.
  • Support for unavailable_after Meta and HTTP directives, in both the generic and bot-specific forms.

Where does the name Pigafetta come from?

It is common practice to name web crawlers after past human explorers.
Antonio Pigafetta was an Italian scholar who travelled with Magellan during the first successful journey around the world, between 1519 and 1522.
Magellan himself died before completing the feat, and his role was taken over by Elcano.

Pigafetta was one of the few survivors to reach the destination; he left a complete and fascinating journal of the enterprise, and a description of all the places they visited.
His diary is a pleasant read I highly recommend. You will enjoy the adventure and discover a fine narrator, free of racism, sincerely attached to his commander, and passionate about science (considering the era he lived in), discovery and his duty.
