Manual: Data Extraction

The feature "Data Extraction" of Visual SEO Studio, documented in detail.

Data Extraction

The Data Extraction feature lets you create tables with data from the pages of a crawl session, extracted using the powerful XPath query language (version 1.0).

Suppose, for example, you want to find all links to your domain not using the HTTPS protocol, or all external links with target="_blank" but without rel="noopener" (and thus with a security flaw); by setting the proper XPath expression you can easily find all the elements you are looking for.

Note: these two are just examples; the Links Inspector feature already automates both searches for you.

Toolbar

New

An extraction begins with creating a new expression set.
Clicking the New button expands a drop-down menu with the following entries:

  • New, to create a new empty expression set
  • New from existing..., to create a new expression set by copying an existing one; a dialog window will pop up to let you select the expression set you want to copy

Open

To open a previously created expression set, click on the Open button; a dialog window will appear to let you select the desired set.

Save

With the Save button you can save your XPath expression set for later use.
The program alerts you if there are unsaved expression sets before closing, or when you attempt to close the tab without having saved your changes.

Discard Changes

If you have changed your XPath expression set and do not want to save your changes, you can click on the Discard Changes button; the expression set will revert to the state it had before your changes or, in the case of a new set, it will be deleted.

Delete

The Delete button removes the expression set currently open. Before removing it Visual SEO Studio will prompt you to confirm the operation.

Extract Data

Once you have added all the columns for the desired XPath expressions, you can proceed with the extraction by clicking on the Extract Data button; the table will fill up with the extracted data.

As in all other Visual SEO Studio grids, you can export the table content using the context menu, which opens by clicking the top-left corner of the table (on Windows) or by right-clicking any cell of the table (on Mac).
The context menu has the following options:

  • Choose columns...
    to show or hide columns of the table
  • Find value in grid...
    to search for a specific value in the table cells
  • Export to Excel...
    to export the content of the shown columns to an Excel document
  • Export to CSV...
    to export the content of the shown columns to a CSV file
  • Add Google "Search Analytics" data...
    to add columns with data from Google
  • Add Bing/Yahoo "Page Traffic" data...
    to add columns with data from Bing/Yahoo
  • Add data from "Moz"...
    to add columns with data from Moz

Extraction criteria

Column set name

With this field you assign a name to the XPath expression set (which conceptually is a column set).
When you create a new expression set the program automatically generates a name; before saving, we suggest replacing it with a meaningful one so you can find the set more easily. You can always change it later.

Add Column

The Add Column button lets you add a new column bound to a new XPath expression.

Delete Column

The Delete Column button lets you remove a column bound to an XPath expression. Before removal you will be prompted to confirm the choice.

Column name

Enter in this field the name you want to appear in the result table as the header of the column bound to the XPath expression.

XPath to content

This is the key field of the data extraction functionality; it lets you leverage the expressive power of XPath.

Writing a full treatise on XPath syntax and inner workings is beyond the scope of this page; here we limit ourselves to the basics needed to understand how to use it.

Simplifying greatly, an HTML page is organized as a hierarchical structure called the DOM ("Document Object Model"), seen as a tree of nodes. The nodes represent the tags (called "elements" in the DOM) and their attributes.
XPath lets you query the DOM to find collections of tags, attributes and text within the tags.

Here is a brief description of the most common elements composing an XPath expression:

  • /:
    Specifies the position of the node (or nodes) to be found within the hierarchy. At the start of an expression it indicates that the selection starts from the root node, which in the case of an HTML page is the <html> tag; the expression /html thus returns the <html> root node.
    You can specify any position within the hierarchy, for example /html/head/*; in this case the result will be the meta tags and all the other elements within the <head> element.
  • //:
    Searches for nodes regardless of their position within the tree hierarchy. For example, the expression //img returns all img tags within the page.
  • *:
    We have already seen this wildcard character: it matches any element node. In the previous example, /html/head/* returned all child elements of the <head> tag. Used along with @, i.e. @*, it returns all attributes of an element.
  • ():
    Parentheses permit grouping; they are also used to force a priority order for the operators.
  • []:
    Selects elements of a collection by position or by condition. For example, to find the first H1 title of a page the expression is (//h1)[1] (the parentheses force the order of evaluation, because [] would otherwise take precedence over //).
    You can use functions like last() or position() to get, for example, the last H1 ((//h1)[last()]) or the first three H1s ((//h1)[position()<=3]); without the parentheses, the position is counted among the H1 children of each parent element rather than across the whole page.
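    You can also do in-text searches using functions like starts-with() or contains(), or even regular expressions with matches() (matches() is not one of the standard XPath 1.0 functions, so its availability depends on the XPath engine).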
  • @:
    Selects an attribute. For example the expression //a[@target='_blank'] finds all links with attribute target equal to _blank, i.e. all links that open in a new browser tab.
  • =, !=, <=, >=:
    With these operators you can search by attribute value. The expression //a[@target='_blank'] shown above, for example, uses the equality operator =.
  • and, or, not():
    With the logical operators and and or, and with the not() function, you can combine or negate conditions.
  • |:
    You can also concatenate several XPath expressions thanks to the | (pipe) operator to join results in a single column.
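
As a rough illustration of the building blocks listed above, here is a minimal Python sketch. It makes several assumptions: the sample page is invented, it must be well-formed XML, and Python's standard xml.etree.ElementTree module supports only a limited subset of XPath (paths relative to an element, //, *, [@attr='value'], [last()]), so it is a stand-in and not the engine used by Visual SEO Studio. Each comment shows the full XPath expression the call corresponds to.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; ElementTree requires well-formed markup.
PAGE = """<html>
  <head>
    <title>Sample page</title>
    <meta name="description" content="Demo" />
  </head>
  <body>
    <h1>First heading</h1>
    <div><h1>Second heading</h1></div>
    <img src="a.png" alt="A" />
    <a href="/home">Home</a>
    <a href="https://example.org/" target="_blank">External</a>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/*  : every child element of <head>
head_children = [el.tag for el in root.findall("./head/*")]

# //img  : all <img> elements, wherever they are in the tree
images = [el.get("src") for el in root.findall(".//img")]

# //a[@target='_blank']  : links that open in a new browser tab
new_tab_links = [el.get("href") for el in root.findall(".//a[@target='_blank']")]

# (//h1)[1] and (//h1)[last()]  : first and last <h1> of the whole page.
# ElementTree has no parenthesized grouping, so we index the result list.
all_h1 = root.findall(".//h1")
first_h1 = all_h1[0].text
last_h1 = all_h1[-1].text

# //h1[last()]  : every <h1> that is the last <h1> child of its own parent.
# Here both headings qualify, because each has a different parent.
last_per_parent = [el.text for el in root.findall(".//h1[last()]")]

print(head_children, images, new_tab_links)
print(first_h1, last_h1, last_per_parent)
```

The last two queries show why the parentheses matter: the position is evaluated per parent unless the whole //h1 result is grouped first.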

To finally answer the two questions at the beginning:

  • to find all links to external domains with target='_blank' and missing the value noopener in the rel attribute, use the following XPath expression:
    //a[not(contains(@href, 'yourdomain.com')) and @target='_blank' and not(contains(@rel, 'noopener'))]
  • and to find all links to your domain that use the unsafe HTTP protocol, use the expression:
    //a[starts-with(@href, 'http://www.yourdomain.com')]
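
To make the logic of these two queries concrete, the following Python sketch applies the same conditions to a small invented fragment. It is only an approximation: Python's standard xml.etree.ElementTree has no contains() or starts-with(), so the sketch selects all links with the supported query .//a and applies the conditions in plain Python. The sample links and the helper names is_unsafe_external and is_insecure_internal are hypothetical.

```python
import xml.etree.ElementTree as ET

MY_DOMAIN = "yourdomain.com"  # same placeholder as in the expressions above

# Hypothetical, well-formed sample fragment.
PAGE = """<body>
  <a href="http://www.yourdomain.com/old">Insecure internal</a>
  <a href="https://www.yourdomain.com/new">Secure internal</a>
  <a href="https://other.example/" target="_blank">Unsafe external</a>
  <a href="https://other.example/" target="_blank" rel="noopener noreferrer">Safe external</a>
  <a href="https://other.example/">Same-tab external</a>
</body>"""

root = ET.fromstring(PAGE)
links = root.findall(".//a")  # equivalent of //a


def is_unsafe_external(a):
    """href does not contain the domain, target is _blank, rel lacks noopener."""
    href = a.get("href", "")
    rel = a.get("rel", "")  # a missing rel attribute behaves like an empty string
    return (
        MY_DOMAIN not in href
        and a.get("target") == "_blank"
        and "noopener" not in rel
    )


def is_insecure_internal(a):
    """href starts with the plain-HTTP address of the domain."""
    return a.get("href", "").startswith("http://www." + MY_DOMAIN)


unsafe_external = [a.text for a in links if is_unsafe_external(a)]
insecure_internal = [a.text for a in links if is_insecure_internal(a)]

print(unsafe_external)
print(insecure_internal)
```

Note that a link without any rel attribute is reported as unsafe, which matches the behaviour of contains() on a missing attribute in XPath.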

Again: these were given for demonstration purposes; you will find it much quicker to use the Links Inspector feature to answer these and many other advanced queries on links.

What to extract

You can specify what to extract from the nodes identified by the XPath expression. There are three alternatives:

  • InnerText
    For each element (tag) extract the plain text inside it.
  • InnerHtml
    For each element (tag) extract the HTML code contained within the tag.
  • OuterHtml
    For each element (tag) extract the whole HTML code, tag included.
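
The difference between the three options can be sketched in Python as follows. This is an illustration under assumptions, not the program's implementation: the helper functions inner_text, inner_html and outer_html are hypothetical names, the sample markup is invented and must be well-formed XML, and the standard xml.etree.ElementTree module is used as a stand-in HTML parser.

```python
import copy
import xml.etree.ElementTree as ET


def inner_text(el):
    """Plain text inside the element, with all nested tags stripped."""
    return "".join(el.itertext())


def inner_html(el):
    """Markup contained within the element, without the element's own tag."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in el)
    return (el.text or "") + children


def outer_html(el):
    """Whole markup of the element, its own tag included."""
    # ET.tostring() also emits the text that follows the element (its "tail"),
    # so we serialize a shallow copy with the tail removed.
    clone = copy.copy(el)
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


root = ET.fromstring('<div><p class="intro">Hello <b>world</b>!</p> trailing text</div>')
p = root.find("p")

print(inner_text(p))
print(inner_html(p))
print(outer_html(p))
```

For the sample paragraph, the three calls give the text "Hello world!", the fragment "Hello <b>world</b>!" and the complete <p class="intro"> element respectively.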

Extract only first element in case of multiple results

By checking this option you can limit the result to the first occurrence found within the page.

Column headers

URL

The URL of the page containing the result found with the data extraction.

Title

The title of the page containing the result found with the data extraction.

Other columns

These are the added columns bound to the XPath expressions used to extract data.
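
To visualize how such a table relates to the crawled pages, here is a small Python sketch that builds rows with the URL, Title and one extracted column, and then serializes them as CSV. Everything in it is an assumption made for illustration: the two pages, the column name "H2", the helper names build_rows and to_csv, and in particular the choice of one row per match, which is not necessarily how Visual SEO Studio lays out multiple results. The first_only flag mimics the "Extract only first element in case of multiple results" option.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical crawl session: URL mapped to well-formed page markup.
PAGES = {
    "https://www.example.com/": (
        "<html><head><title>Home</title></head>"
        "<body><h2>News</h2><h2>Offers</h2></body></html>"
    ),
    "https://www.example.com/about": (
        "<html><head><title>About</title></head>"
        "<body><h2>Team</h2></body></html>"
    ),
}

COLUMNS = ["URL", "Title", "H2"]


def build_rows(pages, path, first_only=False):
    """Return one row per match (assumption), or one per page if first_only."""
    rows = []
    for url, markup in pages.items():
        root = ET.fromstring(markup)
        title = root.findtext("./head/title", default="")
        matches = root.findall(path)
        if first_only:
            matches = matches[:1]  # keep only the first occurrence in the page
        for el in matches:
            rows.append({"URL": url, "Title": title, "H2": "".join(el.itertext())})
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


all_rows = build_rows(PAGES, ".//h2")
first_rows = build_rows(PAGES, ".//h2", first_only=True)
print(to_csv(first_rows))
```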