Manual: Data Extraction
The feature "Data Extraction" of Visual SEO Studio, documented in detail.
The Data Extraction feature lets you create tables with data from the pages of a crawl session, extracted using the powerful XPath query language (version 1.0).
Suppose, for example, you want to find all links to your domain not using the HTTPS protocol, or all external links with target="_blank" but without rel="noopener" (and thus with a security flaw); by setting the proper XPath expression you can easily find all the elements you are looking for.
Note: these two are just examples; the Links Inspector feature already automates both searches for you.
An extraction begins with creating a new expression set.
Clicking on the New button expands a drop-down menu with the following entries:
- New, to create a new empty expression set
- New from existing..., to create a new expression set by copying an existing one; a dialog window will pop up to let you select the expression set you want to copy
To open a previously created expression set, click on the Open button; a dialog window will appear to let you select the desired set.
With the Save button you can save your XPath expression set for later use.
The program alerts you if there are unsaved expression sets before closing, or when you attempt to close the tab without having saved your changes.
If you changed your XPath expression set and do not want to save your changes, you can click on the Discard Changes button; the expression set will revert to the state it was in before your changes or, in the case of a new set, it will be deleted.
The Delete button removes the expression set currently open. Before removing it Visual SEO Studio will prompt you to confirm the operation.
Once you have added all the columns for the desired XPath expressions, you can proceed with the extraction by clicking on the Extract Data button; the table will fill up with the extracted data.
As in all other Visual SEO Studio grids, the table content can be exported using the context menu, which you open by clicking (on Windows) on the top-left corner of the table, or (on Mac) by right-clicking any cell of the table.
The context menu has the following options:
- an option to show or hide columns of the table
- Find value in grid..., to search for a specific value in the table cells
- Export to Excel..., to export the content of the shown columns to an Excel document
- Export to CSV..., to export the content of the shown columns to a CSV file
- Add Google "Search Analytics" data..., to add columns with data from Google
- Add Bing/Yahoo "Page Traffic" data..., to add columns with data from Bing/Yahoo
- Add data from "Moz"..., to add columns with data from Moz
Column set name
With this field you assign a name to the XPath expression set (which conceptually is a column set).
When you create a new expression set, the program automatically generates a new name; we suggest customizing it with a meaningful name before saving, so you can find it more easily. You can always change it later.
The Add Column button lets you add a new column bound to a new XPath expression.
The Delete Column button lets you remove a column bound to an XPath expression. Before removal you will be prompted to confirm the choice.
In this field, enter the name you wish to appear in the result table as the header of the column bound to the XPath expression.
XPath to content
This is the key field of the data extraction functionality; it lets you leverage the expressive power of XPath.
Writing a full treatise on XPath syntax and inner workings is beyond the scope of this page; here we only provide the basics needed to understand how to use it.
Put very simply, an HTML page is organized as a hierarchical structure called the DOM ("Document Object Model"), seen as a tree of nodes. The nodes represent the tags (called "elements" in the DOM) and their attributes.
XPath lets you query the DOM to find collections of tags, attributes and text within the tags.
Here is a brief description of the most common elements composing an XPath expression:
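To make the idea of a tree of nodes concrete, here is a minimal Python sketch that parses a small page and prints its elements with their attributes, indented by depth. It uses the standard library's xml.etree.ElementTree purely for illustration; the sample page and the outline helper are hypothetical, and ElementTree is an XML parser, so it only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML would need a
# more tolerant parser than ElementTree).
PAGE = """<html>
  <head><title>Sample page</title><meta name="description" content="Demo"/></head>
  <body>
    <h1>Main title</h1>
    <p>Some text with <a href="http://example.com/" target="_blank">a link</a>.</p>
  </body>
</html>"""


def outline(element, depth=0):
    """Return one line per element: indentation, tag name, attributes."""
    attributes = " " + str(element.attrib) if element.attrib else ""
    lines = ["  " * depth + element.tag + attributes]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAGE)  # the <html> root node
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each printed line is one node of the tree; an XPath expression is essentially a description of a route through this structure.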
/
Specifies the position of the node (or nodes) to be found within the hierarchy. At the start of an expression it selects from the root node, which in the case of an HTML page is the <html> tag; the expression /html thus returns the <html> root node.
You can specify any position within the hierarchy, for example /html/head/*; in this case the result will be all meta tags and the other elements within the <head> element.
//
Lets you search for nodes regardless of their position within the tree hierarchy. For example, the expression //img returns all img tags within the page.
*
We have already seen this wildcard character; it matches any element node. In the previous example, /html/head/* returned all child elements of the <head> tag. Combined with @, as in @*, it returns all attributes of an element.
()
Parentheses permit grouping; they are also used to force a priority order for the operators.
[]
Matches a single element of a collection. For example, if we wanted to find the first H1 title of a page, our expression would be (//h1)[1] (we use parentheses to force the operator order, because [] would otherwise take precedence over //).
You can use functions like last() and position() to get, for example, the last H1 (//h1[last()]) or the first three H1s.
You can also do in-text searches using functions like contains(), or even regular expressions.
@
Selects an attribute. For example, the expression //a[@target='_blank'] finds all links with the target attribute equal to _blank, i.e. all links that open in a new browser tab.
With comparison operators you can run searches based on attribute values. In the expression //a[@target='_blank'] seen previously, for example, we used the equality operator =.
With logical operators you can also concatenate conditions.
You can also combine several XPath expressions with the | (pipe) operator to join their results in a single column.
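As a rough illustration of the operators described above, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is not how Visual SEO Studio evaluates expressions: ElementTree supports only a subset of XPath, its paths must be relative (hence the leading dot), and the sample page is made up. Each comment names the full XPath 1.0 expression being approximated; where ElementTree has no equivalent, ordinary list operations stand in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html>
  <head><title>Demo</title><meta name="robots" content="index"/></head>
  <body>
    <div><h1>First</h1><img src="a.png"/></div>
    <div><h1>Second</h1><h1>Third</h1><h1>Fourth</h1><img src="b.png"/></div>
    <a href="/about" target="_blank">About</a>
    <a href="/home">Home</a>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/*  -> every child element of <head>
head_children = [el.tag for el in root.findall("./head/*")]

# //img  -> every <img>, wherever it sits in the tree
images = [el.get("src") for el in root.findall(".//img")]

# //a[@target='_blank']  -> links that open in a new browser tab
new_tab_links = [el.get("href") for el in root.findall(".//a[@target='_blank']")]

# //h1[1]  -> the first <h1> of EACH parent, because [] binds to the step
first_per_parent = [el.text for el in root.findall(".//h1[1]")]

# //h1[last()]  -> likewise, the last <h1> of each parent
last_per_parent = [el.text for el in root.findall(".//h1[last()]")]

# (//h1)[1]  -> the first <h1> of the whole page (emulated by indexing)
all_h1 = root.findall(".//h1")
first_in_page = all_h1[0].text

# (//h1)[position() <= 3]  -> the first three <h1> (emulated with a slice)
first_three = [el.text for el in all_h1[:3]]

print(head_children, images, new_tab_links)
print(first_per_parent, last_per_parent, first_in_page, first_three)
```

The difference between first_per_parent and first_in_page is the reason the parentheses matter when you want a single result for the whole page.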
To finally answer the two questions at the beginning:
to find all links to external domains with target='_blank' and missing the value noopener in the rel attribute, use the following XPath expression:
//a[not(contains(@href, 'yourdomain.com')) and @target='_blank' and not(contains(@rel, 'noopener'))]
and to find all links to your domain that use the unsafe HTTP protocol, use the expression:
Again: these were given for demonstration purposes; you will find it much quicker to use the Links Inspector feature to answer these and many other advanced queries on links.
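The logic of the two link queries can also be written out as ordinary code, which may help when adapting the conditions. The Python sketch below selects every link and applies the same tests one by one; the sample markup, the helper names and the yourdomain.com placeholder are illustrative only. For the second query, one XPath 1.0 expression matching the description would be //a[starts-with(@href, 'http://') and contains(@href, 'yourdomain.com')]; this is an assumed formulation, not one taken from the program.

```python
import xml.etree.ElementTree as ET

MY_DOMAIN = "yourdomain.com"  # placeholder domain, as in the expressions above

# Hypothetical, well-formed sample markup.
PAGE = """<body>
  <a href="https://yourdomain.com/a">internal, secure</a>
  <a href="http://yourdomain.com/b">internal, plain HTTP</a>
  <a href="https://other.example/c" target="_blank">external, no rel</a>
  <a href="https://other.example/d" target="_blank" rel="nofollow noopener">external, safe</a>
  <a href="https://other.example/e">external, same tab</a>
</body>"""

links = ET.fromstring(PAGE).findall(".//a")


def is_unsafe_external(a):
    """External link opened in a new tab without rel containing 'noopener'."""
    href = a.get("href", "")
    return (
        MY_DOMAIN not in href
        and a.get("target") == "_blank"
        and "noopener" not in a.get("rel", "")
    )


def is_insecure_internal(a):
    """Link to our own domain that uses plain HTTP."""
    href = a.get("href", "")
    return href.startswith("http://") and MY_DOMAIN in href


unsafe_external = [a.get("href") for a in links if is_unsafe_external(a)]
insecure_internal = [a.get("href") for a in links if is_insecure_internal(a)]
print(unsafe_external, insecure_internal)
```

Note that, like the XPath version, the substring test treats any href without the domain name as external, so relative links would be counted as external too.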
What to extract
You can specify what to extract from the nodes identified by the XPath expression. There are three alternatives:
- For each element (tag), extract the plain text inside it.
- For each element (tag), extract the HTML code contained within the tag.
- For each element (tag), extract the whole HTML code, tag included.
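The difference between the three alternatives is easiest to see on a single node. The following Python sketch, again based on xml.etree.ElementTree and a made-up paragraph, computes all three forms; it only approximates what the program returns, since serialization details (attribute quoting, whitespace) may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical node, as if it had been matched by an XPath expression.
node = ET.fromstring('<p class="intro">Hello <b>world</b>!</p>')

# Plain text inside the element: all text nodes, tags stripped.
plain_text = "".join(node.itertext())

# HTML contained within the tag: leading text plus each serialized child
# (ElementTree serializes a child together with the text that follows it).
inner_html = (node.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in node
)

# Whole HTML code, tag included.
outer_html = ET.tostring(node, encoding="unicode")

print(plain_text)
print(inner_html)
print(outer_html)
```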
Extract only first element in case of multiple results
By checking this option you can limit the result to the first occurrence found within the page.
The URL of the page containing the result found with the data extraction.
The title of the page containing the result found with the data extraction.
These are the added columns bound to the XPath expressions used to extract data.
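To picture how such a table comes together, here is a small Python sketch that builds one row per page with the URL, the title and one column per expression, then writes the rows as CSV. Everything in it is hypothetical: the PAGES and COLUMNS data, the build_rows helper, the ElementTree-style paths, and in particular the choice to join multiple matches in one cell with " | ", which is an assumption and not necessarily how Visual SEO Studio lays out multiple results.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical crawl result: URL -> well-formed page source.
PAGES = {
    "https://example.com/": "<html><head><title>Home</title></head>"
                            "<body><h1>Welcome</h1></body></html>",
    "https://example.com/about": "<html><head><title>About</title></head>"
                                 "<body><h1>About us</h1><h1>Team</h1></body></html>",
}

# Hypothetical expression set: column header -> path (ElementTree subset).
COLUMNS = {"H1": ".//h1"}


def build_rows(pages, columns, first_only=False):
    """Return one dict per page: URL, Title, then one entry per column."""
    rows = []
    for url, source in pages.items():
        root = ET.fromstring(source)
        row = {"URL": url, "Title": root.findtext("./head/title", default="")}
        for header, path in columns.items():
            values = ["".join(el.itertext()) for el in root.findall(path)]
            if first_only:
                values = values[:1]  # keep only the first occurrence
            row[header] = " | ".join(values)  # assumed cell layout
        rows.append(row)
    return rows


rows = build_rows(PAGES, COLUMNS)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["URL", "Title", *COLUMNS])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing first_only=True mirrors the option to extract only the first element when a page yields multiple results.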