Manual: Readability Analysis

The feature "Readability Analysis" of Visual SEO Studio, documented in detail.

Readability Analysis

The Readability Analysis feature can assist you in performing content audits. It computes readability scores, the average readability level, and the number of words, sentences and characters of a site's pages.

Readability Analysis in Visual SEO Studio

Readability formulas are well-accepted tools to automatically assess how easy a text is to read.
They are based on specific language statistics and work under some assumptions:

  • that the text they are applied to is written in the language the formula was built for;
  • that the text they are applied to is actually a meaningful text, written correctly;
  • that the text they are applied to is sufficiently long to respect the language statistics.

Readability (meant as the index computed by a traditional readability formula) is not directly a ranking factor, yet an easy-to-understand text offers a better user experience, and a search engine is more likely to promote it.
To know more about readability and SEO, read the article Readability and Search Engines.

Analysis criteria

Text language

Use this option to select the language of the texts to analyze. Each language has different statistics and syllabification rules, and needs to be analyzed using specific readability formulas.

Text language options in Readability Analysis
Text language options

We recommend always setting the correct language if available, even if you are only interested in the counters and not in readability scores: correct sentence computation takes into account the most common abbreviations for the selected language. For example, the text "Mr. Brown and Dr. Fox" is a single sentence in English.
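To illustrate why abbreviations matter for sentence counting, here is a minimal sketch in Python. The abbreviation list and the function name `count_sentences` are assumptions made for this example; they are not taken from the program, whose actual rules are more complete.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real tool would use a
# much larger, language-specific one.
ENGLISH_ABBREVIATIONS = frozenset({"mr.", "mrs.", "ms.", "dr.", "prof.", "e.g.", "i.e."})

_SENTENCE_END = re.compile(r"[.!?]+$")


def count_sentences(text: str, abbreviations=ENGLISH_ABBREVIATIONS) -> int:
    """Count sentences, ignoring periods that belong to a known abbreviation."""
    tokens = text.split()
    count = 0
    for token in tokens:
        # A token closes a sentence if it ends with '.', '!' or '?'
        # and is not a known abbreviation.
        if _SENTENCE_END.search(token) and token.lower() not in abbreviations:
            count += 1
    # Text that does not end with closing punctuation still forms a sentence.
    if tokens and not _SENTENCE_END.search(tokens[-1]):
        count += 1
    return count


print(count_sentences("Mr. Brown and Dr. Fox"))               # abbreviations known
print(count_sentences("Mr. Brown and Dr. Fox", frozenset()))  # abbreviations unknown
```

With the abbreviation list the text counts as a single sentence; without it, the two periods are mistaken for sentence boundaries and the count is inflated.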

XPath to content (single node)

An optional XPath expression to locate, within the page template, the exact DOM element (i.e. the HTML tag) that contains the meaningful content of the page.
The program already does its best to recognize and remove all boilerplate content: menus, header, footer, sidebar, etc. (you can use the Plain text viewer on the right pane to check the result), but with a proper XPath expression you can pinpoint the actual content with better precision.
The XPath expression should return a single node; if it returns a collection of nodes, only the first one is taken into account.
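The "first node only" behavior can be sketched as follows. This example uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; the sample page and the helper name `extract_content` are assumptions for illustration, not the program's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real-world HTML usually needs
# an HTML-aware parser instead of a strict XML one.
PAGE = """
<html><body>
  <div id="menu">Home, Contacts</div>
  <div class="post">First article body.</div>
  <div class="post">Second article body.</div>
</body></html>
"""


def extract_content(page: str, xpath: str) -> str:
    """Return the text of the first node matched by the expression, or ''."""
    root = ET.fromstring(page.strip())
    nodes = root.findall(xpath)
    if not nodes:
        return ""  # nothing left to analyze for this page
    # Only the first node of the collection is taken into account.
    return "".join(nodes[0].itertext()).strip()


print(extract_content(PAGE, ".//div[@class='post']"))
```

The expression matches two elements, yet only the text of the first one is returned; an expression that matches nothing yields an empty text.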

Analyze texts

When you click the Analyze texts button, the program analyzes the texts of the pages of the crawl session that match the given criteria, applying the readability formulas for the selected language.
The results will fill the table below.

Show/Hide options

Clicking on the Show options link lets you access further analysis options.

Path filter

You can filter the set of pages within the crawl session to only examine a subset of it.
Insert in this field the exclusion or inclusion rules, using the syntax specified in the "Path filter syntax" field.

Path filter syntax

The syntax to use for interpreting the value of the field "Path filter".

  • Regular expression
    You can use RegEx for complex expressions to filter the set of pages to be examined.
  • 'Allow' inclusion directive (robots.txt syntax)
    Suppose you want to analyze the English pages of your site, all organized within a folder named with the language code.
    You could insert the value:
    /en/
    Only the pages within that folder will be analyzed.
    You are not limited to folder names: the robots.txt syntax lets you match any part of the page path (only the path, not the domain name part of the URL).
    For example, using a wildcard character as in the following example:
    *conta
    you can include in your selection only the pages whose path contains that sequence of letters, like /en/contacts.html or /it/contatti.html and so on.
    To know more about the Allow directive syntax, you can read Google robots.txt documentation.
  • 'Disallow' exclusion directive (robots.txt syntax)
    Suppose you have a bilingual website, with all English content organized within a folder named with the language code, and the rest in Italian, and you want to analyze the Italian pages.
    You could insert the value:
    /en/
    The pages within that folder will be excluded from the analysis.
    You are not limited to folder names: the robots.txt syntax lets you match any part of the page path (only the path, not the domain name part of the URL).
    For example, using a wildcard character as in the following example:
    *conta
    you can exclude from your selection the pages whose path contains that sequence of letters, like /en/contacts.html or /it/contatti.html and so on.
    To know more about the Disallow directive syntax, you can read Google robots.txt documentation.
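The inclusion and exclusion examples above can be sketched in Python as follows. This is a minimal approximation of robots.txt-style matching (prefix match on the path, `*` as wildcard, optional trailing `$` as end anchor); the function names and the `example.com` URLs are assumptions for illustration, and the program's own matcher may differ in details.

```python
import re
from urllib.parse import urlsplit


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt-style path pattern into a compiled regex."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # '*' matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def path_matches(url: str, pattern: str) -> bool:
    """Match the pattern against the URL path only, not the domain name."""
    path = urlsplit(url).path or "/"
    return robots_pattern_to_regex(pattern).match(path) is not None


def filter_urls(urls, pattern, mode="allow"):
    """Keep matching URLs ('allow') or non-matching URLs ('disallow')."""
    keep_matching = mode == "allow"
    return [u for u in urls if path_matches(u, pattern) == keep_matching]


urls = [
    "https://example.com/en/contacts.html",
    "https://example.com/it/contatti.html",
    "https://example.com/it/prodotti.html",
]
print(filter_urls(urls, "/en/", "allow"))
print(filter_urls(urls, "/en/", "disallow"))
print(filter_urls(urls, "*conta", "allow"))
```

The same pattern selects complementary sets of pages depending on whether it is read as an inclusion or an exclusion rule.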

Skip 'noindex' pages

Selecting this option excludes from the analysis the pages marked not to be indexed by search engines.

Skip non-Canonical URLs

Selecting this option excludes from the analysis the pages marked as canonicalized to a different URL.

Skip empty elements

After the program has trimmed the page content to automatically remove the boilerplate part (menus, header, footer, sidebar, etc.), or after it has applied the XPath rule you entered in the "XPath to content (single node)" field, some pages could have no text left to analyze.
Selecting this option excludes such pages from the analysis.

Colorize cells

Selecting this option paints the background of the readability score columns with a color gradient spanning from bright green (good) to bright red (bad), based on the value of the score and its meaning.

Skip pages with less words than ...

Some readability formulas have been conceived to be statistically meaningful when the text has a minimum number of words.
For example, Flesch reading ease and all derived formulas require at least 100 words, even though they are often used with shorter texts.
With this option you can choose to exclude from the analysis the pages whose extracted text is shorter than a given number of words.

Skip pages with less sentences than ...

Some readability formulas have been conceived to be statistically meaningful when the text has a minimum number of sentences.
For example, the SMOG grade requires at least 30 sentences, even though it is often used with shorter texts.
With this option you can choose to exclude from the analysis the pages whose extracted text is shorter than a given number of sentences.
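The effect of the two minimum-length options can be sketched as a simple filter. The `PageText` record and the function name are hypothetical; the default thresholds mirror the 100 words and 30 sentences mentioned above as examples, not the program's defaults.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    """Hypothetical record holding the counters of one page's extracted text."""
    url: str
    words: int
    sentences: int


def skip_short_pages(pages, min_words=100, min_sentences=30):
    """Keep only the pages whose text reaches both minimum thresholds."""
    return [
        page for page in pages
        if page.words >= min_words and page.sentences >= min_sentences
    ]


pages = [
    PageText("/long-article", words=250, sentences=40),
    PageText("/short-note", words=80, sentences=35),
    PageText("/few-sentences", words=300, sentences=12),
]
print([page.url for page in skip_short_pages(pages)])
```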

Column headers, general

URL

URL of the page containing the text analyzed.

Detected Language

Readability formulas are based on specific language statistics, and thus only make sense when applied to a text in the language they were conceived for; for this reason the program provides a language detection check on the analyzed text.
When the selected language and the detected language do not match, the cell is highlighted with a red background.

Detected language column in Readability Analysis

You can check the analyzed text with the Plain text viewer on the right pane.

Keep in mind that language detection is not always exact, especially when little text is available: similar languages can be confused in short texts, and the use of foreign words can also cause false positives.
Short sentences in some Romance languages can occasionally be mistaken for Latin. We could have improved accuracy a little by removing the ability to recognize Latin, but since there are so many temporary "Lorem ipsum" texts forgotten out there on the Web, we decided to keep it and let our users detect them.

Language detection needs at least five words to attempt a recognition; the more, the better. Long texts can be recognized with greater precision, but take more computation time.
Since the program could potentially examine hundreds of thousands of texts or more, for performance's sake only the first twenty words found within the extracted text are examined, which according to our tests is a good trade-off between performance and precision.

Title

The title of the page containing the text analyzed, read from the HTML title tag.

Column headers, English language

FRES - Flesch reading ease score

Flesch reading ease is probably the best-known traditional readability formula.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words. FRES is thus based on word length in syllables and sentence length in words.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

The table below explains FRES score ranges:

| Score | US school level | Reading difficulty |
|---|---|---|
| 90 to 100 | 5th grade (10-11 years old) | Very easy to read. Easily understood by an average 11-year-old student. |
| 80 to 90 | 6th grade (11-12 years old) | Easy to read. Conversational English for consumers. |
| 70 to 80 | 7th grade (12-13 years old) | Fairly easy to read. |
| 60 to 70 | 8th & 9th grade (13-15 years old) | Plain English. Easily understood by 13- to 15-year-old students. |
| 50 to 60 | 10th to 12th grade (15-18 years old) | Fairly difficult to read. |
| 30 to 50 | College | Difficult to read. |
| 0 to 30 | College graduate | Very difficult to read. Best understood by university graduates. |

FRES is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.
Most of all, it only makes sense when applied to an English text. It is meaningless if applied to other languages.
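As a rough sketch of the computation, the commonly published form of the Flesch reading ease formula can be written in Python as below. The function names are assumptions for this example, the counters are expected to be computed elsewhere, and whether the program uses exactly these constants is not stated in this manual.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Commonly published FRES formula, computed from raw counters."""
    if words <= 0 or sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    average_sentence_length = words / sentences   # in words
    average_word_length = syllables / words       # in syllables
    return 206.835 - 1.015 * average_sentence_length - 84.6 * average_word_length


def clamp_score(score: float) -> float:
    """Treat values below 0 or above 100 as 0 or 100, as suggested above."""
    return max(0.0, min(100.0, score))


# 100 words in 5 sentences with 150 syllables: about 59.6 (10th to 12th grade band).
print(round(flesch_reading_ease(words=100, sentences=5, syllables=150), 3))
```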

Flesch-Kincaid

The Flesch-Kincaid grade level formula is probably the second best-known traditional readability formula after FRES. It owes part of its popularity to its inclusion in older versions of MS Word.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words. Flesch-Kincaid is thus based on word length in syllables and sentence length in words.

It produces an approximate representation of the US grade level needed to comprehend the text. The lower the grade level, the easier the text is to read.
If you are not familiar with US grade levels, the rule of thumb is: add 5 to the grade level to get the approximate age of a student at that grade level.

The table below maps grade and reading levels:

| School level | Student age | Reading difficulty |
|---|---|---|
| < 5th grade | < 10 years old | Extremely easy. |
| 5th grade | 10-11 years old | Very easy to read. Easily understood by an average 11-year-old student. |
| 6th grade | 11-12 years old | Easy to read. Conversational English for consumers. |
| 7th grade | 12-13 years old | Fairly easy to read. |
| 8th & 9th grade | 13-15 years old | Plain English. Easily understood by 13- to 15-year-old students. |
| 10th to 12th grade | 15-18 years old | Fairly difficult to read. |
| 13th to 16th grade (College) | 18-22 years old | Difficult to read. |
| > 16th grade (College graduate) | Ages vary | Very difficult to read. Best understood by university graduates. |

Flesch-Kincaid is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.
Most of all, it only makes sense when applied to an English text. It is meaningless if applied to other languages.
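A minimal Python sketch of the commonly published Flesch-Kincaid grade level formula follows. The function name is an assumption for this example, and the constants are those of the standard published formula rather than values confirmed by this manual.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Commonly published Flesch-Kincaid grade level formula."""
    if words <= 0 or sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# 100 words in 5 sentences with 150 syllables: roughly a 10th grade text.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2), "approximate student age:", round(grade) + 5)
```

The last line applies the rule of thumb given above: adding 5 to the grade level gives the approximate student age.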

Gunning-Fog

The Gunning-Fog formula is a well-known traditional readability formula.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words, where "complex words" are assumed to be composed of three syllables or more. Gunning-Fog is thus based on the percentage of words with at least three syllables and on sentence length in words.

It produces the Gunning-Fog Index, an approximate representation of the US grade level needed to comprehend the text. The lower the grade level, the easier the text is to read.
If you are not familiar with US grade levels, the rule of thumb is: add 5 to the grade level to get the approximate age of a student at that grade level.

See the Flesch-Kincaid case for a table mapping grade and reading levels.

The Gunning-Fog Index is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.
Most of all, it only makes sense when applied to an English text. It is meaningless if applied to other languages.
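The commonly published Gunning-Fog formula can be sketched in Python as follows. The function name is an assumption for this example; the original method also excludes some word categories (such as proper nouns) from the complex word count, a refinement this sketch leaves to the caller.

```python
def gunning_fog_index(words: int, sentences: int, complex_words: int) -> float:
    """Commonly published Gunning-Fog formula.

    complex_words is the number of words with three syllables or more.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    average_sentence_length = words / sentences
    complex_word_percentage = 100.0 * complex_words / words
    return 0.4 * (average_sentence_length + complex_word_percentage)


# 100 words in 5 sentences, 10 of them complex: about grade 12.
print(round(gunning_fog_index(words=100, sentences=5, complex_words=10), 2))
```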

SMOG

The SMOG formula is a traditional readability formula, widely used to evaluate the reading difficulty of consumer-oriented healthcare material.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words, where "complex words" are assumed to be composed of three syllables or more. SMOG is thus based on the number of words with at least three syllables relative to the number of sentences.

It produces the SMOG grade, an approximate representation of the US grade level needed to comprehend the text. The lower the grade level, the easier the text is to read.
If you are not familiar with US grade levels, the rule of thumb is: add 5 to the grade level to get the approximate age of a student at that grade level.

See the Flesch-Kincaid case for a table mapping grade and reading levels.

The SMOG grade is statistically meaningful only for texts of at least 30 sentences, even though it is often used for shorter texts.
Most of all, it only makes sense when applied to an English text. It is meaningless if applied to other languages.
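A minimal Python sketch of the commonly published SMOG formula is shown below. The function name is an assumption for this example; the polysyllable count is normalized to a sample of 30 sentences, which is why shorter texts give less reliable results.

```python
import math


def smog_grade(sentences: int, polysyllables: int) -> float:
    """Commonly published SMOG formula.

    polysyllables is the number of words with three syllables or more.
    """
    if sentences <= 0:
        raise ValueError("the text must contain at least one sentence")
    # Scale the polysyllable count to a 30-sentence sample.
    normalized = polysyllables * (30 / sentences)
    return 1.0430 * math.sqrt(normalized) + 3.1291


# 30 sentences containing 36 polysyllabic words: about grade 9.4.
print(round(smog_grade(sentences=30, polysyllables=36), 4))
```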

ARI - Automated Readability Index

The ARI formula is a traditional readability formula.
It is fast to compute because it is based on word length in characters rather than syllables. The formula takes into account the number of sentences, words and characters.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words. ARI is thus based on the average word length in characters and the average sentence length in words.

It produces the Automated Readability Index, an approximate representation of the US grade level needed to comprehend the text. The lower the grade level, the easier the text is to read.
If you are not familiar with US grade levels, the rule of thumb is: add 5 to the grade level to get the approximate age of a student at that grade level.

In ARI the approximation to grade levels is distorted for higher values:

  • Values of ARI from 1 to 12 map to the US grade levels.
    See the Flesch-Kincaid case for a table mapping grade and reading levels.
  • ARI = 13 is equivalent to the reading ability of a college student (18-24 year old).
  • ARI = 14 is equivalent to the reading ability of a Professor.

The Automated Readability Index only makes sense when applied to an English text. It is meaningless if applied to other languages.
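The commonly published ARI formula can be sketched in Python as below. The function name is an assumption for this example, and the character count is assumed to include only letters and digits, as in the usual definition of the index.

```python
def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Commonly published ARI formula, based on characters instead of syllables."""
    if words <= 0 or sentences <= 0:
        raise ValueError("the text must contain at least one word and one sentence")
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# 500 characters, 100 words, 5 sentences: about grade 12.
print(round(automated_readability_index(characters=500, words=100, sentences=5), 2))
```

Since no syllable counting is needed, the three counters listed under the generic counters are enough to compute the index.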

Coleman-Liau

The Coleman-Liau formula is a traditional readability formula.
It is fast to compute because it is based on word length in characters rather than syllables. The formula takes into account the number of sentences, words and characters.
The underlying idea is that texts that are easier to read use short sentences and avoid complex words. Coleman-Liau is thus based on the average word length in characters and the average sentence length in words.

It produces the Coleman-Liau Index, an approximate representation of the US grade level needed to comprehend the text. The lower the grade level, the easier the text is to read.
If you are not familiar with US grade levels, the rule of thumb is: add 5 to the grade level to get the approximate age of a student at that grade level.

See the Flesch-Kincaid case for a table mapping grade and reading levels.

The Coleman-Liau Index is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.
Most of all, it only makes sense when applied to an English text. It is meaningless if applied to other languages.
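A minimal Python sketch of the commonly published Coleman-Liau formula follows. The function name is an assumption for this example; the formula works on letters and sentences per 100 words, which is equivalent to using average word and sentence lengths.

```python
def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Commonly published Coleman-Liau formula."""
    if words <= 0:
        raise ValueError("the text must contain at least one word")
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# 500 letters, 100 words, 5 sentences: about grade 12.
print(round(coleman_liau_index(letters=500, words=100, sentences=5), 2))
```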

Average grade

The Average grade (average reading level) is the mean of all the indexes that estimate the US grade level needed to comprehend an English text.
The indexes used are:

  • Automated Readability Index (ARI)
  • Flesch–Kincaid grade level
  • Gunning-Fog index
  • SMOG index
  • Coleman–Liau index

The aim is to provide an index that smooths out the approximation errors of the individual indexes.
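The averaging step can be sketched in Python as a plain arithmetic mean of the five grade-level indexes. The function name and the sample values are assumptions for illustration; the manual does not state whether the program applies any weighting or rounding.

```python
from statistics import fmean


def average_grade(ari: float, flesch_kincaid: float, gunning_fog: float,
                  smog: float, coleman_liau: float) -> float:
    """Arithmetic mean of the five grade-level indexes listed above."""
    return fmean([ari, flesch_kincaid, gunning_fog, smog, coleman_liau])


# Hypothetical scores for one page: the individual indexes disagree slightly,
# and the mean smooths their differences.
print(average_grade(ari=12.0, flesch_kincaid=10.0, gunning_fog=12.0,
                    smog=9.0, coleman_liau=12.0))
```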

Column headers, Italian language

Vacca 72

The "Flesch - Vacca - Franchina (1972)" readability formula is an adaptation to the Italian language of the FRES ("Flesch reading ease") formula.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Vacca 86

In 1986 Vacca wrote a revision of his adaptation of FRES, calibrating the formula weights so that the formula would return the same result when applied to the text of one of his books, available both in Italian and English.
As you can see, the multilingual document corpus that the "Flesch - Vacca - Franchina (1986)" formula is based upon is quite restricted (only one book!); subsequent independent studies judged the first formula, Vacca 72, statistically more reliable for evaluating texts in Italian.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Gulpease

The "Indice Gulpease" is a readability index calibrated on Italian language.
It is fast to compute because it is based on word length in characters rather than syllables. The formula takes into account the number of sentences, words and characters.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.

The table below relates the score ranges to the reader's education level (Italian school system):

| Score | Primary education (12+ year old) | Lower secondary education (15+ year old) | Upper secondary education (20+ year old) |
|---|---|---|---|
| 95 to 100 | Very easy, independent reading level | Very easy, independent reading level | Very easy, independent reading level |
| 80 to 95 | Easy, independent reading level | Very easy, independent reading level | Very easy, independent reading level |
| 70 to 80 | Hard, school reading level | Easy, independent reading level | Very easy, independent reading level |
| 60 to 70 | Very hard, school reading level | Easy, independent reading level | Easy, independent reading level |
| 55 to 60 | Very hard, frustration level | Hard, school reading level | Easy, independent reading level |
| 50 to 55 | Almost incomprehensible, frustration level | Hard, school reading level | Easy, independent reading level |
| 40 to 50 | Almost incomprehensible, frustration level | Very hard, school reading level | Easy, independent reading level |
| 35 to 40 | Almost incomprehensible, frustration level | Very hard, school reading level | Hard, school reading level |
| 30 to 35 | Almost incomprehensible, frustration level | Almost incomprehensible, frustration level | Hard, school reading level |
| 15 to 30 | Almost incomprehensible, frustration level | Almost incomprehensible, frustration level | Very hard, school reading level |
| 10 to 15 | Almost incomprehensible, frustration level | Almost incomprehensible, frustration level | Very hard, frustration level |
| 0 to 10 | Almost incomprehensible, frustration level | Almost incomprehensible, frustration level | Almost incomprehensible, frustration level |

As you can see, the Gulpease index should be evaluated against three possible categories of readers with different education levels.
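As a rough sketch, the commonly published Gulpease formula can be written in Python as follows. The function name is an assumption for this example, and the manual does not confirm that the program uses exactly this form.

```python
def gulpease_index(letters: int, words: int, sentences: int) -> float:
    """Commonly published Gulpease formula, based on letters instead of syllables."""
    if words <= 0:
        raise ValueError("the text must contain at least one word")
    return 89 + (300 * sentences - 10 * letters) / words


# 500 letters, 100 words, 5 sentences: a score of 54, in the "50 to 55" row.
print(gulpease_index(letters=500, words=100, sentences=5))
```

Read against the table above, a score of 54 is almost incomprehensible for a reader with primary education, hard for one with lower secondary education, and easy for one with upper secondary education.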

Column headers, French language

Kandel Moles

The "Kandel-Moles" readability formula is an adaptation to the French language of the FRES ("Flesch reading ease") formula.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Szigriszt (French version)

Szigriszt's research attempted to modify the FRES formula so that it would return more or less the same score for a series of texts translated into English, French and Spanish.
The result was three formulas, one for each of the three languages. While the English version did not get much attention, the other two have been used for French and Spanish.

The "Lisibilité de Szigriszt" readability formula is an adaptation to the French language of the FRES ("Flesch reading ease") formula.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Column headers, Spanish language

Huerta

The "Índice lecturabilidad Flesch-Fernández Huerta" readability formula is an adaptation to the Spanish language of the FRES ("Flesch reading ease") formula.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Szigriszt (Spanish version)

Szigriszt's research attempted to modify the FRES formula so that it would return more or less the same score for a series of texts translated into English, French and Spanish.
The result was three formulas, one for each of the three languages. While the English version did not get much attention, the other two have been used for French and Spanish.

The "Perspicuidad de Szigriszt" readability formula is an adaptation to the Spanish language of the FRES ("Flesch reading ease") formula.

It returns an index from 0 to 100: the higher the score, the easier the text is to read.
Values below 0 or above 100 are possible, and should be treated as if the score were 0 or 100; we decided to show them anyway for better clarity.

Like FRES, it is statistically meaningful only for texts of about 100 words or longer, even though it is often used for shorter texts.

See the FRES case for a table comparing the index ranges to a qualitative description of the reading difficulty.

Column headers, generic counters

Words

The number of words found in the text analyzed within the page.

Sentences

The number of sentences found in the text analyzed within the page.

Characters

The number of characters found in the text analyzed within the page.