Indexiness is a numeric score that helps to distinguish between index pages and leaf pages. Index pages are navigation-heavy, while leaf pages carry the actual content. The score is mostly derived from a page's link elements.
A negative indexiness score means that a document is a leaf page (not very indexy); a positive score means that the document is a navigation page. A higher absolute score means the algorithm is more confident in its classification. The absolute score also correlates with the number of links on a page.
Relevant Data-Structures
What makes a Page an Index or a Leaf?
The following are some stereotypical criteria for each page type:
Index pages have:
- Fewer outlinks
- No publishing date
- Inlinks in lists
- Inlinks in headers
Leaf pages have:
- More outlinks
- Selflinks in headers (to get link-able headers)
- Selflinks in lists (table of contents)
- A publishing date
There are other criteria too, such as the paragraph size/count that readability.js uses, but the ones above seem to be enough to classify most pages correctly.
Metrics
To calculate the indexiness, the following metrics have to be known about a document:
Metric | How to get |
---|---|
has_publishing_date | Did the crawler find a publishing date in the page metadata? |
selflinks_in_headers | How many selflinks either are in a headline or contain a headline? |
selflinks_in_lists | How many selflinks are in some kind of list? |
inlinks_in_headers | How many inlinks either are in a headline or contain a headline? |
inlinks_in_lists | How many inlinks are in some kind of list? |
number_of_outlinks | How many outlinks are on the page? |
Counting Links
To be relevant for indexiness, a link must fulfill all of the following criteria:
- It is a link intended for document navigation (HTML `a` tag)
- Not `in_nav`
- Not `in_aside`
- One of: `in_main`, `in_article`
- Neither `in_footer` nor `in_header`
Each relevant link only gets counted once: being relevant for the `_in_headers` metrics beats being relevant for the `_in_lists` metrics if both would apply.
Note: the `in_nav`/`in_aside` requirement might be lifted for selflinks so that correctly marked-up tables of contents are not penalized.
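The counting rules above can be sketched roughly as follows. This is only an illustration, not the actual implementation: the `Link` and `LinkKind` types, the `in_headline` field (used here for "in a headline or containing one", to keep it apart from the `in_header` flag) and the helper names are all hypothetical. It also assumes that a selflink points at the same document, an inlink at another page of the same site and an outlink at a different site, and that outlinks have to pass the same relevance filter.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class LinkKind(Enum):
    SELF = "selflink"  # assumed: points at the same document
    IN = "inlink"      # assumed: points at another page of the same site
    OUT = "outlink"    # assumed: points at a different site


@dataclass(frozen=True)
class Link:
    """Hypothetical per-link record; the flag names mirror the criteria list."""
    kind: LinkKind
    tag: str = "a"
    in_nav: bool = False
    in_aside: bool = False
    in_main: bool = False
    in_article: bool = False
    in_footer: bool = False
    in_header: bool = False
    in_headline: bool = False  # link is inside a headline or wraps one
    in_list: bool = False


def is_relevant(link: Link) -> bool:
    """A link is relevant only if every criterion holds."""
    return (
        link.tag == "a"
        and not link.in_nav
        and not link.in_aside
        and (link.in_main or link.in_article)
        and not link.in_footer
        and not link.in_header
    )


def metric_for(link: Link) -> Optional[str]:
    """Name of the single metric a link counts towards, or None."""
    if not is_relevant(link):
        return None
    if link.kind is LinkKind.OUT:
        # Assumption: outlinks are filtered for relevance like other links.
        return "number_of_outlinks"
    prefix = "selflinks" if link.kind is LinkKind.SELF else "inlinks"
    if link.in_headline:  # headers beat lists if both would apply
        return f"{prefix}_in_headers"
    if link.in_list:
        return f"{prefix}_in_lists"
    return None


def count_metrics(links: Iterable[Link]) -> Counter:
    """Count every relevant link exactly once under its metric."""
    return Counter(m for m in map(metric_for, links) if m is not None)
```

Because `metric_for` returns at most one metric name per link, a link that sits in a headline inside a list is counted only under the `_in_headers` metric.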
Metrics to Indexiness
+
+
+
+
+
if has_date_published else +
-10
The metrics are weighted by how strong of an indicator they are compared to the weakest indicator.
The headstart of -10 is chosen so that the page starts out relatively leafy, but one headline-inlink with no additional clues makes the score barely positive.
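As a rough sketch of how such a weighted sum with a headstart might be computed: the `PageMetrics` and `Weights` types are hypothetical, and every number in `EXAMPLE_WEIGHTS` is made up purely for illustration and is not the real weighting. The values are only picked so that the signs follow the criteria lists (index indicators positive, leaf indicators negative) and so that a single headline-inlink ends up barely above zero.

```python
from dataclasses import dataclass

HEADSTART = -10  # from the text: pages start out relatively leafy


@dataclass(frozen=True)
class PageMetrics:
    """The six metrics from the table above."""
    has_publishing_date: bool
    selflinks_in_headers: int
    selflinks_in_lists: int
    inlinks_in_headers: int
    inlinks_in_lists: int
    number_of_outlinks: int


@dataclass(frozen=True)
class Weights:
    """Hypothetical container for the per-metric weights."""
    selflinks_in_headers: int
    selflinks_in_lists: int
    inlinks_in_headers: int
    inlinks_in_lists: int
    number_of_outlinks: int
    date_published: int     # term added if a publishing date was found
    no_date_published: int  # term added otherwise


# ILLUSTRATIVE VALUES ONLY, not the real weights.
EXAMPLE_WEIGHTS = Weights(
    selflinks_in_headers=-4,
    selflinks_in_lists=-2,
    inlinks_in_headers=11,
    inlinks_in_lists=3,
    number_of_outlinks=-1,
    date_published=-20,
    no_date_published=0,
)


def indexiness(m: PageMetrics, w: Weights) -> int:
    """Weighted sum of the metrics plus the date term and the headstart."""
    date_term = w.date_published if m.has_publishing_date else w.no_date_published
    return (
        w.selflinks_in_headers * m.selflinks_in_headers
        + w.selflinks_in_lists * m.selflinks_in_lists
        + w.inlinks_in_headers * m.inlinks_in_headers
        + w.inlinks_in_lists * m.inlinks_in_lists
        + w.number_of_outlinks * m.number_of_outlinks
        + date_term
        + HEADSTART
    )
```

With these made-up weights, a page without any counted links and without a date scores exactly the headstart, and adding one inlink in a headline moves it just above zero.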
Known Misclassifications
Inside Rust Blog, table based index (found: 2024-08-24)
https://blog.rust-lang.org/inside-rust/
The index page makes use of tables, which are not recognized by the current algorithm iteration.
A quick fix would be to consider links in tables equivalent to links in lists, though this will clash with other uses of tables.
Another way of dealing with this is to compare the number of table links to the total number of relevant links and only apply that metric if a significant share of the links is inside tables. This threshold should be pretty high, though.
Note that tables are also often used on leafy pages for inlinks to other pages:
- https://www.freedesktop.org/software/libqmi/libqmi-glib/latest/libqmi-glib-DMS-Set-FCC-Authentication-response.html (gtk-doc generated, currently a correct negative indexiness)
- https://wiki.postmarketos.org/wiki/Qualcomm_Snapdragon_415/615/616_(MSM8929/MSM8939) (table-heavy wiki page, currently a semi-correct indexiness of 6)
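The ratio-based idea could be sketched like this. The function name and the default threshold of 0.8 are assumptions made for illustration; the text only says that the threshold should be pretty high.

```python
def table_links_count_as_list_links(
    table_links: int,
    relevant_links: int,
    threshold: float = 0.8,  # hypothetical value, would need tuning
) -> bool:
    """Treat table links like list links only on table-dominated pages."""
    if relevant_links <= 0:
        # No relevant links at all: nothing to reclassify.
        return False
    return table_links / relevant_links >= threshold
```

A page where nearly all relevant links sit in tables would then have them counted like list links, while a page with only a few table links would keep its current score.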
Blogger, multiple issues with blog entries fully replicated on index page (found: 2024-10-20)
https://www.righto.com/2019/
The index pages basically have the full article content here, including an <a href="https://static.…"><img …></a> construct. The links that make the images interactive shouldn't affect the indexiness algorithm and should be treated as part of the images.
The alt text of the `img` tags doesn't get counted as text on the link wrapped around the images, so these links are basically empty.
A quick fix would be to ignore links without text on them.
Note: After applying the quickfix the page still gets misclassified as a leaf-page with a score of -279 (instead of roughly -380 before).
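A minimal sketch of that quick fix, assuming links are available as hypothetical `(href, text)` pairs in which `text` is the extracted link text (so an image-only link has an empty or whitespace-only text, since alt text is not counted):

```python
def has_link_text(text: str) -> bool:
    """True if the link carries any non-whitespace text."""
    return bool(text.strip())


def drop_empty_links(links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only links with text, so image wrappers are ignored."""
    return [(href, text) for href, text in links if has_link_text(text)]
```

The filter would run before the links are counted, so image-only links never reach the metrics.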