Scoring Overview

To understand the scoring parameters exposed by the Infinit.e API, it helps to have a basic understanding of the query and scoring process:

  • The user's query is converted into an ElasticSearch query and applied across the cluster.

  • The documents are ordered by their Lucene score (or optionally just by descending date).

    • The number of documents returned from ElasticSearch is capped at a "large" number (by default 1,000, i.e. 10x the number of documents to return).

  • Each returned document is then assigned a Significance score as described below.

  • Significance and relevance scores are then normalized against each other, based on the ratio specified in the advanced options (default 2:1 in favor of significance), and combined, with the mean score set to 100.

  • The top-scoring documents or entities are returned in response to the query.
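The combination step above can be illustrated with a minimal Python sketch. The exact normalization used by Infinit.e is not specified here, so this assumes each score list is first rescaled to a mean of 1, blended with a 2:1 weighting in favor of significance, and then rescaled so the mean combined score is 100. The function name `combine_scores` and its parameters are hypothetical and are not part of the Infinit.e API.

```python
from statistics import mean


def combine_scores(significance, relevance,
                   sig_weight=2.0, rel_weight=1.0, target_mean=100.0):
    """Blend per-document significance and relevance scores.

    Assumption (not confirmed by the source): each list is rescaled to
    mean 1 before weighting, and the blended result is rescaled so that
    its mean equals target_mean.
    """
    if len(significance) != len(relevance):
        raise ValueError("score lists must have the same length")
    if not significance:
        return []

    sig_mean = mean(significance)
    rel_mean = mean(relevance)

    # Rescale each list to mean 1 so that neither score dominates
    # merely because of its raw scale.
    sig_norm = [s / sig_mean if sig_mean else 0.0 for s in significance]
    rel_norm = [r / rel_mean if rel_mean else 0.0 for r in relevance]

    # Weighted blend; the default 2:1 favors significance.
    total_weight = sig_weight + rel_weight
    blended = [(sig_weight * s + rel_weight * r) / total_weight
               for s, r in zip(sig_norm, rel_norm)]

    # Rescale so that the mean combined score equals target_mean.
    blended_mean = mean(blended)
    if blended_mean == 0:
        return [0.0] * len(blended)
    return [target_mean * b / blended_mean for b in blended]
```

With the default weights, a document that is strong on significance but weak on relevance outranks one with the opposite profile, which is the effect the 2:1 ratio is meant to produce.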


Significance

  • All entities identified within a document are assigned a significance score based on...

  • Each document is also assigned a significance score, an aggregate of all entity scores within.

  • Each entity is also assigned a "datasetSignificance", the average of the significance scores of all documents in which it appears.

  • For each entity/document pair, a TF-IDF score is generated; this is the entity's "significance". This score is adjusted in a number of ways:

    • Entities with low document counts have their significance suppressed by 33% (well below a dynamically calculated "noise floor") or 66% (just below/at the "noise floor")

    • When only a subset of the matching documents is returned (i.e. more than 1,000 documents match), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset.
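As a rough illustration of how the three scores relate, the following Python sketch computes a per-pair TF-IDF score, sums it into a document significance, and averages document significances into a per-entity "datasetSignificance". The TF-IDF variant used here (`tf * log(1 + N / df)`) is an assumption, since the exact formula is not given, and the noise-floor suppression and subset adjustment described above are deliberately left out. All function names are hypothetical.

```python
import math
from collections import defaultdict


def entity_significance(docs):
    """Return {doc_id: {entity: score}} for docs = {doc_id: {entity: count}}.

    Assumption: a textbook TF-IDF variant, tf * log(1 + N / df). The
    real system also applies noise-floor suppression and a subset
    adjustment, both omitted from this sketch.
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each entity appears in.
    doc_freq = defaultdict(int)
    for counts in docs.values():
        for entity in counts:
            doc_freq[entity] += 1

    return {
        doc_id: {entity: tf * math.log(1 + n_docs / doc_freq[entity])
                 for entity, tf in counts.items()}
        for doc_id, counts in docs.items()
    }


def document_significance(entity_scores):
    """Document significance: the sum of its entity significances."""
    return {doc_id: sum(scores.values())
            for doc_id, scores in entity_scores.items()}


def dataset_significance(entity_scores, doc_scores):
    """Per entity: the average significance of the documents it appears in."""
    per_entity = defaultdict(list)
    for doc_id, scores in entity_scores.items():
        for entity in scores:
            per_entity[entity].append(doc_scores[doc_id])
    return {entity: sum(values) / len(values)
            for entity, values in per_entity.items()}
```

For example, calling `entity_significance({"d1": {"alice": 2, "acme": 1}, "d2": {"acme": 3}})` gives "alice" a higher weight per mention than "acme", because "alice" appears in only one of the two documents.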

 

Relevance 

  • Relevance measures how well a document matches your query, as opposed to significance, which measures how well an entity matches your query.


In summary: relevance measures how well a document matches the user's query; significance measures how well an entity matches the user's query; and document significance is simply the sum of the entity significances.