Scoring Overview

...

 1. The user's query is turned into an ElasticSearch query and applied across the cluster.

 2. The matching documents are ordered by their Lucene score (or, optionally, just by descending date).

  • The number of documents returned from ElasticSearch is capped at a "large" number (the default is 1,000, i.e. 10x the number of documents to return).

 3. Each returned document is then assigned a significance score, as described below.

 4. The significance and relevance scores are then normalized against each other based on the ratio specified in the advanced options (default 2:1 in favor of significance) and combined, with the mean score set to 100.

 5. The top-scoring documents or entities are returned in response to the query.
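
As a rough illustration of step 4, the Python sketch below shows one way the two scores could be blended. The page does not spell out the normalization, so the sketch assumes that each score type is rescaled until its mean equals its share of 100 under the configured ratio. The function name `combine_scores` and its parameters are hypothetical and are not part of the actual API.

```python
def combine_scores(significance, relevance, sig_weight=2.0, rel_weight=1.0,
                   target_mean=100.0):
    """Blend per-document significance and relevance scores.

    Assumption: each score list is rescaled so that its mean equals its
    weighted share of target_mean (2:1 in favor of significance by default),
    and the two rescaled scores are then added together.
    """
    if len(significance) != len(relevance):
        raise ValueError("one significance and one relevance score per document")
    if not significance:
        return []

    total_weight = sig_weight + rel_weight
    sig_mean = sum(significance) / len(significance)
    rel_mean = sum(relevance) / len(relevance)

    # Scale factors that move each mean to its share of the target mean.
    # If a score type is all zeros it contributes nothing, and the combined
    # mean will then fall short of target_mean.
    sig_scale = target_mean * (sig_weight / total_weight) / sig_mean if sig_mean else 0.0
    rel_scale = target_mean * (rel_weight / total_weight) / rel_mean if rel_mean else 0.0

    return [s * sig_scale + r * rel_scale
            for s, r in zip(significance, relevance)]
```

Under these assumptions the combined scores average 100, and sorting them in descending order gives the ranking from which the top documents are taken.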


Significance

1. All entities identified within a document are assigned a significance score based on...

2. Each document is also assigned a significance score, an aggregate of the scores of all entities within it.

3. Entities are also assigned a "datasetSignificance", an average of the significance scores of all documents in which the entity appears.

4. For each entity/document pair, a TF-IDF score is generated; this is the entity's "significance". This score is adjusted in a number of ways:

  • Entities with low document counts have their significance suppressed by 33% (well below a dynamically calculated "noise floor") or 66% (just below/at the "noise floor")

  • When only a subset of the matching documents is returned (i.e. more than 1,000 documents match), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset.
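
The relationship between the three significance values can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: it assumes the textbook TF-IDF form (frequency times the log of the inverse document frequency), takes document significance as the sum of its entity scores, and leaves out both the noise-floor suppression and the subset adjustment, whose exact formulas are not given here. The names `entity_significance` and `score_documents` are hypothetical.

```python
import math
from collections import defaultdict


def entity_significance(term_freq, doc_count, total_docs):
    """Textbook TF-IDF for one entity/document pair (assumed form)."""
    return term_freq * math.log(total_docs / doc_count)


def score_documents(docs):
    """Compute the three significance values.

    docs maps a document id to {entity: frequency within that document}.
    Returns (pair_scores, doc_scores, dataset_scores).
    """
    total_docs = len(docs)

    # Number of documents in which each entity appears.
    doc_count = defaultdict(int)
    for entities in docs.values():
        for entity in entities:
            doc_count[entity] += 1

    # Significance of each entity within each document.
    pair_scores = {
        doc_id: {
            entity: entity_significance(freq, doc_count[entity], total_docs)
            for entity, freq in entities.items()
        }
        for doc_id, entities in docs.items()
    }

    # Document significance: the sum of its entity significances.
    doc_scores = {doc_id: sum(scores.values())
                  for doc_id, scores in pair_scores.items()}

    # datasetSignificance: the mean significance of the documents
    # in which the entity appears.
    dataset_scores = {
        entity: sum(doc_scores[d] for d in docs if entity in docs[d]) / count
        for entity, count in doc_count.items()
    }
    return pair_scores, doc_scores, dataset_scores
```

For example, an entity that appears in every returned document gets a TF-IDF of zero under this form, so it adds nothing to the significance of the documents that contain it.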

 

Relevance 

Relevance measures how well a document matches your query, as opposed to significance, which measures how well an entity matches your query.


In summary: Relevance measures how well a document matches the user's query; Significance measures how well an entity matches the user's query; Document significance is simply the sum of the entity significances.