Scoring Overview

To understand the scoring parameters exposed by the Infinit.e API, it helps to have a basic understanding of the query and scoring process:

 1. The user query is converted into an ElasticSearch query and applied across the cluster.

 2. The matching documents are ordered by their Lucene score (or, optionally, just by descending date).

  • The number of documents returned from ElasticSearch is capped at a "large" number (the default is 1,000, i.e. 10x the number of documents to return).

 3. Each returned document is then assigned a significance score, as described below.

 4. The significance and relevance scores are then normalized against each other according to the ratio specified in the advanced options (default 2:1 in favor of significance) and combined, with the mean score set to 100.

 5. The top-scoring documents or entities are returned in the query response.
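
The combination in steps 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: the page does not give the exact normalization formula, so the sketch assumes each score set is rescaled so that its mean equals its weighted share of 100 (two thirds for significance, one third for relevance at the default 2:1 ratio) before the two are added. The names combine_scores and top_documents are hypothetical and are not part of the Infinit.e API.

```python
def combine_scores(significance, relevance,
                   sig_weight=2.0, rel_weight=1.0, target_mean=100.0):
    """Blend per-document scores so that the combined mean equals target_mean.

    Assumption: each score set is rescaled so its mean equals its weighted
    share of target_mean. The real normalization may differ.
    """
    if len(significance) != len(relevance):
        raise ValueError("need one significance and one relevance score per document")
    n = len(significance)
    if n == 0:
        return []

    def scale_to_share(scores, weight):
        mean = sum(scores) / n
        share = target_mean * weight / (sig_weight + rel_weight)
        if mean > 0:
            return [s * share / mean for s in scores]
        # Degenerate case (all scores zero): spread the share evenly so the
        # combined mean still equals target_mean.
        return [share] * n

    sig = scale_to_share(significance, sig_weight)
    rel = scale_to_share(relevance, rel_weight)
    return [s + r for s, r in zip(sig, rel)]


def top_documents(doc_ids, combined, limit=100):
    """Return the highest-scoring (doc_id, score) pairs, best first."""
    ranked = sorted(zip(doc_ids, combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]
```

With significance scores [1.0, 3.0] and relevance scores [0.2, 0.6], this sketch yields combined scores of 50 and 150, whose mean is 100.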


Significance

1. All entities identified within a document are assigned a significance score based on...

2. Each document is also assigned a significance score, an aggregate (the sum) of the scores of all entities within it.

3. Each entity is also assigned a "datasetSignificance", the average of the significance scores of all documents in which it appears.

4. For each entity/document pair, a TF-IDF score is generated; this is the entity's "significance". This score is adjusted in a number of ways:

  • Entities with low document counts have their significance suppressed by 33% (well below a dynamically calculated "noise floor") or 66% (just below/at the "noise floor")

  • When only a subset of the matching documents is returned (i.e. more than 1,000 documents match), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset.
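
As a rough illustration of step 4 and the low-count suppression, the sketch below uses the textbook TF-IDF formula, which may differ from the variant Infinit.e actually applies. The noise floor is calculated dynamically by the platform and the calculation is not described here, so it is passed in as a plain number. The well_below_ratio cutoff and both function names are hypothetical, and the 33% and 66% reductions follow the wording above literally.

```python
import math


def tf_idf(term_count, docs_with_entity, total_docs):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    if docs_with_entity <= 0 or total_docs <= 0:
        raise ValueError("document counts must be positive")
    return term_count * math.log(total_docs / docs_with_entity)


def suppress_low_count(significance, doc_count, noise_floor, well_below_ratio=0.5):
    """Reduce the significance of entities that appear in few documents.

    Assumptions: "well below" means under well_below_ratio * noise_floor,
    and "suppressed by 33%" means the score is multiplied by 0.67.
    """
    if doc_count > noise_floor:
        return significance                 # above the floor: unchanged
    if doc_count < noise_floor * well_below_ratio:
        return significance * (1.0 - 0.33)  # well below the floor
    return significance * (1.0 - 0.66)      # just below or at the floor
```

For example, with a noise floor of 10 documents, an entity with a raw significance of 10.0 seen in a single document would be reduced to about 6.7, while one seen in 8 documents would be reduced to about 3.4.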

 

Relevance 

Relevance measures how well a document matches your query, as opposed to significance, which measures how well an entity matches your query.


In summary: relevance measures how well a document matches the user's query; significance measures how well an entity matches the user's query; document significance is simply the sum of the entity significances.
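
The two aggregates described on this page can be sketched in a few lines. This is a minimal illustration rather than the Infinit.e implementation: it assumes the per-entity significances are already available as a mapping from document ID to {entity name: significance}, and the function names are hypothetical.

```python
from collections import defaultdict


def document_significance(entity_scores):
    """Document significance: the sum of its entity significances."""
    return sum(entity_scores.values())


def dataset_significance(docs):
    """Per entity, average the significance of every document it appears in.

    docs maps a document ID to {entity name: entity significance}.
    """
    doc_totals = {doc_id: document_significance(entities)
                  for doc_id, entities in docs.items()}
    per_entity = defaultdict(list)
    for doc_id, entities in docs.items():
        for name in entities:
            per_entity[name].append(doc_totals[doc_id])
    return {name: sum(totals) / len(totals) for name, totals in per_entity.items()}
```

For the input {"d1": {"a": 1.0, "b": 2.0}, "d2": {"a": 4.0}}, the document significances are 3.0 and 4.0, so entity "a" gets a datasetSignificance of 3.5 and entity "b" gets 3.0.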

