In order to understand the scoring parameters presented by the Infinit.e API, it is necessary to have a basic understanding of the query and scoring process:

 1. The user query is turned into an ElasticSearch query and applied across the cluster

 2. The documents are ordered by their Lucene score (or optionally just by descending date).

  • The number of documents returned from ElasticSearch is capped at a "large" number (default 1,000, i.e. 10x the number of documents to return)

3. Each returned document is then assigned a Significance score as described below.
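A minimal sketch of steps 2 and 3, assuming each hit is a plain record with illustrative "luceneScore" and "publishedDate" fields (these names, and the function itself, are not part of the actual API):

def rank_and_cap(hits, docs_to_return=100, by_date=False):
    # Step 2: order by Lucene score, or optionally just by descending date
    key = "publishedDate" if by_date else "luceneScore"
    ranked = sorted(hits, key=lambda h: h[key], reverse=True)
    # Cap the set passed on to significance scoring at a "large" number,
    # 10x the number of documents to return (default 1,000)
    return ranked[: 10 * docs_to_return]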

...

Scoring describes the process whereby Infinit.e processes queries and returns documents to the Infinit.e UI and the widgets.

When a user performs a query, an object is returned which includes the documents and their sub-objects, such as entities, associations and aggregations. As part of the document object there are three query enrichment parameters which describe relevance, significance and overall score.

Aggregate significance describes how well the user's query matches the entities in the source documents.

Query Relevance describes how well a document matches the user's query.

The overall Score is a combined normalized significance/relevance score.  The top scoring documents are returned to the widgets.
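To illustrate, each returned document carries the three scores as fields along the following lines (the field names "queryRelevance", "aggregateSignif" and "score" are described later on this page; the other fields and the values are placeholders):

doc = {
    "title": "Example document",
    "entities": [],            # entity sub-objects
    "associations": [],        # association sub-objects
    "queryRelevance": 1.7,     # how well the document matches the query
    "aggregateSignif": 112.4,  # aggregate of the entity significance scores
    "score": 104.2,            # combined, normalized significance/relevance score
}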

[Gliffy diagram: Scoring]


 

Significance and Relevance

When Infinit.e scores documents it weights the significance and relevance scores against each other using the "sigWeight" and "relWeight" parameters of the score object. This weighting can be adjusted via the advanced options as a ratio (default 2:1 in favour of significance); the two scores are then combined and normalized so that the mean score is 100.

...

(Similar to the "+" stats in baseball, i.e. a score of 120 is 20% higher than average.)
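A minimal sketch of this weighting and normalization, assuming each document already carries "aggregateSignif" and "queryRelevance" values; the 2:1 default and the mean of 100 come from this page, but the exact formula is an assumption:

def combine_scores(docs, sig_weight=2.0, rel_weight=1.0):
    # Weighted combination of significance and relevance (default 2:1)
    raw = [sig_weight * d["aggregateSignif"] + rel_weight * d["queryRelevance"]
           for d in docs]
    mean = sum(raw) / len(raw)
    for d, r in zip(docs, raw):
        # Normalize so the mean combined score is 100 (like the "+" stats:
        # a score of 120 is 20% above average)
        d["score"] = 100.0 * r / mean
    # The top scoring documents are the ones handed back to the widgets
    return sorted(docs, key=lambda d: d["score"], reverse=True)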

...

All three scores are attached to the documents as "queryRelevance", "aggregateSignif" and "score" respectively.

5. The top scoring documents or entities are returned in response to the query.

Significance

1. All entities identified within a document are assigned a significance score based on...

2. Each document is also assigned a significance score, an aggregate of all the entity scores within it.

3. Each entity is also assigned a "datasetSignificance", the average of its significance scores across all documents in which it appears.

4. For each entity/document pair, a TF-IDF score is generated: the entity's "significance". This score is adjusted in a number of ways (a sketch follows this list):

  • Entities with low document counts have their significance suppressed: by 33% if the count is well below a dynamically calculated "noise floor", or by 66% if it is just below or at the "noise floor"

  • When only a subset of the matching documents is returned (i.e. when more than 1,000 documents match), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset
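A hedged sketch of the per-entity adjustment; the TF-IDF formula, the noise-floor thresholds, and the interpretation of "suppressed by 33%/66%" as multipliers are all assumptions made for illustration:

import math

def entity_significance(term_freq, doc_length, doc_count, corpus_size, noise_floor):
    # Basic TF-IDF for the entity/document pair (the exact Infinit.e formula is not given here)
    tfidf = (term_freq / doc_length) * math.log(corpus_size / (1.0 + doc_count))
    # Suppress entities with low document counts relative to the dynamic "noise floor"
    if doc_count < 0.5 * noise_floor:   # well below the noise floor (threshold assumed)
        tfidf *= 0.33
    elif doc_count <= noise_floor:      # just below/at the noise floor
        tfidf *= 0.66
    # (The extrapolation applied when more than 1,000 documents match is omitted here.)
    return tfidf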

...

Relevance 

Relevance measures how well a document matches your query, as opposed to significance, which measures how well an entity matches your query.

...

).

Increasing the "sigWeight" field tends to return documents that are longer and do not necessarily relate strongly to the user's query terms; instead, the returned documents tend to discuss concepts particular to the query.

 

Increasing the "relWeight" field tends to return documents that are shorter and relate very strongly to the user's query.

 

For example, for a query on "american politics", the most significant documents would contain discussion of Obama, Palin, etc.; the most relevant documents would contain the words "american" and "politics" with high frequency compared to other words.
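For a query like this, the weighting might be adjusted via the advanced options roughly as follows; only the "sigWeight" and "relWeight" field names come from this page, and the surrounding query structure is an assumption for illustration:

query = {
    "ftext": "american politics",  # free-text query (assumed field name)
    "score": {
        "sigWeight": 2.0,  # favour documents whose entities are significant for the query
        "relWeight": 1.0,  # favour documents whose text closely matches the query terms
    },
}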

Todo

-entity vs. doc significance

-significance can be impacted by temporal and geo-spatial


Related Documentation: