Scoring

When a user performs a query, an object is returned that includes the matching documents and their sub-objects, such as entities, associations and aggregations. Each document object carries three query enrichment parameters, which describe its significance, its relevance and its overall score.

Aggregate significance describes how well the user's query matches the entities in the source documents.

Query Relevance describes how well a document matches the user's query.

The overall Score is a combined normalized significance/relevance score.  The top scoring documents are returned to the widgets.

Significance and Relevance

When IKANOW scores documents, it weights the significance and relevance scores against each other using the "sigWeight" and "relWeight" parameters of the score object. The weighting is expressed as a ratio (default 2:1 in favour of significance) and can be adjusted in the advanced options. The weighted scores are then combined, with the mean score set to 100. This is similar to the "+" stats in baseball, where 120 is 20% higher than average.
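The weighting described above can be illustrated with a minimal sketch. It assumes that each raw score list is rescaled so its mean is 100 and that the two are then blended as a weighted average; the exact normalization IKANOW applies is not stated here, and the function name and arguments are hypothetical.

```python
from statistics import mean


def combined_scores(significance, relevance, sig_weight=2.0, rel_weight=1.0):
    """Blend per-document significance and relevance into one score.

    Assumption (not confirmed by the documentation): each raw score list is
    rescaled so that its mean is 100, then the two are mixed as a weighted
    average using the sig_weight:rel_weight ratio (default 2:1).
    """
    def to_mean_100(values):
        avg = mean(values)
        # Guard against an all-zero score list.
        return [100.0 * v / avg if avg else 0.0 for v in values]

    sig_norm = to_mean_100(significance)
    rel_norm = to_mean_100(relevance)
    total = sig_weight + rel_weight
    return [
        (sig_weight * s + rel_weight * r) / total
        for s, r in zip(sig_norm, rel_norm)
    ]


# Three documents: the third is the most significant, the first the most relevant.
scores = combined_scores([1.0, 2.0, 3.0], [30.0, 20.0, 10.0])
```

With the default 2:1 ratio the third document ranks highest even though it is the least relevant; reversing the ratio (sig_weight=1.0, rel_weight=2.0) puts the first document on top instead.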

Increasing the "sigWeight" field tends to return documents that are longer and do not necessarily relate strongly to the user's query; instead, they tend to discuss concepts particular to the query. Increasing the "relWeight" field tends to return documents that are shorter and relate very strongly to the user's query.

For example, for a query on "american politics", the most significant documents would contain discussion of Obama, Palin, etc.; the most relevant documents would contain the words "american" and "politics" with high frequency compared to other words.

For more information, see section Weightings.

Entity Significance and Document Significance

In queries, documents and entities are closely related and many of the metrics that populate the widgets are driven by the relationships between these objects.

For example, "Query Coverage" is the percentage of matching documents (within the dataset) in which the queried entity appears. "Query Count (approx)", as seen on the Query Metrics widget, provides the approximate number of documents that match an entity query.
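As a rough illustration of the coverage calculation, the sketch below treats each matching document as a set of entity names. The function name and the data layout are hypothetical and are not part of the IKANOW API.

```python
def query_coverage(matching_docs, entity):
    """Return the percentage of matching documents that contain `entity`.

    `matching_docs` is a list of sets of entity names (an assumed layout,
    chosen only for illustration).
    """
    if not matching_docs:
        return 0.0
    hits = sum(1 for doc in matching_docs if entity in doc)
    return 100.0 * hits / len(matching_docs)


docs = [
    {"obama", "palin"},
    {"obama"},
    {"senate"},
    {"obama", "senate"},
]
coverage = query_coverage(docs, "obama")  # entity appears in 3 of 4 documents
```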

Significance measures the relationship between an entity and its associated document. It is based on TF-IDF, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Document significance, on the other hand, sums all of the entity/document significance measurements for a document to arrive at Aggregate significance. This is the metric that is used to determine document scoring. It is also possible to analyse query significance (%), which measures the TF-IDF score of the entity for a specific query.
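The relationship between entity significance and aggregate significance can be sketched with a textbook TF-IDF. This is an assumption made for illustration: the documentation only says the measurement is "based on" TF-IDF, so the smoothed formula below, the function names, and the representation of a document as a list of entity mentions are all hypothetical.

```python
import math
from collections import Counter


def entity_significance(doc_entities, corpus):
    """Return a TF-IDF style significance for each entity in one document.

    `doc_entities` is a list of entity mentions in the document and `corpus`
    is a list of such lists. The smoothed IDF, log((1 + N) / (1 + df)), is a
    common textbook variant and is assumed here, not taken from IKANOW.
    """
    counts = Counter(doc_entities)
    n_docs = len(corpus)
    result = {}
    for entity, term_freq in counts.items():
        doc_freq = sum(1 for other in corpus if entity in other)
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        result[entity] = term_freq * idf
    return result


def aggregate_significance(doc_entities, corpus):
    """Sum the entity/document significances to get the document's total."""
    return sum(entity_significance(doc_entities, corpus).values())


corpus = [
    ["obama", "palin", "obama"],
    ["obama", "senate"],
    ["obama", "weather"],
]
per_entity = entity_significance(corpus[0], corpus)
total = aggregate_significance(corpus[0], corpus)
```

In this toy corpus "obama" appears in every document, so it contributes nothing to the first document's significance despite being mentioned twice; the rarer "palin" accounts for the whole aggregate value.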

"MaxDoc Significance" measures the percentage of times the entity occurs in the documents in comparison to the other entities taken together as an average, while "MaxDoc Frequency" measures the highest number of times the entity occurred within a single document. These metrics can be seen on the Entity Significance widget.

Time Proximity and Geospatial Proximity

Significance can also be made dependent on time or geo factors. For example, you might want to give less "weight" to entities that are spatially remote from a specific bounding box, or to events that occurred significantly outside of a specific date range. The "timeProx" and "geoProx" elements of the "score" object are used for this purpose.
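One way to picture proximity weighting is as a multiplier that shrinks as an item moves away from a reference time or place. The sketch below is only an illustration under stated assumptions: the 1 / (1 + distance / decay) fall-off, the use of a single centre point rather than a bounding box, and every function and parameter name are hypothetical, and none of them describe the actual "timeProx" or "geoProx" implementation.

```python
import math
from datetime import datetime, timedelta


def time_proximity(event_time, centre, decay):
    """Weight in (0, 1]: 1.0 at `centre`, 0.5 when one `decay` away."""
    gap_seconds = abs((event_time - centre).total_seconds())
    return 1.0 / (1.0 + gap_seconds / decay.total_seconds())


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() protects asin() from rounding slightly above 1.0.
    return 2 * earth_radius_km * math.asin(min(1.0, math.sqrt(a)))


def geo_proximity(lat, lon, centre_lat, centre_lon, decay_km):
    """Weight in (0, 1]: 1.0 at the centre, 0.5 at `decay_km` away."""
    distance = haversine_km(lat, lon, centre_lat, centre_lon)
    return 1.0 / (1.0 + distance / decay_km)


centre = datetime(2012, 11, 6)
week = timedelta(days=7)
recent = time_proximity(centre + timedelta(days=1), centre, week)
older = time_proximity(centre + timedelta(days=60), centre, week)
```

A significance value multiplied by such a weight is left almost unchanged for items near the reference point and is progressively discounted for remote ones.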

For more information, see section Weightings.

Related Community Edition Documentation:

Event Timeline

Entity Significance

Query Metrics

Weightings