...

  • The user query is turned into an ElasticSearch query and applied across the cluster.
  • The number of documents returned from ElasticSearch is capped at a "large" number (default 1000, eg 10x the number of documents to return). The documents are ordered by their Lucene score (or optionally just by descending date).
  • Each returned document is then assigned a significance score as described below.
  • The significance and relevance scores are then normalized against each other based on a relative importance specified by the user (default 2:1 in favor of significance) and combined, with the mean score set to 100 (like the "+" stats in baseball, eg 120 is 20% higher than average).
  • The top scoring documents (TBD link to output/documents) or entities (TBD link to output/aggregation) are returned to the client.
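The normalization and combination steps above can be sketched as follows. This is a minimal illustration only: the function and field names are assumptions, and the actual normalization may differ in detail. It shows the key idea that each score set is normalized to its own mean, weighted (default 2:1 in favor of significance), and rescaled so the mean combined score is 100.

```javascript
// Sketch: combine significance and relevance scores with a user-specified
// weighting (default 2:1 in favor of significance), normalized so that the
// mean combined score is 100 ("+"-style stats, eg 120 = 20% above average).
// Names ("combineScores", "significance", "relevance") are illustrative
// assumptions, not the actual implementation.
function combineScores(docs, sigWeight = 0.67, relWeight = 0.33) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  // Normalize each score set against its own mean (so each averages 1.0)
  const sigMean = mean(docs.map((d) => d.significance));
  const relMean = mean(docs.map((d) => d.relevance));
  return docs.map((d) => ({
    ...d,
    score:
      100 *
      (sigWeight * (d.significance / sigMean) +
        relWeight * (d.relevance / relMean)),
  }));
}
```

Because sigWeight + relWeight = 1, the combined scores average exactly 100 regardless of the raw score scales.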

...

It is beyond the scope of this documentation to go into much detail about significance, but this section provides a brief description (a 1-line summary is also provided below!):

  • Each document has a set of entities (TBD link to doc/entity document models)
  • For each entity,document pair, a TF-IDF score is generated, the entity's "significance" (TBD link). This score is adjusted in a number of ways:
    • Entities with low document counts have their significance suppressed: by 33% if well below a dynamically calculated "noise floor", or by 66% if just below/at the "noise floor"
    • When only a subset of the matching documents are returned (eg > 1000 documents), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset.
  • The document significance ("aggSignificance" TBD link) is the sum of the entity,document significances. This score can be adjusted in a number of ways:
    • Temporally, using the document's "publishedDate" field and a standard "decay algorithm"
    • Geo-spatially, similarly to the above but based on distance using the closest entity to the decay origin (lat,long).
  • Entities are also assigned a "datasetSignificance", which is just the average of the significances across all documents in which it appears.
    • Note that neither entity score is currently adjusted for time or geo-spatial decay, though this will be added as an option in a future release.
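The per-entity scoring above can be sketched in simplified form. The formulas and names below are assumptions for illustration: the real scoring also applies the noise-floor suppression and subset-extrapolation adjustments described above, which are omitted here.

```javascript
// Sketch: a TF-IDF-style significance for an (entity, document) pair.
// Simplified assumption - the actual formula and adjustments differ.
function entitySignificance(termFreqInDoc, docCount, totalDocs) {
  const tf = termFreqInDoc;
  const idf = Math.log(totalDocs / (1 + docCount)); // rarer entities score higher
  return tf * idf;
}

// "datasetSignificance": the average of an entity's significances across
// all documents in which it appears.
function datasetSignificance(perDocSignificances) {
  return (
    perDocSignificances.reduce((a, b) => a + b, 0) /
    perDocSignificances.length
  );
}
```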

...

Note that this parameter is also currently used to determine how many documents are used to generate the "event timeline" (TBD output/document section).

Code Block (javascript): Scoring parameters - significance/relevance weighting
{
	"score": {
		// See preceding sections for other parameters
		"sigWeight": number, // (default: 0.67)
		"relWeight": number, // (default: 0.33)
		// See following sections for other parameters
	}
}

...

"time" is the center point around which to decay.It has the same format as the "min" and "max" fields of the "time" query term, ie "now", Unix time in ms, or one of a standard set of date/time formats (ISO, SMTP, etc).

"decay" is the "half life" of the decay (ie the duration from "time" at which the score is halved). It is in the format "N[INF:dmwy]" where N is an integer and d,m,w,y denote "day", "month", "week" or "year" (eg "1w", "1m"; note currently if "m" is used, then the duration is always 1 month).

...

"ll" is the lat/long of the center point around which to decay. It has the same format as the "centerll"/"minll"/"maxll" fields of the geospatial query term, ie "lat,long" or "(lat,long)".

"decay" is the "half life" of the decay (ie the distance from "ll" at which the score is halved). It is in the same format as the "dist" field of the geospatial query term, ie in the format "<distance><unit>" where <distance> is an integer or floating point number, and unit is one of "m" (miles), "km" (kilometers), "nm" (nautical miles).

...