Overview
In order to understand the scoring parameters presented by the Infinit.e API, it is necessary to have a basic understanding of the query and scoring process:
- The user query is turned into an ElasticSearch query and applied across the cluster.
- The number of documents returned from ElasticSearch is capped at a "large" number (default 1000, eg 10x the documents to return). The documents are ordered by their Lucene score (or optionally just by descending date).
- Each returned document is then assigned a Significance score as described below.
- The significance and relevance scores are then normalized against each other based on a relative importance specified by the user (default 2:1 in favor of significance) and combined, with the mean score set to 100 (like the "+" stats in baseball, eg 120 is 20% higher than average).
- (All three scores are attached to the documents, as "queryRelevance", "aggregateSignif", and "score" respectively)
- The top scoring documents or entities are returned to the client.
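For illustration, a returned document fragment might carry the three scores like this (all values are purely illustrative):
{
    // ... other document fields ...
    "queryRelevance": 1.85, // Lucene relevance score (illustrative value)
    "aggregateSignif": 27.4, // document significance, as described below (illustrative value)
    "score": 112.0 // combined score; per the normalization above, 100 represents the average returned document
}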
Significance
It is beyond the scope of this documentation to go into much detail about significance, but this section provides a brief description (a 1-line summary is also provided below!):
- Each document has a set of entities
- For each entity,document pair, a TF-IDF score is generated, the entity's "significance". This score is adjusted in a number of ways:
- Entities with low document counts have their significance suppressed by 33% (well below a dynamically calculated "noise floor") or 66% (just below/at the "noise floor")
- When only a subset of the matching documents are returned (eg > 1000 documents), the significance is adjusted to estimate the TF-IDF across the entire matching dataset, not just the returned subset.
- The document significance ("aggSignificance") is the sum of the entity,document significances. This score can be adjusted in a number of ways:
- Temporally, using the document's "publishedDate" field and a standard "decay algorithm"
- Geo-spatially, similarly to the above but based on distance using the closest entity to the decay origin (lat,long).
- Entities are also assigned a "datasetSignificance", which is just the average of the significances across all documents in which it appears.
- Note that neither entity score is currently adjusted for time or geo-spatial decay, though this will be added as an option in a future release.
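As a sketch, an entity fragment within a returned document might therefore look like this (illustrative values, other entity fields omitted):
{
    // ... other entity fields (name, type, etc) ...
    "significance": 12.4, // TF-IDF-based score for this entity,document pair
    "datasetSignificance": 9.1 // average of "significance" across all returned documents containing the entity
}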
Separate documentation will describe the scoring algorithms in more detail.
In summary: relevance measures how well a document matches the user's query, significance measures how well an entity matches the user's query (and document significance is simply the sum of the entity significances).
Scoring parameters
All scoring parameters are maintained under a "score" object under the top level query. The remainder of this section describes the "score" object's fields.
{ "score": { "numAnalyze": integer, // (default: 1000) "scoreEnts": boolean, // (default: true) // See following sections for other parameters } }
The "numAnalyze" parameter dictates the maximum number of documents to be returned from the Lucene (/ElasticSearch) query and analyzed according to the significance algorithm described above. The larger the number, the more accurate the results but the slower the query.
Empirically, the default of 1000, which takes 0.5-1 seconds, has produced good results.
Note that this parameter is also currently used to determine how many documents are used to generate the "event timeline".
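For example, a query that can afford a slower response in exchange for more accurate significance scores might raise the cap (the value is purely illustrative):
{
    "score": {
        "numAnalyze": 5000
    }
}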
If "scoreEnts" is set to false (defaults to true if not present), then entities do not have significance scores generated.
- This can be useful in cases where documents are not being scored using significance either (see the "sigWeight" field below) - in this case, rather than retrieving "numAnalyze" documents, only "output.docs.numReturn" documents need to be retrieved, which is much faster. (See the sketch below.)
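For example, a sketch of such a "fast" query, showing only the relevant fields (the nesting of "numReturn" is inferred from the "output.docs.numReturn" path above):
{
    "score": {
        "scoreEnts": false, // no entity significance
        "sigWeight": 0.0 // no document significance, documents are ranked on relevance alone
    },
    "output": {
        "docs": {
            "numReturn": 10 // only this many documents are retrieved and analyzed
        }
    }
}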
{ "score": { // See preceding sections for other parameters "sigWeight": number, // (default: 0.67) "relWeight": number, // (default: 0.33) "adjustAggregateSig": boolean, // (default: auto-decide, see below) // See following sections for other parameters } }
The "sigWeight" and "relWeight" floating point numbers represent the relative weight of significance vs relevance (as described above). If they don't sum to 1, each is simply divided by their sum (ie they are normalized).
Increasing the "sigWeight" field tends to return documents that are longer and don't necessarily strongly relate to the user's query; instead they will tend to return documents that discuss concepts particular to the query.
Increasing the "relWeight" field tends to return documents that are shorter and very strongly relates to the user's query.
- (Eg for a query on "american politics", the most significant documents would contain discussion of Obama, Palin etc; the most relevant documents would contain the words "american" and "politics" with high frequency compared to other words)
If one of the two weights is set to 0 then its score is neither calculated nor used.
If both weights are set to 0 then documents are ranked in descending date order and no scoring is performed.
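As a sketch, the two extremes of these settings:
// Rank purely on Lucene relevance, significance is neither calculated nor used:
{ "score": { "sigWeight": 0.0, "relWeight": 1.0 } }
// No scoring at all, documents are returned in descending date order:
{ "score": { "sigWeight": 0.0, "relWeight": 0.0 } }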
If "adjustAggregateSig" is set to true (the default is described below), then aggregation significances (ie in entities and associations) are adjusted for the average relevances of the documents containing them. This is useful in cases where free text queries are used, eg "ftext": "bob smith" will match on documents only containing "bob" or only containing "smith", but with lower relevance scores - so entities contained in such documents should be weighted down accordingly.
By default, this is enabled automatically if "ftext" terms are used, and otherwise disabled. The reason for this is that in other cases when relevance may vary widely, eg chains of "etext" queries linked with ORs, it is not normally desirable to adjust the entities (in the example given, it is likely to be a list of aliases: scoring up documents that contain many aliases via the relevance term is fine, but there is no reason to score up the individual entities contained in the documents.)
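For example, to override the automatic behavior and force the adjustment off even when "ftext" terms are used (showing only the relevant field):
{
    "score": {
        "adjustAggregateSig": false
    }
}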
{ "score": { // See preceding sections for other parameters sourceWeights: { string: double }, typeWeights: { string: double }, tagWeights: { string: double }, // See following sections for other parameters }
The 3 fields listed above allow users to adjust document scores manually. This can be useful when, for example, a source known to contain documents of relatively low intelligence value scores highly because each document has many common entities, particularly when those entities are unique to the source (hence queries involving that source will tend to give those entities a higher significance).
The weights are specified in the following format:
{ //.. "sourceWeights": { "washingtondc.IncidentReport": 0.5, // halve scores from the source with this key (source.key) "arstechnica.com.tech-policy.2012.10.last-android-vendor-f.147.4.": 1.5 // increase documents from this source by 50% } "typeWeights": { "Record": 2.0, // Any documents of source.type "Record" that don't match a source weight "Report": 0.75, //(etc) } "tagWeights": { "database": 1.5, // If a document has this tag, multiply its total score by 1.5 "largeReport": 0.25 // (averaged across all matching tags) } //... }
The weights are applied as follows:
- First the source weights are applied.
- If no source weight matches the document, then the type weights are applied.
- If no type weight matches the document, then a tag weight is generated by averaging all matching entries from the "tagWeights" map.
- If no weight matches the document, then its total score is preserved.
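For example, using the "tagWeights" map shown earlier, a document that matches no source or type weight but is tagged with both "database" and "largeReport" would have its total score multiplied by (1.5 + 0.25) / 2 = 0.875.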
{ "score": { // See preceding sections for other parameters "timeProx":{ "time": string, "decay": string }, // See following sections for other parameters } }
"time" is the center point around which to decay.It has the same format as the "min" and "max" fields of the "time" query term, ie "now", Unix time in ms, or one of a standard set of date/time formats (ISO, SMTP, etc).
"decay" is the "half life" of the decay (ie the duration from "time" at which the score is halved). It is in the format "N[dmwy]" where N is an integer and d,m,w,y denote "day", "month", "week" or "year" (eg "1w", "1m"; note currently if "m" is used, then the duration is always 1 month).
{ "score": { // See preceding sections for other parameters "geoProx":{ "ll": string, "decay": string } } }
"ll" is the lat/long of the center point around which to decay. It has the same format as the "centerll"/"minll"/"maxll" fields of the geospatial query term, ie "lat,long" or "(lat,long)".
"decay" is the "half life" of the decay (ie the distance from "ll" at which the score is halved). It is in the same format as the "dist" field of the geospatial query term, ie in the format "<distance><unit>" where <distance> is an integer or floating point number, and unit is one of "m" (miles), "km" (kilometers), "nm" (nautical miles).
{ "score":{ "numAnalyze": 1000, "sigWeight": 0.67, "relWeight": 0.33, " "timeProx": { "time": "now", "duration": "1m" } } }