Knowledge - Query - Query Terms

Overview of Query Terms

Within the top level query JSON object there is a query field "qt" that is an array of query term objects. Query term objects are described below and allow the following query types: exact text, free text, entities, geospatial, temporal and associations.

These terms can then be combined in an arbitrary boolean expression (with the operators AND, OR, NOT and parentheses) using the (case insensitive) "logic" field of the top level object, where the different terms are denoted by their index in the array (counting from 1). For example:

 

Example top level query
{
	"qt": [ { term1 }, { term2 }, { term3 }, { term4 } ],
	"logic": "1 AND (2 OR 3) AND NOT 4"
}

In the above example each of term1-term4 is one of the objects described below.

If the logic term is set to null or not present, it defaults to ANDing all the terms together.
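As an illustration of the default, the following Python sketch builds the "logic" string that is equivalent to leaving the field out. The helper name default_logic is hypothetical (it is not part of the API); it only shows the 1-based indexing of terms.

```python
def default_logic(num_terms: int) -> str:
    """Return a logic string that ANDs every term together.

    Terms are referenced by their 1-based position in the "qt" array.
    (Hypothetical helper, for illustration only.)
    """
    return " AND ".join(str(index) for index in range(1, num_terms + 1))


query = {
    "qt": [
        {"etext": "barack obama"},
        {"ftext": "+election"},
        {"entityType": "person"},
    ]
}
# Spelling out the default explicitly:
query["logic"] = default_logic(len(query["qt"]))
print(query["logic"])  # 1 AND 2 AND 3
```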

In the "dot notation" used to represent query objects such as URL parameters in "GET" requests, the different "qt" terms are represented as "qt[0]", "qt[1]", etc (ie indexed from 0, unlike the "logic" string).
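The mapping from a JSON query object to dot notation can be sketched as below. This is a minimal illustration, assuming that nested objects are joined with "." and array elements are written as "[n]" with n counted from 0, as described above; the function name to_dot_notation is hypothetical, and how the resulting values are quoted or URL-encoded is not covered here.

```python
def to_dot_notation(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted_path: leaf_value} pairs."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(to_dot_notation(value, path))
    elif isinstance(obj, list):
        # Array positions are 0-based here, unlike the 1-based "logic" string.
        for index, value in enumerate(obj):
            flat.update(to_dot_notation(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = obj
    return flat


query = {
    "qt": [
        {"etext": "barack obama"},
        {"geo": {"centerll": "40.12,-71.34", "dist": "100km"}},
    ],
    "logic": "1 AND 2",
}
for path, value in to_dot_notation(query).items():
    print(f"{path}={value}")
```

Note that the second term is "qt[1]" in dot notation but "2" in the "logic" string.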

Finally, note that the combination of "qt" and "logic" can be replaced by the "raw" object described at the bottom of this page, which gives the user access to the raw ElasticSearch query API. If both "qt"/"logic" and "raw" are present, the "qt"/"logic" fields are ignored.

Field Guide

Exact Text

The exact text query term object has the following format:

Exact text format
{
	"etext": string
}

The "etext" string is a phrase that must match exactly somewhere in the document (in any of the text fields). There is one exception: if "*" is the "etext" field then it matches all documents.

For example (using dot notation), qt[0].etext="barack obama" will match on documents containing "barack obama" but not documents containing only (eg) "president obama", "barack 'barry' obama" etc.

Free Text

The free text query term object has the following format:

Free text format
{
	"ftext": string
}

The "ftext" field represents an arbitrary Lucene query (Lucene syntax, including ElasticSearch extensions and modifications). By default, all text fields in the document (including its entities and events; see the document format) are included in the query, though the standard "field:text" syntax can be used to restrict a clause to a single field.

In addition to the Lucene and Elasticsearch extensions, there are the following Community Edition extensions:

  • "$cache": if the value points to the title or "_id" of a saved query, the results of that saved query are returned. This currently only works if it is the only query element (otherwise the call will error).

Examples

 { "qt": [ { "ftext": "barack obama" } ] } will match any document containing either "barack" or "obama", with documents containing both scored more highly.

 { "qt": [ { "ftext": "+barack +obama" } ] } requires that both be present (but not necessarily in the same phrase).

 { "qt": [ { "ftext": "\"barack obama\"" } ] } is equivalent to the "etext" query described above.

 qt[0].ftext="+obama -palin": documents containing the word "Obama" but not containing the word "Palin".

 qt[0].ftext="title:\"palin\"": documents with the word "Palin" in the title.

Special Characters

When using dot notation and typing queries directly into the URL bar, note that characters like '+' must be double-URL-encoded, eg '+' becomes %2B after the first encoding and %252B after the second.
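Double URL encoding can be sketched with the Python standard library as follows. The helper name double_url_encode is hypothetical; urllib.parse.quote is the standard percent-encoding function, called here with safe="" so that no reserved characters are left unencoded.

```python
from urllib.parse import quote


def double_url_encode(text: str) -> str:
    """Percent-encode text twice, so that '+' becomes %2B and then %252B."""
    once = quote(text, safe="")
    return quote(once, safe="")


print(double_url_encode("+"))              # %252B
print(double_url_encode("+obama -palin"))  # %252Bobama%2520-palin
```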

Saved Queries

Saved queries, generated either directly from the API or via the plugin manager GUI, can be accessed via the API using the "ftext" query: simply create an "ftext" query with the custom job's name or "_id" prefixed by "$cache:", eg:

{
	"ftext": "$cache:MY_SAVED_JOB"
}

Entities

The entity query term object has the following 2 possible formats:

Entity format
{
	// EITHER
	"entity": string,
	//OR
	"entityValue": string, // (entityValue is mandatory, entityType is optional)
	"entityType": string,
	// AND OPTIONALLY
	"entityOpt": { // (optional, see below for demos)
		"expandAlias": boolean, // (optional, defaults to false if not present)
		"rawText": boolean // (optional, defaults to false if not present)
	},
	"sentiment": { // (optional, specify one or both of min/max, see below)
		"min": number, 
		"max": number
	}
}

entityValue, entityType

In the first instance the "entity" string is in the format "entityValue/entityType" (this is its "index" form, eg "index" in the Entity JSON object).

In the second, decomposed, instance either of "entityValue" or "entityType" can be left out (if "entityValue" is left out this would match on all entities of a given type; if entityType is left out, it would match on all entity names regardless of the type).

entityOpt

The optional "entityOpt.expandAlias" boolean term allows matching not just on the entity but also on common, automatically extracted "aliases". This tends to match more documents, some of which will be false positives. This query type is also slower.

Note this is different to manual entity aliasing, described here.

The optional "entityOpt.rawText" boolean term adds the entity's disambiguated name as an exact text query - this can be useful when some sources have low quality entity extraction (eg are in foreign languages, or are in list format etc), since any instance of the name appearing in a page will result in that page's selection.

sentiment

Entity queries can be combined with sentiment, using the "sentiment" object described above. If the "entity" or "entityValue" field is specified, then only documents containing that entity with a sentiment field that exists and is in the specified range are selected. If neither of these fields is specified, then only documents containing 1+ entities with sentiment are selected.

Entity Examples

Some examples:

  • qt[0].entity="facebook/company": will match on documents containing references to the company Facebook, but not the technology.
  • qt[0].entityValue="facebook"&qt[0].entityType="company": equivalent to the above
  • qt[0].entityValue="facebook": will match on both uses of the term Facebook
  • { "qt": [ { "entity": "barack obama/president", "entityOpt": { "expandAlias": true } } ] }: will match on documents containing references to Barack Obama, but also other common text strings such as "Barry Obama", "President Obama" etc.
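The relationship between the two entity formats can be sketched as below. This is an illustration only: the helper name split_entity_index is hypothetical, and splitting on the last "/" assumes that entity types never contain a "/" (the source does not state this).

```python
def split_entity_index(entity: str) -> dict:
    """Convert the "entityValue/entityType" index form into the decomposed form.

    Assumption: the type is everything after the LAST "/" in the string.
    """
    value, separator, entity_type = entity.rpartition("/")
    if not separator or not value or not entity_type:
        raise ValueError(f"expected 'entityValue/entityType', got {entity!r}")
    return {"entityValue": value, "entityType": entity_type}


term = split_entity_index("facebook/company")
print(term)  # {'entityValue': 'facebook', 'entityType': 'company'}

# Optional fields are added alongside either form:
term["entityOpt"] = {"expandAlias": True}
```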

Geospatial

The geospatial query term has the following possible formats:

Geospatial format
{
	"geo": {
		"centerll": string,
		"dist": string,
		"ontology_type": string // optional, see below
	}
}
//or
{
	"geo": {
		"minll": string,
		"maxll": string,
		"ontology_type": string // optional, see below
	}
}

In the first case, the user specifies a circle via its center latitude/longitude pair ("centerll") and its radius ("dist").

In the second case, the user is specifying a bounding box via the "minll" (lowest lat and long values ie the "bottom left") and "maxll" (highest lat and long values, ie the "top right").

In all cases the lat/long values are represented as strings either as "(<lat>,<long>)" or "<lat>,<long>" (ie the same but without parentheses).
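A client-side check of the two accepted lat/long string forms might look like the following sketch. The function name parse_latlong is hypothetical, and the server's own parsing may be more or less lenient (eg about whitespace) than this.

```python
from typing import Tuple


def parse_latlong(text: str) -> Tuple[float, float]:
    """Parse "(<lat>,<long>)" or "<lat>,<long>" into a (lat, long) pair of floats."""
    stripped = text.strip()
    if stripped.startswith("(") and stripped.endswith(")"):
        stripped = stripped[1:-1]  # drop the optional parentheses
    lat_text, long_text = stripped.split(",")  # raises ValueError if not exactly 2 parts
    return float(lat_text), float(long_text)


print(parse_latlong("(4.1,-171.34)"))  # (4.1, -171.34)
print(parse_latlong("40.12,-71.34"))   # (40.12, -71.34)
```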

dist

The "dist" string is a distance in the format "<distance><unit>", where <distance> is an integer or floating point number and <unit> is one of "m" (miles), "km" (kilometers) or "nm" (nautical miles). If the unit is omitted, kilometers are assumed.
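To make the units concrete, the sketch below converts a "dist" string to kilometers. The names dist_to_km and KM_PER_UNIT are hypothetical; the regular expression is an assumption about what counts as a well-formed value, and the default of "km" for a missing unit follows the examples later in this section. Note that "m" means miles here, not meters.

```python
import re

# Conversion factors: international mile and nautical mile, both exact in km.
KM_PER_UNIT = {"m": 1.609344, "km": 1.0, "nm": 1.852}

_DIST_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(km|nm|m)?\s*$")


def dist_to_km(dist: str) -> float:
    """Convert a "<distance><unit>" string to kilometers ("m" = miles)."""
    match = _DIST_PATTERN.match(dist)
    if match is None:
        raise ValueError(f"unrecognised distance: {dist!r}")
    number, unit = match.groups()
    return float(number) * KM_PER_UNIT[unit or "km"]  # no unit: assume km


print(dist_to_km("100km"))  # 100.0
print(dist_to_km("100"))    # 100.0
```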

ontology type

In both cases, an optional "ontology_type" can be specified. If it is, entities with the same or a higher ontology type are ignored, ie only strictly "smaller" types are searched (eg if countrysubsidiary is specified then only city and point types will match). geographicalregion counts as being at the same level as continent for the purpose of this heuristic.

See geo discussion for more details.

Examples

  • qt[0].geo.centerll="40.12,-71.34"&qt[0].geo.dist="100km": within 100km of the specified lat/long.
  • { "qt": [ { "geo": { "centerll": "40.12,-71.34", "dist": "100" } } ] }: uses the default unit (km), ie is the same query as above.
  • qt[0].geo.minll="(4.1,-171.34)"&qt[0].geo.maxll="40.12,-71.34": bounding box showing lat/long format with and without parentheses.

Temporal 

The temporal query term has the following format:

Temporal format
{
	"time": {
		"min": string,
		"max": string
	},
	// AND OPTIONALLY
	"entityOpt": { // (deprecated)
		"lockDate": boolean
	}
}

At least one of "min" and "max" must be specified. If one of them is not specified, time is not bounded in that direction (eg if "min" is not specified then it means "all times before max"; if "max" is not specified then it means "all times after min").

The date fields are both strings and support a number of different formats:

  • "now" which always resolves to the current time, 
  • "now-XXX" where XXX can be a standard qualifier like "1d", "4d", "1w", etc: a period relative to now.
  • "midnight" which always resolves to midnight of the previous day
  • "midnight-XXX" where XXX can be a standard qualifier like "1d", "4d", "1w", etc: a period relative to last midnight.
  • any Unix time in milliseconds (ie milliseconds after "Jan 1 00:00:00 1970"), 
  • and the following date/date-time formats: "yyyy'-'DDD", "yyyy'-'M'-'dd", "yyyyMMdd", "dd MMM yyyy", "dd MMM yy", "MM/dd/yy", "MM/dd/yyyy", "MM.dd.yy", "MM.dd.yyyy", "dd MMM yyyy hh:mm:ss", "yyyy-MM-dd" (ISO Date), "yyyy-MM-ddZZ" (ISO Date-Timezone), "yyyy-MM-dd'T'HH:mm:ssZZ" (ISO DateTime-Timezone), "EEE, dd MMM yyyy HH:mm:ss Z" (SMTP DateTime).

The following option is deprecated but is used by the UI - the extended syntax described above is preferred.

If "entityOpt.lockDate" is set then the max time is fixed to "now", and the min time is adjusted to keep the period constant. To fix the min time instead of the max time, simply reverse min/max in the query term. If max is not specified, then the min time is set to "now" instead - eg "all future dates"

Examples:

  • { "qt": [ { "time": { "min": "1284666757164", "max": "now" } } ] }: from 16 Sep 2010 until now.
  • qt[0].time.min="now": any time in the future.
  • qt[0].time.max="20100201": any time before 1 Feb 2010.
  • { "qt": [ { "time": { "min": "02/10/2000", "max": "10 Feb 2001 13:00:00" } } ] }: from 10 Feb 2000 until 10 Feb 2001 at 1pm.
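The millisecond timestamp used in the first example can be produced as in the following sketch. The helper name to_unix_millis is hypothetical; it expects a timezone-aware datetime and uses integer arithmetic to avoid floating point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(moment: datetime) -> str:
    """Return milliseconds since the Unix epoch as a string.

    The datetime must be timezone-aware; a naive one raises TypeError.
    """
    return str((moment - EPOCH) // timedelta(milliseconds=1))


start = datetime(2010, 9, 16, 19, 52, 37, 164000, tzinfo=timezone.utc)
term = {"time": {"min": to_unix_millis(start), "max": "now"}}
print(term)  # {'time': {'min': '1284666757164', 'max': 'now'}}
```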

Associations

The association query format is slightly more complex than the others. It is also slightly more limited.

The association format is as follows:

Event format
"assoc": {
	"entity1": { ... }, // the "subject"; can be ftext, etext, or entity/entityValue/entityType query terms
	"entity2": { ... }, // the "object"; can be ftext, etext, or entity/entityValue/entityType query terms

	"verb": string,

	"geo": { ... }, // geo query term
	"time": { ... }, // time query term

	"type": string // "Event", "Fact", or "Summary"
},
"sentiment": { // (optional, specify one or both of min/max, see below)
	"min": number, 
	"max": number
}

As can be seen from the above code block, the association query term is a composite of other query term types (free text, exact text and entity terms for "entity1" and "entity2"; also temporal and geospatial).

Some things to note while performing association queries:

  • The "entity1" field is processed as follows:
    • "ftext" and "etext" terms are applied across both the "entity1" and "entity1_index" fields within the association object.
    • entity/entityValue/entityType terms are only applied to the "entity1_index" field
  • The "entity2" field is processed analogously 
  • The "verb" string is applied as an exact text query to the "verb_category" field and a free text query to the "verb" field within the association object.
  • For events with a time range ("time_start" and "time_end" fields), any part of the event time range can match the "time" term.
  • The difference between "Events", "Facts" or "Summaries" is described here (see "assoc_type").
  • If multiple terms are specified then these are ANDed together. There is currently no way of performing more complex boolean expressions on individual events (though multiple event query terms can be specified, in which case they match across all events within a document).
  • If sentiment is specified then only documents containing associations whose sentiment field exists (this is somewhat rare) and is in the selected range are selected.
  • Event queries with multiple terms can be a bit slower than other queries (due to their implementation in ElasticSearch).

     

Examples

Example event queries
// Any fact in which Barack Obama is the subject:
{
	"assoc": {
		"entity1": {
			"entity": "barack obama/person"
		},
		"type":"Fact"
	}
}
// Travel associations involving Sarah Palin:
{
	"assoc": {
		"entity1": {
			"entityValue":"sarah palin",
			"entityType":"person"
		},
		"verb": "travel"
	}
}
// Events in the future:
{
	"assoc": {
		"time": {
			"min": "now"
		},
		"type":"Event"
	}
}
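When building association terms programmatically, a small helper such as the sketch below keeps the composite structure consistent. The function build_assoc_term and the constant ASSOC_TYPES are hypothetical names for illustration; the only checks performed are the ones stated above (the allowed "type" values), and unset fields are simply left out.

```python
import json

ASSOC_TYPES = {"Event", "Fact", "Summary"}


def build_assoc_term(entity1=None, entity2=None, verb=None,
                     geo=None, time=None, assoc_type=None) -> dict:
    """Assemble an "assoc" query term, omitting any field left as None."""
    if assoc_type is not None and assoc_type not in ASSOC_TYPES:
        raise ValueError(f"type must be one of {sorted(ASSOC_TYPES)}")
    fields = {
        "entity1": entity1,
        "entity2": entity2,
        "verb": verb,
        "geo": geo,
        "time": time,
        "type": assoc_type,
    }
    return {"assoc": {key: value for key, value in fields.items() if value is not None}}


# Any fact in which Barack Obama is the subject:
term = build_assoc_term(entity1={"entity": "barack obama/person"}, assoc_type="Fact")
print(json.dumps({"qt": [term]}))
```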

 

Combining Query Terms

Multiple query terms can be combined in 2 ways:

  • Using the "logic" field as described above under Overview of Query Terms. This is the standard way of combining separate queries.
  • In addition, within a single query term multiple elements of different types can be merged into a single object - this has the effect of ANDing them together. For example:
    • { "qt": [ { "entity": "barack obama/person", "time": { "min": "1284666757164", "max": "now" } } ] }: documents containing the entity Barack Obama, from 16 Sep 2010 until now.
    • qt[0].etext="apple"&qt[0].ftext="pair": this is equivalent to qt[0].etext="apple"&qt[1].ftext="pair"&logic="1 and 2"
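The second way of combining terms amounts to merging the fields of several term objects into one. A minimal sketch follows, with the hypothetical helper merge_terms; rejecting duplicate fields is a design choice of this sketch (a JSON object cannot hold the same key twice), not documented server behavior.

```python
def merge_terms(*terms: dict) -> dict:
    """Merge several single-type query terms into one object (implicitly ANDed)."""
    merged = {}
    for term in terms:
        overlap = merged.keys() & term.keys()
        if overlap:
            raise ValueError(f"field(s) specified more than once: {sorted(overlap)}")
        merged.update(term)
    return merged


entity_term = {"entity": "barack obama/person"}
time_term = {"time": {"min": "1284666757164", "max": "now"}}

# One merged term; no "logic" field is needed:
merged_query = {"qt": [merge_terms(entity_term, time_term)]}

# Equivalent query using separate terms and explicit logic:
explicit_query = {"qt": [entity_term, time_term], "logic": "1 AND 2"}
```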

Raw ElasticSearch Queries

At present ElasticSearch is used as the front end of the search engine. 

The system provides a "passthrough" interface for full (or "raw") ElasticSearch queries.  This capability is for advanced use only, and should be avoided where possible.

The ElasticSearch API and Query DSL are described in detail on the ElasticSearch web-site, and it is beyond the scope of this documentation to go into any more detail. Only the "search" call and the "facet" call (accessed from the "aggregation" object) are available.

The search call can be placed inside the "raw" object from the top level as follows:

ElasticSearch passthrough syntax and example
// Syntax:
{
	"raw": {
		// Put fields and objects from the top level ElasticSearch "query" object here
	}
}
// Example:
{
	"raw": {
		"match_all": {}
	}
}

Some things to be aware of when making raw queries:

  • "raw" queries override other query terms, so only the "raw" query is performed.
  • All other aspects of the query are still from the URL/JSON Object, ie:
    • Only documents from the specified communities and the specified inputs are processed
    • Once documents matching the "raw" query have been retrieved by the server, they are ranked according to the "score" object
    • The format and number of documents returned to the client are determined by the "output" object.
  • It is not currently possible to specify both "raw" queries and "raw" aggregations (facets) within a single query.
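Since "qt" and "logic" are ignored when "raw" is present, a client may prefer to drop them explicitly so the request reflects what is actually run. The sketch below does this; the helper name with_raw_query is hypothetical, and the "output" content shown is a placeholder rather than a documented field set.

```python
def with_raw_query(query: dict, raw: dict) -> dict:
    """Return a copy of the top level query that uses "raw" instead of "qt"/"logic".

    Other top level objects (eg "score", "output") are kept unchanged.
    """
    result = {key: value for key, value in query.items() if key not in ("qt", "logic")}
    result["raw"] = raw
    return result


original = {
    "qt": [{"etext": "barack obama"}],
    "logic": "1",
    "output": {},  # placeholder: contents not shown here
}
raw_query = with_raw_query(original, {"match_all": {}})
print(raw_query)  # {'output': {}, 'raw': {'match_all': {}}}
```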

 

Related Documentation:

ElasticSearch query API

Lucene syntax

Related User Documentation:

Plugin Manager