Knowledge - Query - Output options
Output Format
{
    "output": {
        "format": string,
        "docs": { ... },
        "aggregation": { ... },
        "filter": { ... }
    }
}
Field Guide
Output Format
{ "output": { "format": string, // "json" (default), "xml", or "rss" } }
RSS
Integrating REST authentication with RSS readers is a known problem, and Community Edition currently provides a few different options:
If the query is made from a browser that is already logged in (e.g. via the "RSS" button in the GUI), or with a cookie obtained from login, then it works as normal.
(The cookie's lifetime is only 30 minutes (server-side configurable), so this is not a viable long-term option, e.g. for use in RSS readers.)
Clear text or encrypted password:
It is possible to use the "user=" and "pword=" URL parameters. The password can either be clear text (not recommended), or SHA-256 hashed and then Base64 encoded (an online SHA-256/Base64 generator can be used for testing).
- This enables users to generate arbitrary queries and store them in RSS readers, at the expense of exposing a password that others could then use in any other REST function.
- (A future version of the tool will force these queries over SSL, to mitigate this risk somewhat.)
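The encrypted "pword=" form above can be produced programmatically. The sketch below assumes the server expects the Base64 encoding of the raw SHA-256 digest of the password (consistent with the description above); verify against your deployment before relying on it.

```python
import base64
import hashlib

def encode_pword(password: str) -> str:
    """Encode a clear-text password as Base64(SHA-256(password)),
    the "encrypted" form accepted by the "pword=" URL parameter."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

# A SHA-256 digest is 32 bytes, so the Base64 form is always
# 44 characters ending in a single "=" pad character.
encoded = encode_pword("m1ckeym0use")
print(encoded)
```

Note that the Base64 output can contain "+" and "/" characters, which should be URL-encoded before being placed in the query string.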
Query-Specific Key:
A key for a specific query can be generated from the GUI, and then this query can be used without any authentication at all.
- This final option bypasses authentication entirely and is weak for several reasons (eg the key is protected only by "security through obscurity"), so it should be considered a temporary solution.
- (A future version will provide a "license key" that will be usable only for RSS, to replace this.)
When requesting RSS via a key, you will need to supply the user's community ID as the first ID in the "communityIds" array, e.g. {"communityIds":["USER_ID","ALL_OTHER_COMMS"]}.
This ensures the user has access to the communities you supply, since the API call is made unauthenticated.
For more information, see the API Reference.
Examples:
Examples of using RSS with various authentication methods:
- Username and clear text password (typed into the URL):
http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?input.tags="topic:news"&output.format="rss"&user=email@domain.com&pword=m1ckeym0use
- Username and encrypted password (typed into the URL):
http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?input.tags="topic:news"&output.format="rss"&user=email@domain.com&pword=u6JP3GVV5PSwvU4/1eFUj+kUPsAWhKa0eOqCFOrDyNQ=
Query-specific key:
curl -XPOST 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?key=GcC61AotZObZYxTgdKW9pbFTWdpCk8EkSIzQlCM3XDM' -d'{ "qt": [ { "ftext": "*" } ], "input": { "tags": [ "topic:news" ] },"output": { "format": "rss" }}'
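The key-based curl call above can also be assembled programmatically. The sketch below only constructs the URL and JSON body (no network call); "USER_ID" and "ALL_OTHER_COMMS" are the placeholders from the description above, and the document/key IDs are the example values from the curl command.

```python
import json
from urllib.parse import urlencode

# Example values taken from the curl command above; substitute your own.
QUERY_URL = "http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a"
QUERY_KEY = "GcC61AotZObZYxTgdKW9pbFTWdpCk8EkSIzQlCM3XDM"

# When using a query-specific key, the calling user's community ID must be
# the FIRST entry in "communityIds", per the note above.
payload = {
    "qt": [{"ftext": "*"}],
    "input": {"tags": ["topic:news"]},
    "output": {"format": "rss"},
    "communityIds": ["USER_ID", "ALL_OTHER_COMMS"],
}

url = QUERY_URL + "?" + urlencode({"key": QUERY_KEY})
body = json.dumps(payload)
print(url)
print(body)
```

The resulting url and body would then be POSTed with any HTTP client (curl, requests, etc).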
Note finally that the RSS format (unlike XML and JSON) only provides the document URLs, and none of the metadata.
Document Formats
{
    "output": {
        "docs": {
            // Whether to return documents/how many:
            "enable": boolean, // (defaults to true)
            "numReturn": integer, // (defaults to 100, maximum is 10K - not advised unless all scoring is turned off)
            "skip": integer, // (defaults to 0)
            // Alternative/complement to documents:
            "eventsTimeline": boolean, // (defaults to false)
            "numEventsTimelineReturn": integer, // (defaults to 1000)
            // Which sub-objects to return per document:
            "ents": boolean, // (all of these default to true)
            "geo": boolean,
            "events": boolean,
            "facts": boolean,
            "summaries": boolean,
            "metadata": boolean
        }
    }
}
Controlling the Number of Documents Returned
As described in the section on scoring, documents are sorted according to a scoring algorithm and retrieved in order. The "numReturn" field dictates how many are returned to the user, and the "skip" field allows a primitive form of paging (eg "?output.docs.numReturn=10&skip=0", "?output.docs.numReturn=10&skip=10", "?output.docs.numReturn=10&skip=20", etc).
There are two reasons to limit the number of documents (we find 100 works quite well in general visualization GUIs):
- Performance.
- The point of the Infinit.e tool is to extract "knowledge" from a corpus of documents, so cluttering the display with a large number of documents can be counter-productive.
If "&output.docs.enable=false" then no documents are returned.
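The paging scheme above can be wrapped in a small helper. This is an illustrative sketch only; it assumes the fully qualified parameter names "output.docs.numReturn" and "output.docs.skip" (the text above abbreviates the latter to "skip").

```python
from urllib.parse import urlencode

def page_params(page: int, page_size: int = 10) -> str:
    """Build the output.docs paging parameters for a zero-based page:
    page 0 -> skip=0, page 1 -> skip=10, page 2 -> skip=20, etc."""
    return urlencode({
        "output.docs.numReturn": page_size,
        "output.docs.skip": page * page_size,
    })

print(page_params(0))  # output.docs.numReturn=10&output.docs.skip=0
print(page_params(2))  # output.docs.numReturn=10&output.docs.skip=20
```

The query string fragment would be appended to the usual /knowledge/query/ URL.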
Controlling the Format of Documents Returned
The document format is described here. As can be seen there, documents have a number of sub-objects: entities, events (which are then sub-divided into "Events", "Facts", and "Summaries"), and source-specific metadata. (See the Source Pipeline documentation and the documentation on Metadata, Entities, and Associations for more information.)
The "ents", "geo", "events", "facts", "summaries", and "metadata" fields are simply booleans that control whether these sub-objects are included. The main reason for not including them is to avoid cluttering up and slowing down requests where they are not needed.
Note that "geo" controls whether entities with (lat,long) coordinates are included - eg in geospatial apps it may be that most entities are not of interest but geotagged ones are, in which case the pairing "&ents=false&geo=true" would be used.
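For the geospatial case just described, the full set of flags might be assembled as below. This is a sketch; it assumes the booleans take the "output.docs." prefix when passed as URL parameters (as "&output.docs.enable=false" does above).

```python
from urllib.parse import urlencode

# Geospatial app: drop the general sub-objects but keep geotagged entities.
doc_flags = {
    "output.docs.ents": "false",
    "output.docs.geo": "true",
    "output.docs.events": "false",
    "output.docs.facts": "false",
    "output.docs.summaries": "false",
    "output.docs.metadata": "false",
}
query_fragment = urlencode(doc_flags)
print(query_fragment)
```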
Events Timeline
The most significant document parameter is "output.docs.eventsTimeline". This generates a new output array, consisting of the "event" sub-objects (for "Events", "Facts", and "Summaries").
Events are taken from the top "score.numAnalyze" matching documents - the "top" events are then returned, where "top" is based on the Pythagorean sum across the documents containing each event. Note that "score.numAnalyze" currently controls two other important output components:
- How many documents are analyzed and scored (using a combination of Lucene/Significance) to determine which ones to return as part of the document output.
- How many entities are analyzed and scored (Significance only) to determine which ones to return as part of entity aggregations.
The fields "output.docs.events", "output.docs.facts", "output.docs.summaries" control which of "Events", "Facts" and "Summaries" are included in the timeline.
Example query/output (link to JSON format specification):
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext="*"&input.tags="topic:technology"&output.docs.enable=false&output.docs.eventsTimeline=true'
{
    response: { ... },
    stats: { ... },
    eventsTimeline: [
        {
            entity1: "ev solar carport",
            entity1_index: "ev solar carport/facility",
            verb: "deliver",
            verb_category: "generic relations",
            entity2: "125 mw hours",
            event_type: "Summary",
            time_start: "2011-05-26",
            assoc_sig: 126.6919059017276,
            doccount: 1
        },
        {
            entity1: "aol advertising.com group",
            entity1_index: "aol advertising.com group/company",
            verb: "include",
            verb_category: "generic relations",
            entity2: "advertising.com",
            entity2_index: "advertising.com, inc./company",
            event_type: "Fact",
            time_start: "2011-06-01",
            assoc_sig: 126.6919059017276,
            doccount: 4,
            time_end: "2011-06-27"
        },
        //etc
    ]
}
"Events" and "Summaries" with the same time range (ie "time_start", "time_end" pair) are aggregated, with "doccount" used to store the sum. For "Facts", the "time_start" and "time_end" are set to the newest and oldest dates in which the "Fact" occurs (ie this may give some useful time range over which it is being discussed), with "doccount" counting all instances of the "Fact" within that time range.
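The roll-up just described can be sketched as follows. This is a client-side illustration of the semantics (the server performs the real aggregation); the grouping key is an assumption based on the description above: the association plus its (time_start, time_end) pair.

```python
from collections import defaultdict

def aggregate_timeline(events):
    """Merge timeline entries sharing the same association and
    (time_start, time_end) pair, accumulating "doccount"."""
    merged = defaultdict(int)
    for ev in events:
        key = (
            ev.get("entity1_index"), ev.get("verb"), ev.get("entity2_index"),
            ev.get("time_start"), ev.get("time_end"),
        )
        merged[key] += ev.get("doccount", 1)
    return dict(merged)

# Two mentions of the same hypothetical event in the same time range:
sample = [
    {"entity1_index": "a/person", "verb": "meet", "entity2_index": "b/person",
     "time_start": "2011-06-01", "doccount": 2},
    {"entity1_index": "a/person", "verb": "meet", "entity2_index": "b/person",
     "time_start": "2011-06-01", "doccount": 3},
]
result = aggregate_timeline(sample)
print(result)  # one merged entry with doccount 5
```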
Aggregation Formats
Format
{
    "output": {
        "aggregation": {
            // Geo-spatial/temporal aggregations (all of these default to 0):
            "geoNumReturn": integer,
            "timesInterval": string,
            // Entities, events, facts:
            "entsNumReturn": integer,
            "eventsNumReturn": integer,
            "factsNumReturn": integer,
            // "Moments", temporal aggregation for entities:
            "moments": { ... },
            // Source information:
            "sources": integer,
            "sourceMetadata": integer, // (includes both tags and types)
            // Raw ElasticSearch "facets":
            "raw": string // (see below)
        }
    }
}
The configuration of aggregation outputs is relatively simple, but this section also covers the different output formats:
- Geo-spatial
- Temporal
- Entities
- Events and facts
- Moments
- Sources and source metadata
- Raw access to ElasticSearch "facets"
Note that a (near-) future release will provide a more generic and powerful aggregation interface, allowing various document, entity, event, and metadata properties to be aggregated over time, space, and frequency.
Geo-spatial Aggregation
{ "output": { "aggregation": { "geoNumReturn": integer, } } }
All of the "[type]NumReturn" fields simply configure the number of entries returned (in the case of "geo", (lat,long) pairs), in order of frequency in the query-matching dataset.
The output format and an example are shown below:
{
    //...
    "geo": [ {
        "type": string, // the ontology type - see below
        "lat": number,
        "lon": number,
        "count": integer // the number of occurrences
    } ],
    "maxGeoCount": number
    //...
}
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext=%22*%22&input.tags=%22topic:technology%22&output.aggregation.geoNumReturn=100&output.docs.enable=false' { response: { action: "Query" success: true message: "((*))" time: 296 }, stats: { found: 53314 start: 0 maxScore: 0 avgScore: 0 }, geo: [ { type: "city" lat: 37.77499996125698 lon: -122.4183003231883 count: 1494 }, { type: "point" lat: 47.60639989748597 lon: -122.3308002948761 count: 442 }, { type: "geographicalregion" lat: 37.441899944096804 lon: -122.14190002530813 count: 206 }, //(etc) ], maxGeoCount: 9910 }
The "maxGeoCount" field in the top-level response is simply the highest count that occurs in the list (which is ordered by the underlying geohash used to store the lat/long). This can be used to calculate scaling factors without first having to traverse the return array.
The "type" field (ontological type) is discussed under the Geo JSON format.
A typical use of the "geo" aggregation is to show heatmaps: in this case a "geoNumReturn" value of at least 1000 is recommended for large datasets.
Temporal Aggregation
{ "output": { "aggregation": { "timesInterval": string, } } }
Temporal aggregation has a different configuration parameter from the others. Instead of specifying a number of entries to return, a string specifies the interval over which a document count is to be summed. This is in the standard format "N[hdwmy]", ie an integer followed by h (hour), d (day), w (week), m (month), or y (year).
Note that if "m" (month) is the interval unit, then the aggregation is always performed over 1-month intervals, regardless of the "N" value.
The output format is very simple:
{
    //...
    "times": [ {
        "time": long,
        "count": integer // the number of occurrences
    } ],
    "timeInterval": long,
    //...
}
The "time" field in the "times" array is the start time of the interval in "ms" Unix time (milliseconds since 1970). The top-level "timeInterval" is the duration of each interval in ms, ie each interval can be expressed as ["times.time", "times.time" + "timeInterval"].
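The "N[hdwmy]" format and the interval arithmetic can be sketched as below. Note the month/year conversions here are fixed 30/365-day approximations for illustration only (calendar months vary, and as noted above the server always uses 1-month buckets for "m").

```python
MS_PER_UNIT = {
    "h": 3_600_000,
    "d": 86_400_000,
    "w": 604_800_000,
    "m": 30 * 86_400_000,   # approximation; see caveat above
    "y": 365 * 86_400_000,  # approximation
}

def interval_ms(spec: str) -> int:
    """Parse the "N[hdwmy]" interval format, e.g. "1w" -> 604800000 ms."""
    n, unit = int(spec[:-1]), spec[-1]
    return n * MS_PER_UNIT[unit]

def interval_bounds(time: int, time_interval: int):
    """Each bucket covers ["times.time", "times.time" + "timeInterval"]."""
    return (time, time + time_interval)

print(interval_ms("1w"))  # 604800000, matching the example reply below
print(interval_bounds(977702400000, interval_ms("1w")))
```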
Note that, unlike geo-spatial aggregation, no maximum count is provided, even though the document counts are sorted by time rather than by "count" (so the maximum cannot simply be read from the first entry). This is part oversight, and part because it is not so (performance-)critical to know the scaling factors in advance; it will probably be corrected in a future release.
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext=%22*%22&input.tags=%22topic:technology%22&output.aggregation.timesInterval="1w"&output.docs.enable=false' { response: { ... }, stats: { ... }, times: [ { time: 977702400000 count: 2 }, { time: 993427200000 count: 1 }, { time: 1041206400000 count: 3 }, //etc ], timeInterval: 604800000 }
See also the more granular per-entity temporal aggregation available using "moments".
Entity Aggregation
Entity aggregation preserves the format of the "entities" sub-objects of the document, but across all documents in the query-matching dataset.
- (In fact, due to implementation limitations, currently only the top "score.numAnalyze" (eg 1000) documents are used to generate the entity aggregations.)
{ "output": { "aggregation": { "entsNumReturn": integer, } } }
It is worth noting that entities are returned in descending significance order (the other sorted aggregation types, such as "geo" and "events", are ranked by frequency and are actually generated from the entire matching dataset rather than a subset). A future release will try to standardize the use of frequency vs significance, and also remove the use of subsets where possible.
The entity output format is identical to the entity sub-object described here, except that the per-document fields ("significance" and "frequency") are replaced with the maximum per-document values in the matching sub-set, and some other fields ("actual_name", "relevance", "sentiment") are not present.
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext=%22*%22&input.tags=%22topic:technology%22&output.aggregation.entsNumReturn=10&output.docs.enable=false' { response: { ... }, stats: { ... }, entities: [ { dimension: "Who" disambiguated_name: "LulzSec" doccount: 35 frequency: 26 index: "lulzsec/organization" totalfrequency: 348 type: "Organization" significance: 6.874384562114902 datasetSignificance: 6.001782525850459 queryCoverage: 0.03835649052289408 averageFreq: 0.054 }, { dimension: "Who" disambiguated_name: "Oracle Corporation" doccount: 827 frequency: 20 index: "oracle corporation/company" linkdata: [ http://d.opencalais.com/er/company/ralg-tr1r/eab9bfaa-47f1-368a-a9b7-a87bb345cf30 ] totalfrequency: 2541 type: "Company" significance: 9.882910950748887 datasetSignificance: 5.5199530144758615 queryCoverage: 0.8277825954323541 averageFreq: 0.076 }, //etc ] }
Good "entsNumReturn" values vary with application. For a document set that will not be filtered, 100 is a good value. For "recommendation" displays (eg "Other entities you may be interested in"), as few as 5-10 works fine. For larger datasets where the user will filter down from the initial return set then 1000+ is recommended.
See also the more granular per-entity temporal aggregation available using "moments".
Event and Fact Aggregation
As described under their format specification, events are split into 3 categories:
- "Events": link multiple entities (via "entity1_index", "entity2_index", "geo_index") and represent a transient activity (eg travel)
- "Facts": link multiple entities like "Events" but represent (transient or permanent) relationships (eg being president)
- "Summaries": generally link 1 entity to a free text (eg a quotation: "Obama says...").
Summaries cannot currently be aggregated (except manually, or via the "output.docs.eventsTimeline" function), because of (surmountable but non-trivial) implementation issues combined with its perceived low priority. It is unclear whether this will be added in the future.
The configuration format is straightforward. As described under entities, events and facts are ranked by frequency not significance (but this is likely to be an option in the future).
{ "output": { "aggregation": { "eventsNumReturn": integer, "factsNumReturn": integer, } } }
Similar to entities, the event/fact output format is essentially the same as the event sub-object format described under "Documents and their sub-objects (entities, associations, user metadata, aggregations)", although fewer fields are populated:
{
    //...
    "events": [ {
        "event_type": "Event",
        "entity1_index": string,
        "verb_category": string,
        "entity2_index": string,
        "geo_index": string,
        "assoc_sig": number, // A significance score for the association object (see below)
        "entity1_sig": number, // A significance score for entity1, if present
        "entity2_sig": number, // A significance score for entity2, if present
        "geo_sig": number, // A significance score for the geo, if present
        "doccount": integer // the number of occurrences
    } ],
    "facts": [ {
        "event_type": "Fact",
        "entity1_index": string,
        "verb_category": string,
        "entity2_index": string,
        "geo_index": string,
        "assoc_sig": number, // A significance score for the association object (see below)
        "entity1_sig": number, // A significance score for entity1, if present
        "entity2_sig": number, // A significance score for entity2, if present
        "geo_sig": number, // A significance score for the geo, if present
        "doccount": integer // the number of occurrences
    } ],
    //...
}
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext=%22*%22&input.tags=%22topic:technology%22&output.aggregation.eventsNumReturn=2&output.aggregation.factsNumReturn=2&output.docs.enable=false' { response: { ... }, stats: { ... }, events: [ { event_type: "Event" entity1_index: "microsoft corporation/company" verb_category: "acquisition" entity2_index: "skype technologies s.a./company" assoc_sig: 13.454 entity1_sig: 12.5 entity2_sig: 14.5 doccount: 69 }, { event_type: "Event" entity1_index: "google inc./company" verb_category: "product release" entity2_index: "google+/product" assoc_sig: 35.123324 entity1_sig: 10.5123 entity2_sig: 94.235435 doccount: 14 } ], facts: [ { event_type: "Fact" entity1_index: "google inc./company" verb_category: "company product" entity2_index: "android/product" assoc_sig: 23.34546 entity1_sig: 10.5123 entity2_sig: 54.235576 doccount: 244 }, { event_type: "Fact" entity1_index: "tom XXX/person" verb_category: "person email address" entity2_index: "tXXX5@bloomberg.net/emailaddress" assoc_sig: 43.234324 entity1_sig: 43.234324 entity2_sig: 0 doccount: 72 } ] }
Note that "time_start" and "time_end" are aggregated out of the object, ie all time information is lost. To aggregate events over time, use "output.docs.eventsTimeline". It is likely at some point that the two formats will be combined somehow. At present, "output.aggregation.eventsNumReturn" and "output.aggregation.factsNumReturn" are best used with "link analysis" style applications, and "output.docs.eventsTimeline" is best used for timeline-style applications.
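As an illustration of the "link analysis" use, the aggregated events/facts arrays can be collapsed into a weighted edge list between entity indexes. This is a client-side sketch; the sample data reuses values from the example reply above.

```python
from collections import defaultdict

def build_link_graph(events):
    """Turn aggregated events/facts into weighted edges between entity
    indexes, with "doccount" as the edge weight."""
    edges = defaultdict(int)
    for ev in events:
        e1, e2 = ev.get("entity1_index"), ev.get("entity2_index")
        if e1 and e2:
            edges[(e1, e2)] += ev.get("doccount", 0)
    return dict(edges)

sample = [
    {"event_type": "Event", "entity1_index": "microsoft corporation/company",
     "entity2_index": "skype technologies s.a./company", "doccount": 69},
    {"event_type": "Fact", "entity1_index": "google inc./company",
     "entity2_index": "android/product", "doccount": 244},
]
graph = build_link_graph(sample)
print(graph)
```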
Moments: Per Entity Temporal Aggregation
The "moments" function allows the entity aggregation to be combined with the "times" aggregation, generating a list of time periods in which named entities were mentioned, together with counts of the mentions for each time period.
Coming versions of the platform will enhance this capability further, eg providing aggregated sentiment for the named entities.
{
    //...
    "output": {
        //...
        "aggregation": {
            //...
            "moments": {
                "timesInterval": string, // the time period over which the values are aggregated - same format as "output.aggregation.timesInterval"
                "geoNumReturn": integer, // (ALPHA) For each time interval, a list of geo buckets in the same format as "response.geo" (including maxGeoCount)
                "entityList": [ string ] // (BETA) A list of entity indexes, eg "barack obama/person"
            }
            //...
        }
        //...
    }
    //...
}
Note that if no "output.aggregation.moments.timesInterval" is set, then the interval will be taken from "output.aggregation.timesInterval" if available, and will default to 1 month otherwise.
Note that the entityList respects aliases.
REQUEST (fragment) { "moments": { "timesInterval": "1m", "geoNumReturn": 2, "entityList": [ "barack obama/person", "mitt romney/person" ] } } REPLY (fragment) { "moments": { "times": [ { "time": 1346457600000, "count": 1, "maxGeoCount": 2936, "geo": [ { "lat": 40.71416669525206, "lon": -74.00638880208135, "count": 2936, "type": "city" } ] } ], "barack obama/person": [ { "time": 1346457600000, "count": 1 }, { "time": 1351728000000, "count": 2 }, { "time": 1354320000000, "count": 1 }, { "time": 1356998400000, "count": 9 } ], "mitt romney/person": [ { "time": 1346457600000, "count": 1 }, { "time": 1351728000000, "count": 1 } ] } }
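As the reply fragment shows, each requested entity's time series sits alongside the overall "times" array, keyed by its entity index. A small helper for extracting one series might look like this (client-side sketch; the sample reply is a trimmed version of the fragment above):

```python
def entity_series(moments, entity_index):
    """Pull one entity's (time, count) series out of a "moments" reply."""
    return [(pt["time"], pt["count"]) for pt in moments.get(entity_index, [])]

reply = {
    "times": [{"time": 1346457600000, "count": 1}],
    "barack obama/person": [
        {"time": 1346457600000, "count": 1},
        {"time": 1356998400000, "count": 9},
    ],
}
series = entity_series(reply, "barack obama/person")
print(series)  # [(1346457600000, 1), (1356998400000, 9)]
```

An entity that was requested but never mentioned simply yields an empty list.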
Sources and Source Metadata Aggregation
It can often be useful to understand what sources/source categories documents are being returned from. Community Edition allows the following aggregations:
- Individual sources (using the "sourceKey" field of the document object, ie the "key" field of the "source" object)
- Source types (using the "mediaType" field of the document object, ie the "mediaType" field of the "source" object)
- Source tags (using the "tags" field of the document object, ie the "tags" field of the "source" object)
- There is currently no concept of "per document" tags (eg auto-generated from the document content), though there may be in the future.
- Note that the Infinit.e "system collection" contains two "top level" tags, "topic:<tag>" and "industry:<tag>" (its general, or "content" tags tend to be quite low level and difficult to use)
The first of these is configured by "output.aggregation.sources", the second two by "output.aggregation.sourceMetadata":
{ "output": { "aggregation": { "sources": integer, "sourceMetadata": integer, } } }
The output format is very simple: arrays "sources", "sourceMetaTags", and "sourceMetaTypes", with fields "term" (string) and "count" (integer).
Examples:
//curl -XGET 'http://infinite.ikanow.com/api/knowledge/query/4c927585d591d31d7b37097a?qt[0].etext=%22*%22&input.tags=%22topic:technology%22&output.aggregation.sources=5&output.aggregation.sourceMetadata=5&output.docs.enable=false' { "response": { "action": "Query", "success": true, "message": "((*))", "time": 140 }, "stats": { "found": 53560, "start": 0, "maxScore": 0, "avgScore": 0 }, "sources": [ { "term": "feed:..origin.feeds.pheedo.com.bw.technology_news-rss", "count": 11182 }, { "term": "http.gizmodo.com.index.xml", "count": 3102 }, { "term": "http.www.reddit.com.r.technology..rss", "count": 2391 }, { "term": "feed:..origin.feeds.pheedo.com.bw.energy_news-rss", "count": 2390 }, { "term": "http.www.engadget.com.rss.xml", "count": 1687 } ], "sourceMetaTags": [ { "term": "topic:technology", "count": 53560 }, { "term": "news", "count": 44708 }, { "term": "industry:technology", "count": 38355 }, { "term": "technology", "count": 30871 }, { "term": "industry:all", "count": 15205 } ], "sourceMetaTypes": [ { "term": "News", "count": 53281 }, { "term": "Video", "count": 279 } ] }
Raw Access to ElasticSearch "Facets"
As with queries, it is possible simply to pass an arbitrary "facet" (ie aggregation) through to ElasticSearch, using its raw API. As with raw queries, this functionality should be considered an absolute last resort.
Unlike raw queries, where the raw ElasticSearch query is specified as a JSON object, raw facets are specified as the string form of the JSON object (this may change in a future release). Example:
{ "output": { "aggregation": { "raw": "{\"sources\":{\"terms\":{\"field\":\"sourceKey\",\"size\":5}}}" } } }
The above example is functionally equivalent to specifying "&output.aggregation.sources=5", except that the output appears in the array "facets.sources" instead of the top-level "sources", eg:
{ "response": { "action": "Query", "success": true, "message": "((*))", "time": 140 }, "stats": { "found": 53560, "start": 0, "maxScore": 0, "avgScore": 0 }, "facets": { "sources": [ { "term": "feed:..origin.feeds.pheedo.com.bw.technology_news-rss", "count": 11182 }, { "term": "http.gizmodo.com.index.xml", "count": 3102 }, { "term": "http.www.reddit.com.r.technology..rss", "count": 2391 }, { "term": "feed:..origin.feeds.pheedo.com.bw.energy_news-rss", "count": 2390 }, { "term": "http.www.engadget.com.rss.xml", "count": 1687 } ], // +Any other "raw" facets specified } }
Note finally that, as with queries, specifying any facets overrides all Infinit.e aggregations (except currently for entities, though this may change).
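Because "raw" takes the string form of the facet JSON, the safest approach is to build the facet as an object and serialize it, rather than hand-escaping quotes. The sketch below reproduces the "sources" example above:

```python
import json

# Build the ElasticSearch facet as a normal object...
facet = {"sources": {"terms": {"field": "sourceKey", "size": 5}}}

# ...then serialize it to the compact string form that "raw" expects.
output_spec = {
    "output": {
        "aggregation": {
            "raw": json.dumps(facet, separators=(",", ":")),
        }
    }
}
print(json.dumps(output_spec))
```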
Filtering
{
    "output": {
        "filter": {
            "entityTypes": [ string ], // A list of (case sensitive) entity types - if specified, non-matching entities, associations, and documents will be discarded
            "assocVerbs": [ string ] // A list of (case sensitive) verb categories - if specified, non-matching associations and documents will be discarded
        }
    }
}
Either of the above filters can be made "negative" by inserting a "-" in front of the first entry in the array. Negative filtering simply removes all entities or associations that match the filter from the document (together with their contribution to its score). Note that queries can still match on negatively filtered entities and associations.
Examples:
//
// Twitter example: only pull back hashtags and twitter handles from tweets (and discard documents and associations not containing either)
//
{ "output": { "filter": { "entityTypes": [ "HashTag", "TwitterHandle" ] } } }
//
// Negative twitter example: remove all keywords and locations extracted for the tweet
//
{ "output": { "filter": { "entityTypes": [ "-Keyword", "Location" ] } } }
//
// Twitter example: pull back all entities, but only tweets that are retweets (and only retweet associations)
//
{ "output": { "filter": { "assocVerbs": [ "retweets" ] } } }
//
// Twitter example: this will discard all associations (because "retweets" are always associations between TwitterHandle types)
//
{ "output": { "filter": { "entityTypes": [ "Keyword" ], "assocVerbs": [ "retweets" ] } } }
//
// Business acquisition example
//
{ "output": { "filter": { "entityTypes": [ "Company", "Organization" ], "assocVerbs": [ "acquires" ] } } }
There is one noteworthy implementation issue: the association verb category is stored in such a way that subsets of the phrase will match. For example, "generic" or "relations" will match on "generic relations", and "generic relations" will match on "generic relations (special case)".
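The positive/negative "entityTypes" semantics described above can be sketched client-side as follows. This is an illustration of the documented behavior, not the server implementation; matching is kept case sensitive, per the format description.

```python
def apply_entity_filter(entities, entity_types):
    """Sketch of "output.filter.entityTypes" semantics: a leading "-" on the
    first entry makes the whole filter negative (matching entities removed);
    otherwise only matching entities are kept."""
    negative = bool(entity_types) and entity_types[0].startswith("-")
    types = {t.lstrip("-") for t in entity_types}
    if negative:
        return [e for e in entities if e["type"] not in types]
    return [e for e in entities if e["type"] in types]

ents = [{"type": "HashTag"}, {"type": "Keyword"}, {"type": "Location"}]

# Positive filter: keep only hashtags and twitter handles.
kept = apply_entity_filter(ents, ["HashTag", "TwitterHandle"])
print(kept)  # [{'type': 'HashTag'}]

# Negative filter: remove keywords and locations.
remaining = apply_entity_filter(ents, ["-Keyword", "Location"])
print(remaining)  # [{'type': 'HashTag'}]
```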