Overview
This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and optionally metadata).
...
    {
        "display": string,
        "featureEngine": {
            // A JavaScript expression that is passed the document as _doc;
            // if it returns false, this pipeline element is bypassed
            "criteria": string,
            // The name of the text engine to use: either fully qualified
            // (e.g. "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name
            // (e.g. "boilerpipe") if the engine is registered in the Infinit.e
            // system configuration
            "engineName": string,
            // The configuration object to be passed to the engine
            "engineConfig": { "config_param_name": string, ... },
            // Regex applied to entity indexes; starts with "+" or "-" to indicate
            // inclusion/exclusion, defaults to include-only
            "entityFilter": string,
            // Regex applied to new-line separated association indexes; starts with
            // "+" or "-" to indicate inclusion/exclusion, defaults to include-only
            "assocFilter": string,
            // If true (the default), errors during feature extraction cause the doc
            // to be removed from the pipeline. If false, processing continues.
            "exitOnError": boolean
        }
    }
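As an illustration, a filled-in element might look like the following sketch. The engine name "opencalais" is taken from the list of supported engines on this page; the display text, the criteria expression (including the `fullText` field on `_doc`), and both filter patterns are assumptions made up for the example, not documented values. The `engineConfig` object is left empty because its parameter names depend on the engine.

```javascript
// Hypothetical featureEngine pipeline element; all values are illustrative.
const featureEngineElement = {
  display: "Extract entities with OpenCalais",
  featureEngine: {
    // Assumption: the document exposes a fullText field to the expression.
    criteria: "_doc.fullText != null",
    engineName: "opencalais",
    // Engine-specific key/value string pairs would go here.
    engineConfig: {},
    // Hypothetical patterns: "+" keeps matching indexes, "-" drops them.
    entityFilter: "+.*/person$",
    assocFilter: "-.*unknown.*",
    // Keep processing the document even if extraction fails.
    exitOnError: false
  }
};

// Serialize to the JSON form that would be pasted into a source pipeline.
const featureEngineJson = JSON.stringify(featureEngineElement, null, 2);
console.log(featureEngineJson);
```

Setting `exitOnError` to false here is a design choice for the example: documents that fail extraction stay in the pipeline instead of being removed.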
Description
Feature extraction uses text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate stage in the pipeline with different extraction engines.
...
Field | Description |
---|---|
criteria | A JavaScript expression that is passed the document as _doc. If it returns false, this pipeline element is bypassed. |
engineName | The name of the text engine to use. This can be fully qualified (e.g. "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (e.g. "boilerpipe") if the engine is registered in the Infinit.e system configuration. |
engineConfig | The configuration object to be passed to the engine. |
entityFilter | Regex applied to entity indexes. Starts with "+" or "-" to indicate inclusion/exclusion; defaults to include-only. |
assocFilter | Regex applied to new-line separated association indexes. Starts with "+" or "-" to indicate inclusion/exclusion; defaults to include-only. |
exitOnError | If true (the default), errors during feature extraction cause the doc to be removed from the pipeline. If false, processing continues. |
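The "+"/"-" prefix convention used by entityFilter and assocFilter can be sketched as follows. This is a minimal illustration of the semantics described in the table, not the platform's actual implementation: the function names `parseFilter` and `applyFilter`, the sample index strings, and the choice of partial (find-style) matching rather than whole-string matching are all assumptions.

```javascript
// Split a filter string into its mode and regex.
// A leading "+" means include, a leading "-" means exclude;
// with no prefix the filter is treated as include-only.
function parseFilter(spec) {
  let include = true;
  let pattern = spec;
  if (spec.startsWith("+")) {
    pattern = spec.slice(1);
  } else if (spec.startsWith("-")) {
    include = false;
    pattern = spec.slice(1);
  }
  return { include, regex: new RegExp(pattern) };
}

// Keep the indexes that match (include mode) or that do not match (exclude mode).
// Assumption: a partial match anywhere in the index counts as a match.
function applyFilter(spec, indexes) {
  const { include, regex } = parseFilter(spec);
  return indexes.filter((index) => regex.test(index) === include);
}

// Hypothetical entity indexes, invented for this example.
const sampleIndexes = ["jane doe/person", "acme corp/company", "paris/city"];

console.log(applyFilter("+/person$", sampleIndexes)); // [ 'jane doe/person' ]
console.log(applyFilter("-/person$", sampleIndexes)); // [ 'acme corp/company', 'paris/city' ]
console.log(applyFilter("/city$", sampleIndexes));    // [ 'paris/city' ]
```

The association filter would work the same way, except that the regex is applied to the new-line separated association indexes rather than to a single entity index.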
engineConfig
The "engineConfig" configuration object is a set of string key/value pairs that depends on the extractor type. "Pre-integrated" configurations are described below, e.g.:
...
IKANOW supports the following feature extraction engines:
- Textrank* ("textrank")
- OpenCalais* ("opencalais")
- AlchemyAPI** ("alchemyapi")
- AlchemyAPI-metadata** ("alchemyapi-metadata")
- salience* ("salience")
- regex* - a mechanism for converting regexes into entities from text or metadata ("regex")
*requires a text extractor beforehand.
...
Extracts entities and associations using the free OpenCalais service. No sentiment analysis function is available at this time.
OpenCalais will truncate text that is larger than 99KB.
The following custom configuration parameters are available for OpenCalais and can be set using the engineConfig parameter.
...
Uses the Named Entity Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). AlchemyAPI has the ability to extract associations (as well as much more), but this feature has not yet been integrated into the tool.
There are two Alchemy services that can be called:
- Alchemy API
- Alchemy API-metadata*
*includes many of the same features of Alchemy API but also allows more advanced batching of documents and keyword control.
AlchemyAPI will truncate text longer than 145KB.
Both services support text extraction as well as feature extraction. However, if you only need to perform text extraction, Alchemy API should be used.
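Because both OpenCalais (99KB) and AlchemyAPI (145KB) truncate long text, it can be useful to check a document's size before it reaches the engine. The sketch below is one way to do that in Node.js. The helper name `willBeTruncated`, the interpretation of 1KB as 1024 bytes, and measuring size in UTF-8 bytes are assumptions; the services may count differently.

```javascript
// Truncation limits stated on this page.
// Assumption: 1 KB = 1024 bytes, measured on the UTF-8 encoded text.
const TRUNCATION_LIMIT_BYTES = {
  opencalais: 99 * 1024,
  alchemyapi: 145 * 1024
};

// Returns true if the text exceeds the stated limit for the given engine.
// Engines without a limit listed above are reported as not truncated.
function willBeTruncated(engineName, text) {
  const limit = TRUNCATION_LIMIT_BYTES[engineName];
  if (limit === undefined) {
    return false;
  }
  return Buffer.byteLength(text, "utf8") > limit;
}

// Example: a 100 KB ASCII document exceeds the OpenCalais limit only.
const sampleText = "a".repeat(100 * 1024);
console.log(willBeTruncated("opencalais", sampleText)); // true
console.log(willBeTruncated("alchemyapi", sampleText)); // false
```

A check like this could be used to log a warning or to split long documents upstream, rather than silently losing the tail of the text.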
...