Overview

This toolkit element passes the document full text to an external (or embedded) extraction engine to return entities and associations (and optionally metadata).

...

Code Block
{
	"display": string,
	"featureEngine": {
		"criteria": string, // A JavaScript expression that is passed the document as _doc; if it returns false, this pipeline element is bypassed
		"engineName": string, // The name of the engine to use: either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration
		"engineConfig": {"config_param_name": string, ...}, // The configuration object to be passed to the engine
		"entityFilter": string, // Regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
		"assocFilter": string, // Regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
		"exitOnError": boolean // If true (default), errors during feature extraction cause the doc to be removed from the pipeline. If false, processing continues.
	}
}

Description

Feature extraction uses text obtained from the text extraction stage to generate entities, associations, and potentially metadata.  Text extraction is a separate stage in the pipeline with different extraction engines.

...

Field | Description
criteria | A JavaScript expression that is passed the document as _doc. If it returns false, this pipeline element is bypassed.
engineName | The name of the engine to use: either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration.
engineConfig | The configuration object to be passed to the engine.
entityFilter | Regex applied to entity indexes. Starts with "+" or "-" to indicate inclusion/exclusion; defaults to include-only.
assocFilter | Regex applied to new-line separated association indexes. Starts with "+" or "-" to indicate inclusion/exclusion; defaults to include-only.
exitOnError | If true (default), errors during feature extraction cause the doc to be removed from the pipeline. If false, processing continues.
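
As an illustration of the fields described above, a filled-in element might look like the following sketch. The values are hypothetical: the "criteria" expression assumes documents expose a "title" field, the filter pattern assumes entity indexes of the form "name/type", and "engineConfig" is left empty because its keys depend on the engine.

```json
{
	"display": "Example only: keep people and companies from OpenCalais",
	"featureEngine": {
		"criteria": "_doc.title != null",
		"engineName": "opencalais",
		"engineConfig": {},
		"entityFilter": "+.*/(person|company)$",
		"exitOnError": false
	}
}
```

With "exitOnError" set to false, a failed extraction leaves the document in the pipeline instead of removing it.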

engineConfig

The "engineConfig" configuration object is a set of string key/value pairs whose contents depend on the extractor type. The "pre-integrated" configurations are described below, eg:

...

IKANOW supports the following feature extraction engines:

  • Textrank* ("textrank")
  • OpenCalais* ("opencalais")
  • AlchemyAPI** ("alchemyapi")
  • AlchemyAPI-metadata** ("alchemyapi-metadata")
  • Salience* ("salience")
  • regex* - a mechanism for converting regexes into entities from text or metadata ("regex")

*requires a text extractor beforehand.

...

Extracts entities and associations using the free OpenCalais service. No sentiment analysis function is available at this time.

OpenCalais will truncate text that is larger than 99KB.

The following custom configuration parameters are available for OpenCalais and can be set using the engineConfig parameter.

...

Uses the Named Entity Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). AlchemyAPI has the ability to extract associations (as well as much more), but this feature has not yet been integrated into the tool.

There are two Alchemy services that can be called:

  • Alchemy API
  • Alchemy API-metadata*

*includes many of the same features as Alchemy API but also allows more advanced batching of documents and keyword control.

AlchemyAPI will truncate text longer than 145KB.

Both services support text extraction as well as feature extraction. However, if you only need to perform text extraction, use Alchemy API.
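
A minimal sketch of selecting one of the two Alchemy engines by its registered short name, here combined with an exclusion filter. The filter pattern and the "quantity" type are hypothetical and assume entity indexes of the form "name/type"; no "engineConfig" keys are shown because none are listed in this section.

```json
{
	"display": "Example only: AlchemyAPI entities, minus a hypothetical type",
	"featureEngine": {
		"engineName": "alchemyapi",
		"engineConfig": {},
		"entityFilter": "-.*/quantity$"
	}
}
```

Changing "engineName" to "alchemyapi-metadata" would select the other service with the same element structure.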

...