Overview

This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and optionally metadata).

...

Code Block
{
	"display": string,
	"featureEngine": {
		"criteria":string,// A JavaScript expression that is passed the document as _doc; if it returns false, this pipeline element is bypassed
		"engineName":string,// The name of the text engine to use (can be fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration)
		"engineConfig":{"config_param_name":string,...},// The configuration object to be passed to the engine
		"entityFilter":string,// (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"assocFilter":string,// (regex applied to new-line separated association indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only) 
	}
}
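
The "+"/"-" prefix convention used by "entityFilter" can be illustrated with a short JavaScript sketch. This is not the platform's implementation; it only mirrors the behavior described in the comments above. The function name applyIndexFilter, the sample indexes, and the choice to match the regex anywhere in the index (rather than against the whole string) are assumptions made for illustration.

```javascript
// Sketch only: mirrors the documented "+"/"-" prefix rule, not the real engine.
// Assumption: the regex may match anywhere in the index (partial match).
function applyIndexFilter(filter, indexes) {
  if (!filter) {
    return indexes.slice(); // no filter configured: keep everything
  }
  let exclude = false;
  let pattern = filter;
  if (filter.startsWith("+") || filter.startsWith("-")) {
    exclude = filter.startsWith("-"); // "-" drops matches, "+" keeps them
    pattern = filter.slice(1);
  }
  // With no prefix, the filter defaults to include-only.
  const regex = new RegExp(pattern);
  return indexes.filter((index) => regex.test(index) !== exclude);
}

// Hypothetical entity indexes, for illustration only.
const entities = [
  "barack obama/person",
  "washington/location",
  "ikanow/company",
];

const kept = applyIndexFilter("+/person$", entities);
const dropped = applyIndexFilter("-/location$", entities);
const defaulted = applyIndexFilter("/company$", entities);
```

Here kept contains only the index ending in "/person", dropped contains everything except the index ending in "/location", and defaulted shows that an unprefixed filter behaves as include-only. The sketch does not cover "assocFilter", which applies its regex to new-line separated association indexes.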

Description

Feature extraction uses the text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate pipeline stage with its own extraction engines.

...

Code Block
{
	"featureEngine": {
		"engineName": "regex",
		"engineConfig": {
			"url,sourceUrl/What/FileType": "s/\\.([a-z]{3})$/$1/i",
			"/[^.]*url$|metadata\\..*filename.*/i/What/FileName": "s/[^\\/]+\\.[a-z]{3}$/i"
		}
	}
}

Build up complex sets of fields using "field variables"

...