Overview

This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and optionally metadata).

Format

{
	"display": string,
	"featureEngine": {
		"criteria":string,// A javascript expression that is passed the document as _doc - if it returns false then this pipeline element is bypassed
		"engineName":string,// The name of the extraction engine to use (can be fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration)
		"engineConfig":{"config_param_name":string,...},// The configuration object to be passed to the engine
		"entityFilter":string,// (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"assocFilter":string,// (regex applied to new-line separated association indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only) 
	}
}

Description

Feature extraction uses text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate stage in the pipeline with different extraction engines.

Most feature extractors require text to have been extracted by a "textEngine" or "text" object earlier in the pipeline, unless the data comes from the file extractor (which automatically fills in a document's "fullText" field).

For a list of supported text extractors, see Automated text extraction.

For example, Alchemy API can perform both text extraction, using the Alchemy API, and feature extraction, using the Alchemy metadata API.

The following table describes the parameters of the feature extraction configuration.

Field	Description
criteria	A javascript expression that is passed the document as _doc - if it returns false then this pipeline element is bypassed
engineName	The name of the extraction engine to use (can be fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration)
engineConfig	The configuration object to be passed to the engine
entityFilter	(regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
assocFilter	(regex applied to new-line separated association indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
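
As a rough illustration of how these fields fit together, a featureEngine element with an exclusion filter might look like the sketch below. The engine name is spelled as in the full source example on this page. The regex itself, and the assumption that entity indexes take a lower-case "name/type" form with types such as "position" and "url", are illustrative only and are not taken from this page.

```json
{
    "display": "Hypothetical example: OpenCalais with an entity exclusion filter",
    "featureEngine": {
        "engineName": "OpenCalais",
        "entityFilter": "-.*/(position|url)$"
    }
}
```

The leading "-" marks the regex as an exclusion, so under the stated assumption any entity whose index ends in "/position" or "/url" would be dropped. Without the prefix the same regex would act as an include-only filter.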

IKANOW supports the following feature extraction engines:

Textrank*
OpenCalais*
AlchemyAPI-metadata**
salience*

*requires a text extractor beforehand.

**includes its own built-in text extractor.

Examples

This section describes the configuration details for the supported extractors, and provides examples where applicable.

OpenCalais

The following custom configuration parameter is available for OpenCalais and can be set using the engineConfig parameter.

Parameter	Description	Data Type
store_raw_events	If enabled, a metadata field called "OpenCalaisEvents" is tagged to the document containing the raw JSON for events. This can be used to analyze new event definitions so they can be incorporated into the global OpenCalais configuration. It can also be used as a workaround via the structured analysis harvester where this is not possible.	True or false (false by default)
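
A minimal sketch of a featureEngine element that turns this option on is shown below. This page does not state whether the key needs an engine-specific prefix or whether the value should be a JSON boolean or a string, so the unprefixed key and the string value are assumptions.

```json
{
    "display": "Hypothetical example: keep the raw OpenCalais events as metadata",
    "featureEngine": {
        "engineName": "OpenCalais",
        "engineConfig": {
            "store_raw_events": "true"
        }
    }
}
```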

Examples

The following example source uses Alchemy API as the text engine, and OpenCalais as the feature engine. In both cases, the default configuration of these engines is used to output entities and associations for the ingested RSS data.

{
    "description": "Article on Medical Issues",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "key": "http.www.mayoclinic.com.rss.blog.xml",
    "mediaType": "News",
    "modified": "Oct 19, 2010 11:31:59 AM",
    "tags": [
        "topic:healthcare",
        "industry:healthcare",
        "mayo clinic",
        "health"
    ],
    "title": "MayoClinic: General Topics",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.mayoclinic.com/rss/blog.xml"
                    }
                ]
            }
        },
        {
            "textEngine": {
                "engineName": "AlchemyAPI"
            }
        },
        {
            "featureEngine": {
                "engineName": "OpenCalais"
            }
        }
    ]
}

Salience

The following custom configuration parameters are possible for Salience and can be set using the engineConfig parameter.

Parameter	Description	Note
`data_path`	Specifies the path from which Salience loads its data. See examples below.	Salience 5.1.6867: twitter data should use "data_path": "twitter_data". Salience 5.1.1.7298: a different parameter (short_form_content) is used instead to optimize for short-form messages.
`license_path`	Specifies the path to the salience license. See examples below.
`short_form_content`	If "true" (default "false") then optimizes for short form content such as twitter.
`generate_categories`	If "true" (default: "false") then tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concepts and query topics).
`generate_entities`	If "true" (the default) then tries to extract named entities (people, places, organizations, dates, etc) from the text.
`generate_keywords`	If "true" (the default) then generates keywords (ie words or phrases in the document that are central to the meaning of the document). Note that Infinit.e keywords correspond to "themes" in Salience documentation.
`kw_score_threshold`	If set then keywords with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
`generate_keyword_associations`	If "true" (note string not boolean; defaults to "false") then generates associations from entities and topics to keywords - this is off by default because it tends to generate quite a lot of low value associations.
`query_topic_file`	Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
`concept_topic_file`	Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
`concept_topic_explain`	If "true" (default: "false") then creates associations linking concept topics to the keywords that generated them. This can be used for better understanding which words should be used inside the concept definitions.
`topics_to_tags`	If "true" (the default) then topics (eg "Education", "Technology") are appended to the document tags. Note that the Salience documentation refers to topics as either "concepts" or "tags" depending on how they are generated.
`topics_to_entities`	If "true" (default: "false") then topics (eg "Education", "Technology") are appended to the document as entities with type "Topic", dimension "What".
`geolocate_entities`	If "true" (the default) then tries to geo-locate any "Place" entities extracted by Salience. NOTE: this functionality is not currently very accurate. If false positives are worse than false negatives then set this to "false".
`topic_score_threshold`	If set then topics with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
`evidence_threshold`	If set then entity sentiments generated on the basis of less evidence than this threshold (between "0" and "10") are discarded. This generates fewer sentiments but of higher quality.
`doc_summary_size`	The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as is).
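
Several of these parameters can be combined in one engineConfig object. The sketch below is illustrative only: the "salience." key prefix follows the "salience.data_path" format shown in the examples on this page, the engine name is taken from the list of supported engines, and the particular values (a 0.5 keyword threshold, summarization disabled) are arbitrary choices rather than recommendations. All values are written as strings, in line with the "string not boolean" note above.

```json
{
    "display": "Hypothetical example: Salience tuned for short-form content",
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.short_form_content": "true",
            "salience.generate_keyword_associations": "false",
            "salience.kw_score_threshold": "0.5",
            "salience.doc_summary_size": "0"
        }
    }
}
```

Because Salience requires a text extractor beforehand, an element like this would sit after a "textEngine" or "text" element in the processing pipeline.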

Examples

Setting the data path:

The data path for salience should be set using the following format:

"<BASEDIR>/data"

where BASEDIR is the value of the environment variable "lxainstall", which is normally set to "/opt/lexalytics/salience-5.x/".

Salience configuration values do not need to be prefixed by "app.".

Example format:

"salience.data_path"

The data path can be modified when running the harvest engine in a non-standard configuration, eg when running locally during development.

Setting the data path for language packs:

You can use the data_path parameter to switch between different language packs.

To set the language pack for Spanish, use:

"/opt/lexalytics/salience-5.x/spanish"

If the parameter doesn't start with a "/" then it is assumed to be relative to BASEDIR, eg "spanish" is sufficient in the preceding example.
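
Put together, switching to the Spanish language pack could look like the following sketch. The key name assumes the "salience." prefix format shown earlier, and the relative value relies on the BASEDIR rule just described.

```json
{
    "display": "Hypothetical example: Salience with the Spanish language pack",
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.data_path": "spanish"
        }
    }
}
```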

Setting the license_path:

You can use the license_path parameter to specify the license for salience.

For example:

"<BASEDIR>/license.v5"

This should only need to be changed when running the harvest engine locally. Paths that don't start with "/" are assumed to be relative to BASEDIR.
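
For a local development run, both paths might be set explicitly, as in the sketch below. The absolute paths simply expand BASEDIR to its usual value of "/opt/lexalytics/salience-5.x/" and would need adjusting to the actual install location. The "salience." key prefix is assumed from the "salience.data_path" format shown earlier.

```json
{
    "display": "Hypothetical example: explicit Salience paths for a local run",
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.data_path": "/opt/lexalytics/salience-5.x/data",
            "salience.license_path": "/opt/lexalytics/salience-5.x/license.v5"
        }
    }
}
```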

Legacy documentation:

Replaces "useExtractor" in the Source object
(Note "criteria" above is not currently supported - coming soon!)

TODO

Legacy documentation:

Enrichment engines

TODO