Overview

This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and, optionally, metadata).

...

Code Block
{
	"display": string,
	"featureEngine": {
		"criteria": string, // A JavaScript expression that is passed the document as _doc; if it returns false then this pipeline element is bypassed
		"enginename": string, // The name of the engine to use: either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration
		"engineConfig": {"config_param_name": string, ...}, // The configuration object to be passed to the engine
		"entityFilter": string, // A regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
		"assocFilter": string, // A regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
		"exitOnError": boolean // If true (the default) then errors during feature extraction cause the doc to be removed from the pipeline; if false, processing continues
	}
}
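
The "+"/"-" prefix convention used by entityFilter and assocFilter can be illustrated with a short JavaScript sketch. This is not the platform's implementation: makeIndexFilter is a hypothetical helper, the sample index strings (of the form "name/type") are assumed for illustration, and the sketch assumes that a partial regex match is sufficient, which the schema above does not confirm.

```javascript
// Sketch only: one plausible reading of the "+"/"-" prefix convention.
// makeIndexFilter is a hypothetical helper, not part of Infinit.e.
function makeIndexFilter(filterSpec) {
  let include = true; // no prefix: treated as include-only
  let pattern = filterSpec;
  if (filterSpec.startsWith("+")) {
    pattern = filterSpec.slice(1); // explicit inclusion
  } else if (filterSpec.startsWith("-")) {
    include = false; // exclusion
    pattern = filterSpec.slice(1);
  }
  const regex = new RegExp(pattern);
  // Assumption: a partial match anywhere in the index counts as a match.
  return (index) => regex.test(index) === include;
}

// Hypothetical entity indexes, assumed to look like "name/type".
const indexes = ["barack obama/person", "economy/keyword", "paris/city"];

// Exclude every index ending in "/keyword"; keep everything else.
const dropKeywords = makeIndexFilter("-/keyword$");
console.log(indexes.filter(dropKeywords));
```

With a "-" prefix the filter keeps everything that does not match; with "+" or no prefix it keeps only what matches.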

Description

Feature extraction uses text obtained from the text extraction stage to generate entities, associations, and potentially metadata.  Text extraction is a separate stage in the pipeline with different extraction engines.

...

Parameter | Description | Note | Data Type
data_path

Specifies the path that Salience should ingest data from.

See examples below.

Salience 5.1.6867:

When running Salience 5.1.6867, Twitter data should use "data_path": "twitter_data".

Salience 5.1.1.7298:

When running Salience 5.1.1.7298, a different parameter (short_form_content) is used instead to optimize for short-form messages.

 
license_path

Specifies the path to the Salience license.

See examples below.

  
short_form_content

If "true" (default: "false") then optimizes for short-form content such as Twitter.

generate_categories

If "true" (default: "false") then tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concepts and query topics).

decompose_categories

If "true" (default: "false"), and "generate_categories" is also "true", then generates more granular sub-topics.

generate_entities

If "true" (the default) then tries to extract named entities (people, places, organizations, dates, etc.) from the text.
  
generate_keywords

If "true" (the default) then generates keywords (i.e. words or phrases in the document that are central to the meaning of the document).

Info

Note that Infinit.e keywords correspond to "themes" in Salience documentation.

  
kw_score_threshold

If set then keywords with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.

 

  
generate_keyword_associations

If "true" (note: a string, not a boolean; defaults to "false") then generates associations from entities and topics to keywords. This is off by default because it tends to generate a large number of low-value associations.

query_topic_file

Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.

concept_topic_file

Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.

concept_topic_explain

If "true" (default: "false") then creates associations linking concept topics to the keywords that generated them. This helps show which words should be used inside the concept definitions.
topics_to_tags

If "true" (the default) then topics (e.g. "Education", "Technology") are appended to the document tags.

Info

Note that the Salience documentation refers to topics as either "concepts" or "tags", depending on how they are generated.


  
topics_to_entities

If "true" (default: "false") then topics (e.g. "Education", "Technology") are appended to the document as entities with type "Topic", dimension "What".
geolocate_entities

If "true" (the default) then tries to geolocate any "Place" entities extracted by Salience.

Info

NOTE: this functionality is not currently very accurate. If false positives are worse than false negatives for your use case, set this to "false".

 

 

  
topic_score_threshold

If set then topics with a lower score than this threshold (between "0.0" and "1.0") are discarded. This allows a precision-recall (quality/quantity) trade-off.

evidence_threshold

If set then entity sentiments generated on the basis of less evidence than this threshold (between "0" and "10") are discarded. This generates fewer sentiments, but of higher quality.

doc_summary_size

The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as is).
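
As an illustration of how these parameters fit together, the JavaScript sketch below builds a hypothetical engineConfig object and shows the effect a score threshold is described as having. The chosen values are examples only, all values are written as strings on the assumption that the "string not boolean" note applies to every parameter, and applyScoreThreshold is a hypothetical helper that mimics the documented behaviour rather than the engine's actual code. data_path and license_path are left out because their values are installation-specific.

```javascript
// Hypothetical engineConfig for the Salience engine.
// Assumption: every value is passed as a string, e.g. "true" rather than true.
const engineConfig = {
  "short_form_content": "true",             // optimize for short messages
  "generate_keywords": "true",
  "kw_score_threshold": "0.5",              // discard keywords scoring below 0.5
  "generate_keyword_associations": "false",
  "topics_to_tags": "true",
  "geolocate_entities": "false",            // avoid geolocation false positives
  "doc_summary_size": "3"
};

// Hypothetical helper mimicking the documented threshold behaviour:
// items with a score lower than the threshold are discarded.
function applyScoreThreshold(items, thresholdString) {
  if (thresholdString === undefined) {
    return items; // threshold not set: keep everything
  }
  const threshold = parseFloat(thresholdString);
  return items.filter((item) => item.score >= threshold);
}

// Made-up keyword scores, for illustration only.
const keywords = [
  { term: "election", score: 0.9 },
  { term: "campaign", score: 0.5 },
  { term: "weather", score: 0.2 }
];

const kept = applyScoreThreshold(keywords, engineConfig["kw_score_threshold"]);
console.log(kept.map((k) => k.term));
```

Raising the threshold trades recall for precision: fewer keywords survive, but those that do scored higher.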

...