Overview
This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and optionally metadata).
...
Code Block |
---|
{
    "display": string,
    "featureEngine": {
        "criteria": string, // A javascript expression that is passed the document as _doc - if it returns false then this pipeline element is bypassed
        "engineName": string, // The name of the engine to use - either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe") or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration
        "engineConfig": { "config_param_name": string, ... }, // The configuration object to be passed to the engine
        "entityFilter": string, // Regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion (defaults to include-only)
        "assocFilter": string // Regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion (defaults to include-only)
    }
} |
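The entityFilter and assocFilter fields are not exemplified elsewhere on this page. A hypothetical illustration (the engine name, criteria expression, and filtered index suffixes below are assumptions for illustration, not taken from this page) might look like:

```json
{
    "featureEngine": {
        "engineName": "OpenCalais",
        "criteria": "_doc.mediaType == 'News'",
        "entityFilter": "-/(quotation|industryterm)$"
    }
}
```

Here the leading "-" makes the filter an exclusion: entities whose indexes match the regex are discarded, and all others are kept.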
Description
Feature extraction uses the text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate pipeline stage with its own set of extraction engines.
Warning |
---|
Most feature extractors require text to have been extracted by a "textEngine" or "text" element earlier in the pipeline, unless the data comes from the file extractor (which automatically fills in a document's "fullText" field). For a list of supported text extractors, see Automated text extraction. |
For example, Alchemy can perform both text extraction, using the Alchemy API, and feature extraction, using the Alchemy metadata API.
The following table describes the parameters of the feature extraction configuration.
...
Code Block |
---|
{
"featureEngine": {
"engineName": "salience",
"engineConfig": {
"salience.shortFormContent": "true",
"salience.kw_score_threshold": "0.5"
}
}
} |
Standard feature extractors
...
Examples
This section describes the configuration details for the supported extractors, and provides examples where applicable.
Alchemy API
There are two Alchemy services that can be called:
- Alchemy API
- Alchemy API-metadata*
*Includes many of the same features as Alchemy API, but also allows more advanced batching of documents and keyword control.
Both of these services support text extraction and feature extraction. However, if you only need to perform text extraction, Alchemy API should be used.
Alchemy API Configuration
You can use engineConfig to pass the parameters of the Alchemy API configuration, as described below.
Parameter | Description
---|---
postproc | Possible values: "1", "2", "3". Default value is "3". "1" does some post-processing of geographic entities (AlchemyAPI tends to prefer US results even when the context clearly indicates a non-US location); "2" does some post-processing of person entities (AlchemyAPI tends to prefer famous people even when the context does not support that); "3" does both.
sentiment | Possible values: "true" or "false". Default value is "true". If enabled, a sentiment metric is attached to each extracted entity.
concepts | Possible values: "true" or "false". Default value is "false". If enabled, a metadata field called "concepts" is tagged to the document, containing Wiki titles that are related to the contents of the document.
Example 1: Using Alchemy API As A Text Extractor
In the example below, Alchemy API is used only as a text extractor. As such, most of the configuration parameters are not applicable and the default settings can be used. In this specific example, featureEngine uses OpenCalais.
Source Configuration:
Code Block |
---|
{
"description": "Article on Medical Issues",
"harvestBadSource": false,
"isApproved": true,
"isPublic": true,
"key": "http.www.mayoclinic.com.rss.blog.xml",
"mediaType": "News",
"modified": "Oct 19, 2010 11:31:59 AM",
"tags": [
"topic:healthcare",
"industry:healthcare",
"mayo clinic",
"health"
],
"title": "MayoClinic: General Topics",
"processingPipeline": [
{
"feed": {
"extraUrls": [
{
"url": "http://www.mayoclinic.com/rss/blog.xml"
}
]
}
},
{
"textEngine": {
"engineName": "AlchemyAPI"
}
},
{
"featureEngine": {
"engineName": "OpenCalais"
}
}
]
}
|
Output:
The output contains the "description" and the entities resulting from the textEngine and featureEngine settings.
Code Block |
---|
{
"_id" : "4e1c8afa7d56bb818ed10f76",
"created" : "1310493434159",
"description" : "Clarify the role of carbohydrates in the Dr. Bernstein diet and find a
healthy eating plan that works for you.",
"entities" : [
{
"actual_name" : "certified diabetes",
"dimension" : "What",
"disambiguous_name" : "certified diabetes",
"doccount" : NumberLong(38),
"frequency" : 3,
"gazateer_index" : "certified diabetes/medicalcondition",
"relevance" : "0.711",
"totalfrequency" : NumberLong(114),
"type" : "MedicalCondition"
},
{
"actual_name" : "Diabetes Unit",
"dimension" : "Who",
"disambiguous_name" : "Diabetes Unit",
"doccount" : NumberLong(38),
"frequency" : 1,
"gazateer_index" : "diabetes unit/organization",
"relevance" : "0.235",
"totalfrequency" : NumberLong(38),
"type" : "Organization"
},
{
"actual_name" : "Mayo Clinic",
"dimension" : "What",
"disambiguous_name" : "Mayo Clinic",
"doccount" : NumberLong(514),
"frequency" : 2,
"gazateer_index" : "mayo clinic/facility",
"relevance" : "0.305",
"totalfrequency" : NumberLong(1033),
"type" : "Facility"
}, |
Alchemy API metadata
You can use engineConfig to pass the parameters of the Alchemy API metadata configuration, as described below.
Parameter | Description | Data Type
---|---|---
sentiment | Possible values: "true" or "false". Default value is "false". If enabled, a sentiment metric is attached to each extracted entity. | string
concepts | Possible values: "true" or "false". Default value is "true". If enabled, a metadata field called "concepts" is tagged to the document, containing Wiki titles that are related to the contents of the document. | string
batchSize | A string containing an integer; batching is off by default. If enabled, the AlchemyAPI call goes out on a batch of documents (the specified number). This makes processing small documents such as tweets more economical, in return for a reduction in accuracy (eg the sentiment is calculated over the batch, not over each individual tweet). | string (integer)
numKeywords | A string containing an integer; uses the AlchemyAPI default (currently 50) if not specified. If specified, controls the number of keywords returned. If batching is enabled, the requested number is multiplied by the batch size. | string (integer)
strict | Possible values: "true" or "false". Default value is "false". If enabled, fewer but higher-quality keywords are extracted. | string
Example 2: Using Alchemy API-metadata for Feature Extraction
In this example, Alchemy API metadata is used for feature extraction. It is configured to act on batches of 100 documents and to return a maximum of 5 keywords per document. The strict setting returns fewer keywords overall, but of higher quality.
Source Configuration:
The source configuration shows how Alchemy API metadata parameters can be used to configure batch sizing and keyword settings. In addition, the beginning of the entities block is included to show how automatic feature extraction and manual entities can be combined to achieve highly customizable results.
Code Block |
---|
},
{
"featureEngine": {
"engineName": "AlchemyAPI-metadata",
"engineConfig": {
"app.alchemyapi-metadata.batchSize": "100",
"app.alchemyapi-metadata.numKeywords": "5",
"app.alchemyapi-metadata.strict": "true"
}
}
},
{
"entities": [
{
"actual_name": "$metadata.json.actor.displayName",
"dimension": "Who",
"disambiguated_name": "$metadata.json.actor.preferredUsername",
"linkdata": "$metadata.json.actor.link",
"type": "TwitterHandle"
}, |
Output:
The output reveals the results of the featureEngine and entities settings. The entities are returned indexed by keyword.
Code Block |
---|
},
{
"actual_name": "Amex Teams",
"dimension": "What",
"disambiguated_name": "Amex Teams",
"doccount": -1,
"frequency": 1,
"index": "amex teams/keyword",
"relevance": 0.758636,
"sentiment": 0.160753,
"totalfrequency": -1,
"type": "Keyword"
},
{
"actual_name": "Halo",
"dimension": "What",
"disambiguated_name": "Halo",
"doccount": -1,
"frequency": 1,
"index": "halo/keyword",
"relevance": 0.461833,
"sentiment": 0.168822,
"totalfrequency": -1,
"type": "Keyword"
},
{
"actual_name": "Master Chief Incentives",
"dimension": "What",
"disambiguated_name": "Master Chief Incentives",
"doccount": -1,
"frequency": 1,
"index": "master chief incentives/keyword",
"relevance": 0.981457,
"sentiment": 0.168876,
"totalfrequency": -1,
"type": "Keyword"
}, |
...
Regex
TODO IN PROGRESS
Unlike many other extractors, the regex extractor does nothing by default - its configuration defines its functionality, which can vary from simple to sophisticated. This section describes the different cases:
Step 1: Define regexes that create entities, running against the document's fullText
- The keys are either the type, or the type followed by "/" and the dimension (eg "Person" or "Person/Who"). If the dimension is not specified, the system tries to guess, defaulting to "Who"
- The value is the regex, in one of the following formats:
- "/<regex-pattern>/<optional-flags>" - extracts the entire matched pattern as an entity
- "s/<regex-pattern>/<replacement-string>/<flags>" - extracts the replacement string (with $1, $2, etc representing the capturing groups)
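The two value formats above can be illustrated with a Python sketch. This is not the engine's actual implementation, just a model of the described behavior; it assumes patterns contain no unescaped "/" characters, and it translates the engine's "$1" back-reference syntax to Python's "\1":

```python
import re

# Supported flag characters (a simplifying assumption for this sketch)
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE}

def _flags(flag_str):
    result = 0
    for ch in flag_str:
        result |= FLAG_MAP.get(ch, 0)
    return result

def apply_regex_value(value, text):
    if value.startswith("s/"):
        # "s/<pattern>/<replacement>/<flags>": emit the substituted string
        pattern, replacement, flag_str = value[2:].rsplit("/", 2)
        replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)  # $1 -> \1
        return [re.sub(pattern, replacement, m.group(0), flags=_flags(flag_str))
                for m in re.finditer(pattern, text, _flags(flag_str))]
    # "/<pattern>/<flags>": emit the whole matched text as the entity
    pattern, flag_str = value[1:].rsplit("/", 1)
    return [m.group(0) for m in re.finditer(pattern, text, _flags(flag_str))]

print(apply_regex_value(r"s/(\w+)@example\.com/user:$1/i", "mail Bob@Example.com now"))
# -> ['user:Bob']
```

Note that the actual engine is part of the harvester and its exact regex flavor and flag support may differ.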
Example 1:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "Sha256Hash": "/[0-9a-fA-F]{64}/",
            "ExternalIp/Who": "s/(?:^|[^0-9a-z])([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)($|[^0-9a-z])/hash:$1/i"
        }
    }
} |
Step 2: Define the default fields over which to search
By default, only the doc.fullText field is searched. The special key "$" defines the default fields to be searched. Its value can have the following formats:
- comma-separated list of document fields, eg "fullText,description.title"
- (the field list can also include "field variables" - this is discussed below under Step 4)
- a single regex in the format "/<regex-pattern>/<optional-flags>" - this regex is applied to each field name in the document and only matching fields are scanned.
Examples 2 and 3:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$": "fullText,description,title",
            "StreetAddress/Where": "/[0-9]+ [a-z_-]+ (?:Road|Street|Avenue)/i"
        }
    }
}

// alternative - will search fullText, description, and any metadata field with "address" in the dot-notation path
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$": "/(?:fullText|description|metadata\..*\.address.*)/",
            "StreetAddress/Where": "/[0-9]+ *,? *[a-z_-]+ *(?:Road|Street|Avenue)/i"
        }
    }
} |
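The field-name regex form of "$" can be modeled in Python: flatten the document to dot-notation paths, then keep only the fields whose paths match. This is an illustrative sketch (the field names in the sample document are hypothetical), not the engine's implementation:

```python
import re

def flatten(doc, prefix=""):
    """Flatten a nested document into dot-notation field paths."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

def select_fields(doc, field_regex):
    """Keep only the fields whose dot-notation path matches the regex."""
    pattern = re.compile(field_regex)
    return {p: v for p, v in flatten(doc).items() if pattern.search(p)}

doc = {
    "fullText": "10 Main Street",
    "url": "http://example.com",
    "metadata": {"json": {"address_line": "22 Oak Road"}},
}
print(select_fields(doc, r"fullText|description|metadata\..*\.address.*"))
# -> {'fullText': '10 Main Street', 'metadata.json.address_line': '22 Oak Road'}
```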
Step 3: Specify different regexes for different fields
It is possible to restrict individual regexes to a subset of fields. This is done with keys of the following format:
...
The value field is the same (the regex to apply to the specified stream).
Example 4:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "url,sourceUrl/FileType/What": "s/\.([a-z]{3})$/$1/i",
            "/[^.]*url$|metadata\..*filename.*/i/FileName/What": "s/[^/]+\.[a-z]{3}$/i"
        }
    }
} |
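The keys in Example 4 combine a field selector with the entity type and dimension. A hypothetical parser (an assumption about how such keys could be decomposed, not the engine's actual code) shows why splitting on the last two "/" separators works even when the field selector is itself a "/.../flags" regex containing slashes:

```python
def parse_key(key):
    """Split "<field-list-or-regex>/<EntityType>/<Dimension>" into its parts.

    rsplit on the last two "/" separators leaves any "/" characters inside
    the field selector (eg a "/.../i" field-name regex) untouched.
    """
    fields, entity_type, dimension = key.rsplit("/", 2)
    return {"fields": fields, "type": entity_type, "dimension": dimension}

print(parse_key("url,sourceUrl/FileType/What"))
# -> {'fields': 'url,sourceUrl', 'type': 'FileType', 'dimension': 'What'}
```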
Step 4: Build up complex sets of fields using "field variables"
Finally, it is possible to build up more complex field lists incrementally using "field variables". In these cases the keys are in the format "$<saved-field-name>" (eg "$regexList", "$fieldList"), and the value is either a field list (which can itself include earlier "field variables") or a regex that is matched against field names, ie in the same way as the "$" default.
Example 5:
In this example, the default is a small set of fields, and the "Sha256Hash" entity type only scans those. The "FileType" regex runs across a much larger set of fields.
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$docFields": "fullText,description,title",
            "$extendedDocFields": "$docFields,url,sourceUrl",
            "$metaFields": "/metadata\..*content.*/i",
            "$moreFields": "$metaFields,$extendedDocFields",
            "$": "$docFields",
            "Sha256Hash": "/[0-9a-fA-F]{64}/",
            "$moreFields/FileType/What": "s/\.([a-z]{3})$/$1/i"
        }
    }
} |
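A hypothetical resolver sketches how list-valued "field variables" could expand into a flat field list. This is a model of the described behavior, not the engine's code; regex-valued variables (like "$metaFields" above) would additionally need the field-name matching from Step 2:

```python
def resolve(name, config, seen=None):
    """Expand a "$variable" into a flat list of field names.

    Earlier variables referenced inside a value are expanded recursively;
    a guard detects circular references.
    """
    seen = seen or set()
    assert name not in seen, "circular field-variable reference"
    seen.add(name)
    out = []
    for part in config[name].split(","):
        part = part.strip()
        if part.startswith("$"):
            out.extend(resolve(part, config, seen))
        else:
            out.append(part)
    return out

config = {
    "$docFields": "fullText,description,title",
    "$extendedDocFields": "$docFields,url,sourceUrl",
}
print(resolve("$extendedDocFields", config))
# -> ['fullText', 'description', 'title', 'url', 'sourceUrl']
```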
...
OpenCalais
The following custom configuration parameters are possible for OpenCalais and can be set using the engineConfig parameter.
...
Parameter | Description | Data Type
---|---|---
store_raw_events | Possible values: "true" or "false". Default value is "false". If enabled, a metadata field called "OpenCalaisEvents" is tagged to the document, containing the raw JSON for events. This can be used to analyze new event definitions so that they can be incorporated into the global OpenCalais configuration, or as a workaround via the structured analysis harvester where that is not possible. | string
Examples
The following example source uses Alchemy API as the text engine, and OpenCalais as the feature engine. In both cases, the default configuration of these engines is used to output entities and associations for the ingested RSS data.
...
Parameter | Description
---|---
data_path | Specifies the path from which Salience should ingest data. See the examples below. Note - Salience 5.1.6867: twitter data should use "data_path": "twitter_data". Salience 5.1.1.7298: a different parameter (short_form_content) is used to optimize for short form messages.
license_path | Specifies the path to the Salience license. See the examples below.
short_form_content | If "true" (default: "false"), optimizes for short form content such as twitter.
generate_categories | If "true" (default: "false"), tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concept and query topics).
generate_entities | If "true" (the default), tries to extract named entities (people, places, organizations, dates, etc) from the text.
generate_keywords | If "true" (the default), generates keywords (ie words or phrases in the document that are central to its meaning).
kw_score_threshold | If set, keywords with a score lower than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
generate_keyword_associations | If "true" (note: string, not boolean; default: "false"), generates associations from entities and topics to keywords. This is off by default because it tends to generate a lot of low-value associations.
query_topic_file | Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
concept_topic_file | Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
concept_topic_explain | If "true" (default: "false"), creates associations linking concept topics to the keywords that generated them. This can be used to better understand which words should be used inside the concept definitions.
topics_to_tags | If "true" (the default), topics (eg "Education", "Technology") are appended to the document tags.
topics_to_entities | If "true" (default: "false"), topics (eg "Education", "Technology") are appended to the document as entities with type "Topic" and dimension "What".
geolocate_entities | If "true" (the default), tries to geo-locate any "Place" entities extracted by Salience.
topic_score_threshold | If set, topics with a score lower than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
doc_summary_size | The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as-is).
Examples
Setting the data path:
...
This should only need to be changed when running the harvest engine locally. Paths that don't start with "/" are assumed to be relative to BASEDIR.
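Based on the parameter table above, a minimal sketch of a Salience engineConfig setting the data and license paths - the paths are illustrative placeholders, and the "salience." key prefix follows the convention from the earlier example on this page, so verify both against your installation:

```json
{
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.data_path": "/opt/salience/data",
            "salience.license_path": "/opt/salience/license.v5",
            "salience.short_form_content": "true"
        }
    }
}
```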
...
TextRank
TODO examples
Panel |
---|
Legacy documentation: |
...