Overview
This toolkit element passes the document's full text to an external (or embedded) extraction engine, which returns entities and associations (and optionally metadata).
...
Code Block |
---|
{
    "display": string,
    "featureEngine": {
        "criteria": string, // A javascript expression that is passed the document as _doc - if it returns false then this pipeline element is bypassed
        "engineName": string, // The name of the engine to use - either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe") or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration
        "engineConfig": { "config_param_name": string, ... }, // The configuration object to be passed to the engine
        "entityFilter": string, // Regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion (defaults to include-only)
        "assocFilter": string // Regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion (defaults to include-only)
    }
} |
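The entityFilter and assocFilter fields are not exemplified elsewhere on this page. A hypothetical illustration (the engine name, criteria expression, and filtered index suffixes below are assumptions for illustration, not taken from this page) might look like:

```json
{
    "featureEngine": {
        "engineName": "OpenCalais",
        "criteria": "_doc.mediaType == 'News'",
        "entityFilter": "-/(quotation|industryterm)$"
    }
}
```

Here the leading "-" makes the filter an exclusion: entities whose indexes match the regex are discarded, and all others are kept.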
Description
Feature extraction uses the text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate pipeline stage with its own set of extraction engines.
Warning |
---|
Most feature extractors require text to have been extracted by a "textEngine" or "text" element earlier in the pipeline, unless the data comes from the file extractor (which automatically fills in a document's "fullText" field). For a list of supported text extractors, see Automated text extraction. |
For example, Alchemy can perform both text extraction, using the Alchemy API, and feature extraction, using the Alchemy metadata API.
The following table describes the parameters of the feature extraction configuration.
...
Code Block |
---|
{
"featureEngine": {
"engineName": "salience",
"engineConfig": {
"salience.shortFormContent": "true",
"salience.kw_score_threshold": "0.5"
}
}
} |
Standard feature extractors
...
Examples
This section describes the configuration details for the supported extractors, and provides examples where applicable.
Alchemy API
There are two Alchemy services that can be called:
- Alchemy API
- Alchemy API-metadata*
*Includes many of the same features as Alchemy API, but also allows more advanced batching of documents and keyword control.
Both of these services support text extraction and feature extraction. However, if you only need to perform text extraction, Alchemy API should be used.
Alchemy API Configuration
You can use engineConfig to pass the parameters of the Alchemy API configuration, as described below.
Parameter | Description
---|---
postproc | Possible values: "1", "2", "3". Default value is "3". "1" does some post-processing of geographic entities (AlchemyAPI tends to prefer US results even when the context clearly indicates a non-US location); "2" does some post-processing of person entities (AlchemyAPI tends to prefer famous people even when the context does not support that); "3" does both.
sentiment | Possible values: "true" or "false". Default value is "true". If enabled, a sentiment metric is attached to each extracted entity.
concepts | Possible values: "true" or "false". Default value is "false". If enabled, a metadata field called "concepts" is tagged to the document, containing Wiki titles that are related to the contents of the document.
Example 1: Using Alchemy API As A Text Extractor
In the example below, Alchemy API is used only as a text extractor. As such, most of the configuration parameters are not applicable and the default settings can be used. In this specific example, featureEngine uses OpenCalais.
Source Configuration:
Code Block |
---|
{
"description": "Article on Medical Issues",
"harvestBadSource": false,
"isApproved": true,
"isPublic": true,
"key": "http.www.mayoclinic.com.rss.blog.xml",
"mediaType": "News",
"modified": "Oct 19, 2010 11:31:59 AM",
"tags": [
"topic:healthcare",
"industry:healthcare",
"mayo clinic",
"health"
],
"title": "MayoClinic: General Topics",
"processingPipeline": [
{
"feed": {
"extraUrls": [
{
"url": "http://www.mayoclinic.com/rss/blog.xml"
}
]
}
},
{
"textEngine": {
"engineName": "AlchemyAPI"
}
},
{
"featureEngine": {
"engineName": "OpenCalais"
}
}
]
}
|
Output:
The output contains the "description" and the entities resulting from the textEngine and featureEngine settings.
Code Block |
---|
{
"_id" : "4e1c8afa7d56bb818ed10f76",
"created" : "1310493434159",
"description" : "Clarify the role of carbohydrates in the Dr. Bernstein diet and find a
healthy eating plan that works for you.",
"entities" : [
{
"actual_name" : "certified diabetes",
"dimension" : "What",
"disambiguous_name" : "certified diabetes",
"doccount" : NumberLong(38),
"frequency" : 3,
"gazateer_index" : "certified diabetes/medicalcondition",
"relevance" : "0.711",
"totalfrequency" : NumberLong(114),
"type" : "MedicalCondition"
},
{
"actual_name" : "Diabetes Unit",
"dimension" : "Who",
"disambiguous_name" : "Diabetes Unit",
"doccount" : NumberLong(38),
"frequency" : 1,
"gazateer_index" : "diabetes unit/organization",
"relevance" : "0.235",
"totalfrequency" : NumberLong(38),
"type" : "Organization"
},
{
"actual_name" : "Mayo Clinic",
"dimension" : "What",
"disambiguous_name" : "Mayo Clinic",
"doccount" : NumberLong(514),
"frequency" : 2,
"gazateer_index" : "mayo clinic/facility",
"relevance" : "0.305",
"totalfrequency" : NumberLong(1033),
"type" : "Facility"
}, |
Alchemy API metadata
You can use engineConfig to pass the parameters of the Alchemy API metadata configuration, as described below.
Parameter | Description | Data Type
---|---|---
sentiment | Possible values: "true" or "false". Default value is "false". If enabled, a sentiment metric is attached to each extracted entity. | string
concepts | Possible values: "true" or "false". Default value is "true". If enabled, a metadata field called "concepts" is tagged to the document, containing Wiki titles that are related to the contents of the document. | string
batchSize | A string containing an integer; batching is off by default. If enabled, the AlchemyAPI call goes out on a batch of documents (the specified number). This makes processing small documents such as tweets more economical, in return for a reduction in accuracy (eg the sentiment is calculated over the batch, not over each individual tweet). | string (integer)
numKeywords | A string containing an integer; uses the AlchemyAPI default (currently 50) if not specified. If specified, controls the number of keywords returned. If batching is enabled, the requested number is multiplied by the batch size. | string (integer)
strict | Possible values: "true" or "false". Default value is "false". If enabled, fewer but higher-quality keywords are extracted. | string
Example 2: Using Alchemy API-metadata for Feature Extraction
In this example, Alchemy API metadata is used for feature extraction. It is configured to act on batches of 100 documents and to return a maximum of 5 keywords per document. The strict setting returns fewer keywords overall, but of higher quality.
Source Configuration:
The source configuration shows how Alchemy API metadata parameters can be used to configure batch sizing and keyword settings. In addition, the beginning of the entities block is included to show how automatic feature extraction and manual entities can be combined to achieve highly customizable results.
Code Block |
---|
},
{
"featureEngine": {
"engineName": "AlchemyAPI-metadata",
"engineConfig": {
"app.alchemyapi-metadata.batchSize": "100",
"app.alchemyapi-metadata.numKeywords": "5",
"app.alchemyapi-metadata.strict": "true"
}
}
},
{
"entities": [
{
"actual_name": "$metadata.json.actor.displayName",
"dimension": "Who",
"disambiguated_name": "$metadata.json.actor.preferredUsername",
"linkdata": "$metadata.json.actor.link",
"type": "TwitterHandle"
}, |
Output:
The output reveals the results of the featureEngine and entities settings. The entities are returned indexed by keyword.
Code Block |
---|
},
{
"actual_name": "Amex Teams",
"dimension": "What",
"disambiguated_name": "Amex Teams",
"doccount": -1,
"frequency": 1,
"index": "amex teams/keyword",
"relevance": 0.758636,
"sentiment": 0.160753,
"totalfrequency": -1,
"type": "Keyword"
},
{
"actual_name": "Halo",
"dimension": "What",
"disambiguated_name": "Halo",
"doccount": -1,
"frequency": 1,
"index": "halo/keyword",
"relevance": 0.461833,
"sentiment": 0.168822,
"totalfrequency": -1,
"type": "Keyword"
},
{
"actual_name": "Master Chief Incentives",
"dimension": "What",
"disambiguated_name": "Master Chief Incentives",
"doccount": -1,
"frequency": 1,
"index": "master chief incentives/keyword",
"relevance": 0.981457,
"sentiment": 0.168876,
"totalfrequency": -1,
"type": "Keyword"
}, |
...
Regex
TODO IN PROGRESS
Unlike many other extractors, the regex extractor does nothing by default - its configuration defines its functionality, which can vary from simple to sophisticated. This section describes the different cases:
Step 1: Define regexes that create entities, running against the document's fullText
- The keys are either the type, or the type followed by "/" and the dimension (eg "Person" or "Person/Who"). If the dimension is not specified, the system tries to guess, defaulting to "Who"
- The value is the regex, in one of the following formats:
- "/<regex-pattern>/<optional-flags>" - extracts the entire matched pattern as an entity
- "s/<regex-pattern>/<replacement-string>/<flags>" - extracts the replacement string (with $1, $2, etc representing the capturing groups)
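The two value formats above can be illustrated with a Python sketch. This is not the engine's actual implementation, just a model of the described behavior; it assumes patterns contain no unescaped "/" characters, and it translates the engine's "$1" back-reference syntax to Python's "\1":

```python
import re

# Supported flag characters (a simplifying assumption for this sketch)
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE}

def _flags(flag_str):
    result = 0
    for ch in flag_str:
        result |= FLAG_MAP.get(ch, 0)
    return result

def apply_regex_value(value, text):
    if value.startswith("s/"):
        # "s/<pattern>/<replacement>/<flags>": emit the substituted string
        pattern, replacement, flag_str = value[2:].rsplit("/", 2)
        replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)  # $1 -> \1
        return [re.sub(pattern, replacement, m.group(0), flags=_flags(flag_str))
                for m in re.finditer(pattern, text, _flags(flag_str))]
    # "/<pattern>/<flags>": emit the whole matched text as the entity
    pattern, flag_str = value[1:].rsplit("/", 1)
    return [m.group(0) for m in re.finditer(pattern, text, _flags(flag_str))]

print(apply_regex_value(r"s/(\w+)@example\.com/user:$1/i", "mail Bob@Example.com now"))
# -> ['user:Bob']
```

Note that the actual engine is part of the harvester and its exact regex flavor and flag support may differ.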
Example 1:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "Sha256Hash": "/[0-9a-fA-F]{64}/",
            "ExternalIp/Who": "s/(?:^|[^0-9a-z])([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)($|[^0-9a-z])/hash:$1/i"
        }
    }
} |
Step 2: Define the default fields over which to search
By default, only the doc.fullText field is searched. The special key "$" defines the default fields to be searched. Its value can have the following formats:
- comma-separated list of document fields, eg "fullText,description.title"
- (the field list can also include "field variables" - this is discussed below under Step 4)
- a single regex in the format "/<regex-pattern>/<optional-flags>" - this regex is applied to each field name in the document and only matching fields are scanned.
Examples 2 and 3:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$": "fullText,description,title",
            "StreetAddress/Where": "/[0-9]+ [a-z_-]+ (?:Road|Street|Avenue)/i"
        }
    }
}

// alternative - will search fullText, description, and any metadata field with "address" in the dot-notation path
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$": "/(?:fullText|description|metadata\..*\.address.*)/",
            "StreetAddress/Where": "/[0-9]+ *,? *[a-z_-]+ *(?:Road|Street|Avenue)/i"
        }
    }
} |
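The field-name regex form of "$" can be modeled in Python: flatten the document to dot-notation paths, then keep only the fields whose paths match. This is an illustrative sketch (the field names in the sample document are hypothetical), not the engine's implementation:

```python
import re

def flatten(doc, prefix=""):
    """Flatten a nested document into dot-notation field paths."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

def select_fields(doc, field_regex):
    """Keep only the fields whose dot-notation path matches the regex."""
    pattern = re.compile(field_regex)
    return {p: v for p, v in flatten(doc).items() if pattern.search(p)}

doc = {
    "fullText": "10 Main Street",
    "url": "http://example.com",
    "metadata": {"json": {"address_line": "22 Oak Road"}},
}
print(select_fields(doc, r"fullText|description|metadata\..*\.address.*"))
# -> {'fullText': '10 Main Street', 'metadata.json.address_line': '22 Oak Road'}
```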
Step 3: Specify different regexes for different fields
It is possible to restrict individual regexes to a subset of fields. This is done with keys of the following format:
...
The value field is the same (the regex to apply to the specified stream).
Example 4:
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "url,sourceUrl/FileType/What": "s/\.([a-z]{3})$/$1/i",
            "/[^.]*url$|metadata\..*filename.*/i/FileName/What": "s/[^/]+\.[a-z]{3}$/i"
        }
    }
} |
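The keys in Example 4 combine a field selector with the entity type and dimension. A hypothetical parser (an assumption about how such keys could be decomposed, not the engine's actual code) shows why splitting on the last two "/" separators works even when the field selector is itself a "/.../flags" regex containing slashes:

```python
def parse_key(key):
    """Split "<field-list-or-regex>/<EntityType>/<Dimension>" into its parts.

    rsplit on the last two "/" separators leaves any "/" characters inside
    the field selector (eg a "/.../i" field-name regex) untouched.
    """
    fields, entity_type, dimension = key.rsplit("/", 2)
    return {"fields": fields, "type": entity_type, "dimension": dimension}

print(parse_key("url,sourceUrl/FileType/What"))
# -> {'fields': 'url,sourceUrl', 'type': 'FileType', 'dimension': 'What'}
```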
Step 4: Build up complex sets of fields using "field variables"
Finally, it is possible to build up more complex field lists incrementally using "field variables". In these cases the keys are in the format "$<saved-field-name>" (eg "$regexList", "$fieldList"), and the value is either a field list (which can itself include earlier "field variables") or a regex that is matched against field names, ie in the same way as the "$" default.
Example 5:
In this example, the default is a small set of fields, and the "Sha256Hash" entity type only scans those. The "FileType" regex runs across a much larger set of fields.
Code Block |
---|
{
    "featureEngine": {
        "engineName": "regex",
        "engineConfig": {
            "$docFields": "fullText,description,title",
            "$extendedDocFields": "$docFields,url,sourceUrl",
            "$metaFields": "/metadata\..*content.*/i",
            "$moreFields": "$metaFields,$extendedDocFields",
            "$": "$docFields",
            "Sha256Hash": "/[0-9a-fA-F]{64}/",
            "$moreFields/FileType/What": "s/\.([a-z]{3})$/$1/i"
        }
    }
} |
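A hypothetical resolver sketches how list-valued "field variables" could expand into a flat field list. This is a model of the described behavior, not the engine's code; regex-valued variables (like "$metaFields" above) would additionally need the field-name matching from Step 2:

```python
def resolve(name, config, seen=None):
    """Expand a "$variable" into a flat list of field names.

    Earlier variables referenced inside a value are expanded recursively;
    a guard detects circular references.
    """
    seen = seen or set()
    assert name not in seen, "circular field-variable reference"
    seen.add(name)
    out = []
    for part in config[name].split(","):
        part = part.strip()
        if part.startswith("$"):
            out.extend(resolve(part, config, seen))
        else:
            out.append(part)
    return out

config = {
    "$docFields": "fullText,description,title",
    "$extendedDocFields": "$docFields,url,sourceUrl",
}
print(resolve("$extendedDocFields", config))
# -> ['fullText', 'description', 'title', 'url', 'sourceUrl']
```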
...
OpenCalais
The following custom configuration parameters are possible for OpenCalais and can be set using the engineConfig parameter.
...
Parameter | Description | Data Type
---|---|---
store_raw_events | Possible values: "true" or "false". Default value is "false". If enabled, a metadata field called "OpenCalaisEvents" is tagged to the document, containing the raw JSON for events. This can be used to analyze new event definitions so that they can be incorporated into the global OpenCalais configuration, or as a workaround via the structured analysis harvester where that is not possible. | string
Examples
The following example source uses Alchemy API as the text engine, and OpenCalais as the feature engine. In both cases, the default configuration of these engines is used to output entities and associations for the ingested RSS data.
...
Parameter | Description
---|---
data_path | Specifies the path from which Salience should ingest data. See the examples below. Note - Salience 5.1.6867: twitter data should use "data_path": "twitter_data". Salience 5.1.1.7298: a different parameter (short_form_content) is used to optimize for short form messages.
license_path | Specifies the path to the Salience license. See the examples below.
short_form_content | If "true" (default: "false"), optimizes for short form content such as twitter.
generate_categories | If "true" (default: "false"), tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concept and query topics).
generate_entities | If "true" (the default), tries to extract named entities (people, places, organizations, dates, etc) from the text.
generate_keywords | If "true" (the default), generates keywords (ie words or phrases in the document that are central to its meaning).
kw_score_threshold | If set, keywords with a score lower than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
generate_keyword_associations | If "true" (note: string, not boolean; default: "false"), generates associations from entities and topics to keywords. This is off by default because it tends to generate a lot of low-value associations.
query_topic_file | Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
concept_topic_file | Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
concept_topic_explain | If "true" (default: "false"), creates associations linking concept topics to the keywords that generated them. This can be used to better understand which words should be used inside the concept definitions.
topics_to_tags | If "true" (the default), topics (eg "Education", "Technology") are appended to the document tags.
topics_to_entities | If "true" (default: "false"), topics (eg "Education", "Technology") are appended to the document as entities with type "Topic" and dimension "What".
geolocate_entities | If "true" (the default), tries to geo-locate any "Place" entities extracted by Salience.
topic_score_threshold | If set, topics with a score lower than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.
doc_summary_size | The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as-is).
Examples
Setting the data path:
...
This should only need to be changed when running the harvest engine locally. Paths that don't start with "/" are assumed to be relative to BASEDIR.
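Based on the parameter table above, a minimal sketch of a Salience engineConfig setting the data and license paths - the paths are illustrative placeholders, and the "salience." key prefix follows the convention from the earlier example on this page, so verify both against your installation:

```json
{
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.data_path": "/opt/salience/data",
            "salience.license_path": "/opt/salience/license.v5",
            "salience.short_form_content": "true"
        }
    }
}
```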
...
TextRank
TODO examples
Panel |
---|
Legacy documentation: |
...