Feature extraction
Format
{ "display": string, "featureEngine": { "criteria":string,// A javascript expression that is passed the document as _doc - if returns false then this pipeline element is bypassed "enginename":string,// The name of the text engine to use (can be fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration) "engineConfig":{"config_param_name",string,...},// The configuration object to be passed to the engine "entityFilter":string,// (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only) "assocFilter":string,// (regex applied to new-line separated association indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only), "exitOnError": boolean // if true (default) true then errors during featureExtraction will cause the doc to be removed from the pipeline. If false, the processing will continue. } }
Description
Feature extraction uses text obtained from the text extraction stage to generate entities, associations, and potentially metadata. Text extraction is a separate stage in the pipeline with different extraction engines.
Most feature extractors require text to have been extracted by a "textEngine" or "text" element earlier in the pipeline, unless the data comes from a file extractor (which automatically fills in the document's "fullText" field).
For a list of supported text extractors, see Automated text extraction.
The following table describes the parameters of the feature extraction configuration.
Field | Description |
---|---|
criteria | A javascript expression that is passed the document as _doc - if returns false then this pipeline element is bypassed |
engineName | The name of the feature extraction engine to use (can be fully qualified, eg "com.ikanow.infinit.e.harvest.boilerpipe", or just the name, eg "boilerpipe", if the engine is registered in the Infinit.e system configuration) |
engineConfig | The configuration object to be passed to the engine |
entityFilter | (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only) |
assocFilter | (regex applied to new-line separated association indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only) |
exitOnError | If true (the default), errors during feature extraction cause the doc to be removed from the pipeline. If false, processing continues. |
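Putting the table together, here is a minimal sketch of a featureEngine element. The criteria expression and the entityFilter value are illustrative only (the filter shown would exclude any entity whose index ends in "/quantity"); they are not required settings:

```
{
    "featureEngine": {
        "criteria": "_doc.fullText != null",
        "engineName": "opencalais",
        "entityFilter": "-.*/quantity",
        "exitOnError": false
    }
}
```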
engineConfig
The "engineConfig" configuration object is a set of key/value pairs of strings that depends on the extractor type, "pre-integrated" configurations are described below, eg:
{ "featureEngine": { "engineName": "salience", "engineConfig": { "salience.shortFormContent": "true", "salience.kw_score_threshold": "0.5" } } }
Supported Feature Extraction Engines
IKANOW supports the following feature extraction engines:
- Textrank* ("textrank")
- OpenCalais* ("opencalais")
- AlchemyAPI** ("alchemyapi")
- AlchemyAPI-metadata** ("alchemyapi-metadata")
- salience* ("salience")
- regex* - a mechanism for converting regexes into entities from text or metadata ("regex")
*requires a text extractor beforehand.
**includes its own built-in text extractor, though can run behind an alternative text extractor also.
Textrank
A "behind-the-firewall" OSS solution that uses TextRank (with some OpenNLP-based pre-processing) to extract key phrases from the text.
Open Calais
Extracts entities and associations using the free OpenCalais service. No sentiment analysis function is available at this time.
OpenCalais will truncate text that is larger than 99KB.
The following custom configuration parameters are possible for Open Calais and can be set using the engineConfig parameter. The parameters should all be prefixed by "app.opencalais."
Parameter | Description | Data Type |
---|---|---|
store_raw_events | Possible values: "true" or "false". Default: "false". If enabled, a metadata field called "OpenCalaisEvents" is tagged to the document containing the raw JSON for events. This can be used to analyze new event definitions so they can be incorporated into the global OpenCalais configuration. It can also be used as a workaround via the structured analysis harvester where this is not possible. | string, boolean |
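For example, a sketch of enabling the raw events field (the only documented OpenCalais parameter), using the "app.opencalais." prefix:

```
{
    "featureEngine": {
        "engineName": "opencalais",
        "engineConfig": {
            "app.opencalais.store_raw_events": "true"
        }
    }
}
```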
Alchemy API
Uses the Named Entity Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). AlchemyAPI has the ability to extract associations (as well as much more), but this feature has not yet been integrated into the tool.
There are two Alchemy services that can be called:
- Alchemy API
- Alchemy API-metadata*
*includes many of the same features of Alchemy API but also allows more advanced batching of documents and keyword control.
AlchemyAPI will truncate text longer than 145KB.
Both of these services can support both text extraction and feature extraction. However, if you only need to perform text extraction, Alchemy API should be used.
You can use engineConfig to pass configuration parameters to the Alchemy API service as follows:
Parameter | Description |
---|---|
postproc | Possible values: "1", "2", "3". Default: "3". "1" does some post-processing of geographic entities (AlchemyAPI tends to prefer US results even when the context clearly indicates a non-US location); "2" does some post-processing of person entities (AlchemyAPI tends to prefer famous people even when the context does not support that); "3" does both. |
sentiment | Possible values: "true" or "false". Default: "true". If enabled, a sentiment metric is attached to each extracted entity. Note that this results in use of an extra AlchemyAPI credit per document. |
concepts | Possible values: "true" or "false". Default: "false". If enabled, a metadata field called "concepts" is tagged to the document containing Wiki titles that are related to the contents of the document. Note that this results in use of an extra AlchemyAPI credit per document. |
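As a sketch, these parameters might be set as follows. The "app.alchemyapi." prefix is an assumption based on the prefix convention used by the other engines ("app.opencalais.", "app.alchemyapi-metadata."); the values are illustrative:

```
{
    "featureEngine": {
        "engineName": "alchemyapi",
        "engineConfig": {
            "app.alchemyapi.postproc": "3",
            "app.alchemyapi.sentiment": "true",
            "app.alchemyapi.concepts": "false"
        }
    }
}
```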
AlchemyAPI-metadata
Uses the Keyword Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). This service also tags AlchemyAPI Concepts to documents as metadata.
You can use the engineConfig parameter to pass the parameters of the Alchemy API metadata configuration, as described below.
Parameter | Description | Data Type |
---|---|---|
sentiment | Possible values: "true" or "false". Default: "false". If enabled, a sentiment metric is attached to each extracted entity. Note that this results in use of an extra AlchemyAPI credit per document. | string, boolean |
concepts | Possible values: "true" or "false". Default: "true". If enabled, a metadata field called "concepts" is tagged to the document containing Wiki titles that are related to the contents of the document. Note that this results in use of an extra AlchemyAPI credit per document. | string, boolean |
batchSize | A string containing an integer; turned off by default. If turned on, the AlchemyAPI call goes out on a batch of documents (the specified number). This makes processing of small documents like tweets more economical (in return for a reduction in accuracy, eg the sentiment is calculated over the batch, not each individual tweet). | string, integer |
numKeywords | A string containing an integer; uses the AlchemyAPI default (currently 50) if not specified. If specified, controls the number of keywords returned. If batching is enabled then the requested number is multiplied by the batch size. | string, integer |
strict | Possible values: "true" or "false". Default: "false". If enabled, fewer but higher quality keywords are extracted. | string, boolean |
Salience
Salience* is an embedded solution requiring no internet connectivity. It generates reliable topics, keywords, entities, and associations from text. Named entities and topics are customizable by users.
*Enterprise only
The following custom configuration parameters are possible for Salience and can be set using the engineConfig parameter. The parameters should all be prefixed by "salience." (with no "app." prefix, unlike the other engines).
Parameter | Description | Note | Data Type |
---|---|---|---|
data_path | Specifies the path where salience should ingest data from. See examples below. | Salience 5.1.6867: When running Salience 5.1.6867, twitter data should use "data_path": "twitter_data". Salience 5.1.1.7298: When running Salience 5.1.1.7298, a different parameter (short_form_content) is used to optimize for short form messages. | |
license_path | Specifies the path to the salience license. See examples below. | ||
short_form_content | If "true" (default "false") then optimizes for short form content such as twitter. | ||
generate_categories | If "true" (default: "false") then tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concepts and query topics). | ||
decompose_categories | If "true" (default "false"), and "generate_categories" is also "true", then will generate more granular sub-topics | ||
generate_entities | If "true" (the default) then tries to extract named entities (people, places, organizations, dates, etc) from the text. | ||
generate_keywords | If "true" (the default) then generates keywords (ie words or phrases in the document that are central to the meaning of the document). Note that Infinit.e keywords correspond to "themes" in Salience documentation. | ||
kw_score_threshold | If set then keywords with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off. | ||
generate_keyword_associations | If "true" (note string not boolean; defaults to "false") then generates associations from entities and topics to keywords - this is off by default because it tends to generate quite a lot of low value associations. | ||
query_topic_file | Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories. | ||
concept_topic_file | Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories. | ||
concept_topic_explain | If "true" (default: "false") then creates associations linking concept topics to the keywords that generated them. This can be used for better understanding which words should be used inside the concept definitions. | ||
topics_to_tags | If "true" (the default) then topics eg "Education", "Technology") are appended to the document tags. Note that the Salience documentation refers to topics as both "concepts" or "tags" depending on how they are generated. | ||
topics_to_entities | If "true" (default: "false") then topics eg "Education", "Technology") are appended to the document as entities with type "Topic", dimension "What". | ||
geolocate_entities | If "true" (default) then will try to geo-locate any "Place" entities extracted by Salience. NOTE: this functionality is not currently very accurate. If false positives are worse than true negatives then set this to "false".
| ||
topic_score_threshold | If set then topics with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off. | ||
evidence_threshold | If set then entity sentiments generated on the basis of less evidence than this threshold (between "0" and "10") are discarded. This generates fewer sentiments but of higher quality. | ||
doc_summary_size | The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as is). |
Setting the data path:
The data path for Salience should be set using the following format:
"<BASEDIR>/data"
where BASEDIR is taken from the environment variable "lxainstall", which will normally be set to "/opt/lexalytics/salience-5.x/".
Salience configuration values do not need to be prefixed by "app.", eg:
"salience.data_path"
The data path can be modified when running the harvest engine in a non-standard configuration, eg running locally during development.
Setting the data path for language packs:
You can use the data_path parameter to switch between different language packs.
For example, to set the language pack for Spanish, use:
"/opt/lexalytics/salience-5.x/spanish"
If the parameter doesn't start with a "/" then it is assumed to be relative to BASEDIR, eg "spanish" is sufficient in the preceding example.
Setting the license_path:
You can use the license_path parameter to specify the license for Salience, for example:
"<BASEDIR>/license.v5"
This should only need to be changed when running the harvest engine locally. Paths that don't start with "/" are assumed to be relative to BASEDIR.
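Combining the settings above, a sketch of a Salience configuration (the paths shown are illustrative and depend on where Salience is installed):

```
{
    "featureEngine": {
        "engineName": "salience",
        "engineConfig": {
            "salience.data_path": "/opt/lexalytics/salience-5.x/data",
            "salience.license_path": "/opt/lexalytics/salience-5.x/license.v5",
            "salience.short_form_content": "true",
            "salience.kw_score_threshold": "0.5"
        }
    }
}
```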
date_extractor
The Date Extractor* is designed to find the published date posted in a document's full text and set it as the document's published date.
*Enterprise Only
The following custom configuration parameters are possible for the Date Extractor and can be set using the engineConfig parameter. The parameters should all be prefixed by "date."
Parameter | Description | Note | Data Type |
---|---|---|---|
check_url | If set to true, the URL will be parsed for a published date. ex: http://newsdomain.com/article/05/06/2015/Article-Title. | Default Value: true Processing Order**: 2 | boolean |
check_meta_tags | If set to true, html meta tag content with names related to dates will be parsed. ex: <meta name="pubdate" content="04/16/2007" /> | Default Value: true Processing Order**: 3 | boolean |
check_fulltext | If set to true, will attempt to parse from the document fulltext. | Default Value: true Processing Order**: 5 | boolean |
strip_html_from_fulltext | If set to true (and check_fulltext == true), document fulltext in html format will be stripped so that only the text is searched. Note: This removes a-href values to avoid pulling dates from hyperlinks. | Default Value: true Processing Order**: 5 | boolean |
check_html_after_strip | If set to true (and strip_html_from_fulltext == true) this will search the raw html if a date was not found in the stripped content. | Default Value: true Processing Order**: 6 | boolean |
future_block | If set to true, parsed dates that take place in the future will not be considered. | Default Value: true | boolean |
prio_day_before_month | If set to true, dates that begin with the 'day' field, eg dd/mm/yyyy, will be prioritized. For example, if set to true, 11/12/2014 would be parsed as December 11th, 2014. If set to false, the same date would be parsed as November 12th, 2014. | Default Value: false | boolean |
metadata_key | The document metadata key in which the extractor should look for a date first. If null, this step is skipped. Note: The key specified should have a string value containing the date at the zero index of the array. This will not crawl. | Default Value: NULL Processing Order**: 1 | String |
proximity_words | Comma-delimited keywords in a document that are known to precede document published dates. Currently, the 100 characters following the keyword will be considered. Note: To disable this, a value of " " should be set, else the default keywords will be used. | Default Value: "posted,updated,edited" Processing Order**: 4 | String |
debug | Adds debug prints to the console | Default Value: false | boolean |
tometa | If true, adds metadata value of "DATE_EXTRACTION_NOTES" to the document to explain where the date was parsed from. | Default Value: false | boolean |
**Processing Order refers to the order in which the values are checked (if enabled). Logic with processing order 1 is checked first if it is enabled, otherwise 2 is checked, and so on. Once a value is found, the search stops and that value becomes the document published date.
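A sketch of a Date Extractor configuration using the parameters above (the values shown are illustrative, and "postedDate" is a hypothetical metadata key, not a standard field):

```
{
    "featureEngine": {
        "engineName": "date_extractor",
        "engineConfig": {
            "date.metadata_key": "postedDate",
            "date.check_url": "true",
            "date.future_block": "true",
            "date.prio_day_before_month": "false",
            "date.tometa": "true"
        }
    }
}
```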
Regex
Unlike many other extractors, the regex extractor does nothing by default - the configuration defines its functionality, and can vary from simple to more sophisticated.
When a regex engineConfig is specified, you are not simply sending the fields of the configuration to an external service. Instead, the configuration itself specifies regexes that create entities based on a document's fullText, or other fields.
The following use cases are supported:
- By default, run a regex against the document fullText to create an entity with a specified entity type/dimension.
- Use "$" to define the default fields over which the regexes run, and then specify entities using entity type/dimension.
- Specify different regexes for different fields, and for each generated entity specify the entity type/dimension.
- Build up complex sets of fields using "field variables", and specify multiple regexes to act on specific variables.
For more information, see detailed examples below.
Examples
This section provides detailed examples for the supported feature extractors.
Alchemy API
Using Alchemy API As A Text Extractor
In the example below, Alchemy API is only used as a text extractor. As such, most of the configuration parameters are not applicable and the default settings can be taken. In this specific example, featureEngine uses OpenCalais.
Source Configuration:
{ "description": "Article on Medical Issues", "harvestBadSource": false, "isApproved": true, "isPublic": true, "key": "http.www.mayoclinic.com.rss.blog.xml", "mediaType": "News", "modified": "Oct 19, 2010 11:31:59 AM", "tags": [ "topic:healthcare", "industry:healthcare", "mayo clinic", "health" ], "title": "MayoClinic: General Topics", "processingPipeline": [ { "feed": { "extraUrls": [ { "url": "http://www.mayoclinic.com/rss/blog.xml" } ] } }, { "textEngine": { "engineName": "AlchemyAPI" } }, { "featureEngine": { "engineName": "OpenCalais" } } ] }
Output:
The output contains the "description" and entities resulting from the textEngine and featureEngine settings.
{ "_id" : "4e1c8afa7d56bb818ed10f76", "created" : "1310493434159", "description" : "Clarify the role of carbohydrates in the Dr. Bernstein diet and find a healthy eating plan that works for you.", "entities" : [ { "actual_name" : "certified diabetes", "dimension" : "What", "disambiguous_name" : "certified diabetes", "doccount" : NumberLong(38), "frequency" : 3, "gazateer_index" : "certified diabetes/medicalcondition", "relevance" : "0.711", "totalfrequency" : NumberLong(114), "type" : "MedicalCondition" }, { "actual_name" : "Diabetes Unit", "dimension" : "Who", "disambiguous_name" : "Diabetes Unit", "doccount" : NumberLong(38), "frequency" : 1, "gazateer_index" : "diabetes unit/organization", "relevance" : "0.235", "totalfrequency" : NumberLong(38), "type" : "Organization" }, { "actual_name" : "Mayo Clinic", "dimension" : "What", "disambiguous_name" : "Mayo Clinic", "doccount" : NumberLong(514), "frequency" : 2, "gazateer_index" : "mayo clinic/facility", "relevance" : "0.305", "totalfrequency" : NumberLong(1033), "type" : "Facility" },
Using Alchemy API-metadata for Feature Extraction
In this example, Alchemy API metadata is used for feature extraction. It is configured to act on a batch of documents (100) and to return a maximum of 5 keywords per document. The strict setting will return higher quality keywords, and fewer keywords overall.
Source Configuration:
The source configuration shows how Alchemy API Metadata parameters can be used to set batch sizing and keywords settings. In addition, the beginning of the entities block is included to show how automatic feature extraction and manual entities can be combined to achieve highly customizable results.
}, { "featureEngine": { "engineName": "AlchemyAPI-metadata", "engineConfig": { "app.alchemyapi-metadata.batchSize": 100, "app.alchemyapi-metadata.numKeywords": 5, "app.alchemyapi-metadata.strict": "true" } } }, { "entities": [ { "actual_name": "$metadata.json.actor.displayName", "dimension": "Who", "disambiguated_name": "$metadata.json.actor.preferredUsername", "linkdata": "$metadata.json.actor.link", "type": "TwitterHandle" },
Output:
The output reveals the results of featureEngine and entities. The entities are returned indexed by keyword.
}, { "actual_name": "Amex Teams", "dimension": "What", "disambiguated_name": "Amex Teams", "doccount": -1, "frequency": 1, "index": "amex teams/keyword", "relevance": 0.758636, "sentiment": 0.160753, "totalfrequency": -1, "type": "Keyword" }, { "actual_name": "Halo", "dimension": "What", "disambiguated_name": "Halo", "doccount": -1, "frequency": 1, "index": "halo/keyword", "relevance": 0.461833, "sentiment": 0.168822, "totalfrequency": -1, "type": "Keyword" }, { "actual_name": "Master Chief Incentives", "dimension": "What", "disambiguated_name": "Master Chief Incentives", "doccount": -1, "frequency": 1, "index": "master chief incentives/keyword", "relevance": 0.981457, "sentiment": 0.168876, "totalfrequency": -1, "type": "Keyword" },
Regex
Unlike many other extractors, the regex extractor does nothing by default - the configuration defines its functionality, and can vary from simple to more sophisticated. This section describes different cases:
Define regexes that create entities, running against the document's fullText
- The keys are either the type, or the type-then-"/"-then-the dimension (eg "Person" or "Person/Who"). If the dimension is not specified then the system tries to guess, defaulting to "Who"
- The value is the regex, in one of the following formats
- "/<regex-pattern>/<optional-flags>" - will extract the entire pattern matched as an entity
- "s/<regex-pattern>/<replacement-string/<flags>" - will extract the replacement string (with $1, $2, etc representing the capturing groups)"
Example 1:
{ "featureEngine": { "engineName": "regex", "engineConfig": { "Sha256Hash": "/[0-9a-fA-F]{64}/", "Who/ExternalIp": "s/(?:^|[^0-9a-z])([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)($|[^0-9a-z])/hash:$1/i" } } }
Define the default fields over which to search
By default, only the doc.fullText field is searched. The special key "$" defines the fields to be configured. It can have the following formats:
- comma-separated list of document fields, eg "fullText,description.title"
- (the field list can also include "field variables" - this is discussed below)
- a single regex in the format "/<regex-pattern>/<optional-flags>" - this regex is applied to each field name in the document and only matching fields are scanned.
Example 2 and 3:
{ "featureEngine": { "engineName": "regex", "engineConfig": { "$": "fullText,description,title", "Where/StreetAddress": "/[0-9]+ [a-z_-]+ (?:Road|Street|Avenue)/i" } } } //alternative - will search fullText, description, and any metadata field with "address" in the dot-notation path { "featureEngine": { "engineName": "regex", "engineConfig": { "$": "/(?:fullText|description|metadata\\..*\\.address.*)/", "Where/StreetAddress": "/[0-9]+ *,? *[a-z_-]+ *(?:Road|Street|Avenue)/i" } } }
Specify different regexes for different fields
It is possible to restrict individual regexes to run on a subset of fields. This is performed with the following "key":
- "<field-list>/<entity-type>/<dimension>" - comma-separated list of fields to search (can also include "field variables" - see below)
- "<field-list>/<entity-type>" (same as above but with automatically inferred dimension)
- EXAMPLE: "url,sourceUrl/FileType/What" - searches only the documents' "url" and "sourceUrl" fields
- "/<regex-pattern/<flags>/<entity-type>/<dimension>" - specifies a regex to apply to all fields, only matching ones are scanned
- "/<regex-pattern/<flags>/<entity-type>" (same as above but with automatically inferred dimension)
- EXAMPLE: "/[^.]*url$|metadata\..*filename.*/i/FileName/What" - will only scan top level document fields ending with url (url and sourceUrl), or metadata fields containing the string "fieldname" in the dot notation path
The value field is the same (the regex to apply to the specified stream).
Example 4:
{ "featureEngine": { "engineName": "regex", "engineConfig": { "url,sourceUrl/What/FileType": "s/\\.([a-z]{3})$/$1/i", "/[^.]*url$|metadata\\..*filename.*/i/What/FileName": "s/[^\\/]+\\.[a-z]{3}$/i" } } }
Build up complex sets of fields using "field variables"
Finally, it is possible to build up more complex sets of field lists in increments using "field variables". In these cases the keys are in the format "$<saved-field-name>", eg "$regexList", "$fieldList" and the value is either the field list (which can itself include earlier "field variables") or a regex that is matched across fields, ie in the same way as the "$" default.
Example 5:
In this example, the default is a small number of fields. The "Sha256Hash" entity type only looks over those. The "FileType" regex will run across a much larger set of fields.
{ "featureEngine": { "engineName": "regex", "engineConfig": { "$docFields": "fullText,description,title", "$extendedDocFields": "$docFields,url,sourceUrl", "$metaFields": "/metadata\\..*content.*/i", "$moreFields":"$metaFields,$extendedDocFields", "$": "$docFields", "Sha256Hash": "/[0-9a-fA-F]{64}/" "$moreFields/What/FileType": "s/\.([a-z]{3})$/$1/i", } } }