...

IKANOW supports the following feature extraction engines:

  • Textrank* - TODO description from legacy page
  • OpenCalais* - TODO description from legacy page
  • AlchemyAPI** - TODO description from legacy page
  • AlchemyAPI-metadata** - TODO description from legacy page
  • salience* - TODO description from legacy page
  • regex* - a mechanism for using regexes to extract entities from text or metadata

*requires a text extractor beforehand.

...

Code Block
{
	"featureEngine": {
		"engineName": "salience",
		"engineConfig": {
			"salience.shortFormContent": "true",
			"salience.kw_score_threshold": "0.5"
		}
	}
}

...

Standard feature extractors

This section describes the configuration details for the supported extractors, and provides examples where applicable.

Regex

TODO IN PROGRESS

Unlike many other extractors, the regex extractor does nothing by default - the configuration defines its functionality, and can vary from simple to more sophisticated. This section describes different cases:

Step 1: Define regexes that create entities, running against the document's fullText
  • The keys are either the entity type, or the type followed by "/" and the dimension (e.g. "Person" or "Person/Who"). If the dimension is not specified, the system tries to infer it, defaulting to "Who".
  • The value is the regex, in one of the following formats:
    • "/<regex-pattern>/<optional-flags>" - extracts the entire matched pattern as an entity
    • "s/<regex-pattern>/<replacement-string>/<flags>" - extracts the replacement string (with $1, $2, etc. representing the capturing groups)

Example 1:

Code Block
{
	"featureEngine": {
		"engineName": "regex",
		"engineConfig": {
			"Sha256Hash": "/[0-9a-fA-F]{64}/",
			"ExternalIp/Who": "s/(?:^|[^0-9a-z])([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+)($|[^0-9a-z])/hash:$1/i"
		}
	}
}
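
The two value formats used in Example 1 can be illustrated with a short Python sketch. This is not the IKANOW implementation: the function names (parse_spec, extract) are hypothetical, the rule of splitting on the last "/" characters assumes that the replacement string and the flags contain no "/", and the flag letters are assumed to follow Perl conventions.

```python
import re

# Assumed Perl-style flag letters; the flags the real engine accepts may differ.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def parse_spec(spec):
    """Split a value string into (compiled regex, replacement or None)."""
    substitute = spec.startswith("s/")
    body = spec[1:] if substitute else spec
    if not body.startswith("/") or body.count("/") < (3 if substitute else 2):
        raise ValueError(f"not a recognised regex spec: {spec!r}")
    # Split from the right so that "/" inside the pattern is tolerated.
    body, _, flag_chars = body[1:].rpartition("/")
    replacement = None
    if substitute:
        body, _, replacement = body.rpartition("/")
    flags = 0
    for ch in flag_chars:
        flags |= FLAG_MAP[ch]
    return re.compile(body, flags), replacement


def extract(spec, text):
    """Return the entity strings that `spec` would produce from `text`."""
    pattern, replacement = parse_spec(spec)
    if replacement is None:
        # "/pattern/flags": the whole match becomes the entity.
        return [m.group(0) for m in pattern.finditer(text)]
    # "s/pattern/replacement/flags": rewrite $1, $2, ... as \g<1>, \g<2>, ...
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return [m.expand(template) for m in pattern.finditer(text)]


ip_spec = r"s/(?:^|[^0-9a-z])([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)($|[^0-9a-z])/hash:$1/i"
print(extract(ip_spec, "connection from 10.1.2.3 refused"))  # ['hash:10.1.2.3']
print(extract("/[0-9a-fA-F]{64}/", "digest " + "ab" * 32) == ["ab" * 32])  # True
```

The regex strings here are shown as they look after JSON decoding, so each "\." is a single backslash followed by a dot.
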
Step 2: Define the default fields over which to search

By default, only the doc.fullText field is searched. The special key "$" defines the fields to be searched. It can have the following formats:

  • a comma-separated list of document fields, e.g. "fullText,description,title"
    • (the field list can also include "field variables" - this is discussed below under Step 4)
  • a single regex in the format "/<regex-pattern>/<optional-flags>" - this regex is applied to the name of each field in the document, and only matching fields are scanned.

Examples 2 and 3:

Code Block
{
	"featureEngine": {
		"engineName": "regex",
		"engineConfig": {
			"$": "fullText,description,title",
			"StreetAddress/Where": "/[0-9]+ [a-z_-]+ (?:Road|Street|Avenue)/i"
		}
	}
}
//alternative - will search fullText, description, and any metadata field with "address" in the dot-notation path
{
	"featureEngine": {
		"engineName": "regex",
		"engineConfig": {
			"$": "/(?:fullText|description|metadata\\..*\\.address.*)/",
			"StreetAddress/Where": "/[0-9]+ *,? *[a-z_-]+ *(?:Road|Street|Avenue)/i"
		}
	}
}
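
The two forms of the "$" key in Examples 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the document is modelled as a nested dict, flatten and select_fields are hypothetical helper names, and the regex form is assumed to have to match the whole dot-notation path of a field.

```python
import re


def flatten(doc, prefix=""):
    """Yield (dot-notation path, value) pairs for every leaf of a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def select_fields(selector, doc):
    """Return the {path: value} subset of `doc` chosen by a "$" selector."""
    fields = dict(flatten(doc))
    if selector.startswith("/"):
        # Regex form: "/<regex-pattern>/<optional-flags>"
        pattern_text, _, flag_chars = selector[1:].rpartition("/")
        flags = re.IGNORECASE if "i" in flag_chars else 0
        pattern = re.compile(pattern_text, flags)
        # Assumption: the regex must match the entire path.
        return {p: v for p, v in fields.items() if pattern.fullmatch(p)}
    # List form: comma-separated field names.
    wanted = [name.strip() for name in selector.split(",")]
    return {p: fields[p] for p in wanted if p in fields}


# Hypothetical document, for illustration only.
doc = {
    "fullText": "Meet at 12 High Street",
    "description": "A meeting",
    "title": "Meeting",
    "metadata": {"contact": {"address": "4 Mill Road"}, "author": "jdoe"},
}

print(list(select_fields("fullText,description,title", doc)))
# ['fullText', 'description', 'title']
print(list(select_fields(r"/(?:fullText|description|metadata\..*\.address.*)/", doc)))
# ['fullText', 'description', 'metadata.contact.address']
```
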
Step 3: Specify different regexes for different fields

It is possible to restrict individual regexes to run on a subset of fields. This is done by using one of the following key formats:

  • "<field-list>/<entity-type>/<dimension>" - comma-separated list of fields to search (can also include "field variables" - see below)
  • "<field-list>/<entity-type>" (same as above but with automatically inferred dimension)
    • EXAMPLE: "url,sourceUrl/FileType/What" - searches only the documents' "url" and "sourceUrl" fields
  • "/<regex-pattern>/<flags>/<entity-type>/<dimension>" - specifies a regex to apply to all field names; only matching fields are scanned
  • "/<regex-pattern>/<flags>/<entity-type>" (same as above but with automatically inferred dimension)
    • EXAMPLE: "/[^.]*url$|metadata\..*filename.*/i/FileName/What" - scans only top-level document fields ending with "url" (url and sourceUrl), or metadata fields containing the string "filename" in the dot-notation path

The value has the same format as in Step 1 (the regex to apply to the selected fields).

Example 4:

Code Block
{
	"featureEngine": {
		"engineName": "regex",
		"engineConfig": {
			"url,sourceUrl/FileType/What": "s/\\.([a-z]{3})$/$1/i",
			"/[^.]*url$|metadata\\..*filename.*/i/FileName/What": "/[^/]+\\.[a-z]{3}$/i"
		}
	}
}
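
The key formats from Steps 1 and 3 can be told apart mechanically, as the following Python sketch shows. It is not the engine's actual parser: KeySpec and parse_key are hypothetical names, and the sketch assumes that a trailing segment counts as a dimension only if it is one of a fixed set of names (only "Who", "What" and "Where" appear on this page; "When" is a guess).

```python
from typing import List, NamedTuple, Optional

# Assumed set of dimension names; "When" does not appear on this page.
DIMENSIONS = {"Who", "What", "Where", "When"}


class KeySpec(NamedTuple):
    fields: Optional[List[str]]  # explicit field list, if any
    field_regex: Optional[str]   # regex over field paths, if any
    flags: str                   # flags for field_regex
    entity_type: str
    dimension: Optional[str]     # None means "left for the system to infer"


def parse_key(key):
    """Parse an engineConfig key into its field selector, type and dimension."""
    dimension = None
    head, _, last = key.rpartition("/")
    if head and last in DIMENSIONS:
        # Strip a trailing "/<dimension>" segment.
        dimension, key = last, head
    if key.startswith("/"):
        # "/<regex-pattern>/<flags>/<entity-type>"
        rest, _, entity_type = key[1:].rpartition("/")
        field_regex, _, flags = rest.rpartition("/")
        return KeySpec(None, field_regex, flags, entity_type, dimension)
    # "<field-list>/<entity-type>" or just "<entity-type>"
    field_list, _, entity_type = key.rpartition("/")
    fields = field_list.split(",") if field_list else None
    return KeySpec(fields, None, "", entity_type, dimension)


for example in (
    "Person",
    "Person/Who",
    "url,sourceUrl/FileType/What",
    r"/[^.]*url$|metadata\..*filename.*/i/FileName/What",
):
    print(parse_key(example))
```

One consequence of this assumption is that an entity type named after a dimension (for example a type called "What") would be misread, so a real parser may well use a different rule.
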
Step 4: Build up complex sets of fields using "field variables"

XXX

OpenCalais

The following custom configuration parameters are available for OpenCalais and can be set using the engineConfig parameter.

...

Parameter | Description | Data Type
store_raw_events

Possible values:

True or false

False by default.

If enabled, a metadata field called "OpenCalaisEvents" is added to the document, containing the raw JSON for events. This can be used to analyze new event definitions so they can be incorporated into the global OpenCalais configuration. It can also be used as a workaround via the structured analysis harvester where this is not possible.

 

Examples

The following example source uses AlchemyAPI as the text engine, and OpenCalais as the feature engine. In both cases, the default configuration of these engines is used to output entities and associations for the ingested RSS data.

...

Parameter | Description | Note | Data Type
data_path

Specifies the path where salience should ingest data from.

See examples below.

Salience 5.1.6867:

When running Salience 5.1.6867, Twitter data should use "data_path": "twitter_data".

Salience 5.1.1.7298:

When running Salience 5.1.1.7298, a different parameter (short_form_content) is used to optimize for short-form messages.

 
license_path

Specifies the path to the salience license.

See examples below.

  
short_form_content

If "true" (default "false") then optimizes for short form content such as twitter.

generate_categories

If "true" (default: "false") then tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concepts and query topics).

generate_entities

If "true" (the default) then tries to extract named entities (people, places, organizations, dates, etc) from the text.
  
generate_keywords

If "true" (the default) then generates keywords (ie words or phrases in the document that are central to the meaning of the document).

Info

Note that Infinit.e keywords correspond to "themes" in Salience documentation.

  
kw_score_threshold

If set then keywords with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.

 

  
generate_keyword_associations

If "true" (note string not boolean; defaults to "false") then generates associations from entities and topics to keywords - this is off by default because it tends to generate quite a lot of low value associations.

query_topic_file

Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.

concept_topic_file

Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.

concept_topic_explain

If "true" (default: "false") then creates associations linking concept topics to the keywords that generated them. This can be used for better understanding which words should be used inside the concept definitions.
topics_to_tags

If "true" (the default) then topics (e.g. "Education", "Technology") are appended to the document tags.

Info

Note that the Salience documentation refers to topics as either "concepts" or "tags" depending on how they are generated.


  
topics_to_entities

If "true" (default: "false") then topics (e.g. "Education", "Technology") are appended to the document as entities with type "Topic", dimension "What".
geolocate_entities

If "true" (default) then will try to geo-locate any "Place" entities extracted by Salience.

Info

NOTE: this functionality is not currently very accurate. If false positives are worse than false negatives then set this to "false".

 

 

  
topic_score_threshold

If set then topics with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off.

evidence_threshold

If set then entity sentiments generated on the basis of less evidence than this threshold (between "0" and "10") are discarded. This generates fewer sentiments but of higher quality.

doc_summary_size

The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as is).
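
The effect of kw_score_threshold and topic_score_threshold can be pictured with a small Python sketch. It only illustrates the filtering rule described above (items scoring below the threshold are discarded) and is not how Salience or the platform implements it. The function name apply_threshold and the sample scores are hypothetical.

```python
def apply_threshold(scored_items, threshold):
    """Drop (name, score) pairs whose score is below `threshold`.

    `threshold` is the config value as a string (e.g. "0.5"), or None if unset.
    """
    if threshold is None:
        return list(scored_items)
    cutoff = float(threshold)
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [(name, score) for name, score in scored_items if score >= cutoff]


# Hypothetical keyword scores, for illustration only.
keywords = [("solar panel", 0.82), ("energy", 0.55), ("last week", 0.12)]
print(apply_threshold(keywords, "0.5"))  # [('solar panel', 0.82), ('energy', 0.55)]
print(len(apply_threshold(keywords, None)))  # 3
```

Raising the threshold keeps fewer, higher-scoring items (better precision), while lowering it keeps more items at the cost of more noise (better recall).
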

 

Examples

Setting the data path:

...