Enrichment engines

Infinit.e's entity extractors take harvested documents, ie URLs (RSS/HTML), text (files), or metadata objects (XML, databases), and add meaning in the form of entities and associations between entities.

Examples of the built-in entity extractors (JSON field "useExtractor") include:

  • "textrank" - A "behind-the-firewall" OSS solution that uses TextRank (with some OpenNLP-based pre-processing) to extract key phrases from the text.
  • "AlchemyAPI" - uses the Named Entity Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). AlchemyAPI has the ability to extract associations (as well as much more), but this feature has not yet been integrated into the tool.
  • "OpenCalais" - extracts entities and associations using the free OpenCalais service. No sentiment analysis function is available at this time.
  • "AlchemyAPI-metadata" - uses the Keyword Extraction and Sentiment Analysis functions of the commercial AlchemyAPI service (there is a free tier for AlchemyAPI but it is very restrictive). This service also tags AlchemyAPI Concepts to documents as metadata.
  • (Enterprise only) "salience" - an embedded solution requiring no internet connectivity. Generates reliable topics, keywords, entities, and associations from text. Named entities and topics are customizable by users.
  • Note that IKANOW has also integrated with behind-the-firewall entity extractors. These are typically commercial or GOTS products. Please contact us for more details.
  • We also build custom extractors and taggers based on OpenNLP; contact us for more details.

In addition to the above entity extractors, Infinit.e has three options for "text extractors", which convert URLs into text (eg advert removal, HTML tag cleansing etc).

In either of the above cases ("useExtractor" or "useTextExtractor"), the field can be set to "none".
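
As a minimal sketch of how these two fields might appear in a source object (all other source fields are omitted here, and the choice of "textrank" is only illustrative), the following fragment selects an entity extractor and turns text extraction off:

```json
{
	"useExtractor": "textrank",
	"useTextExtractor": "none"
}
```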

Per-source configuration for extractor engines

The "extractorOptions" field of the source JSON object allows for custom configuration of text and entity extractors.

The object has the following form:

{
	//"app.<EXTRACTOR_NAME>.<PARAMETER_NAME>": "<PARAMETER_VALUE>"
	//...
	//eg:
	"app.alchemyapi-metadata.sentiment": "true"
}
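
To show where this object sits, the sketch below places "extractorOptions" next to "useExtractor" in a source object. The rest of the source definition is omitted, and switching sentiment off is only an illustrative choice. Values are written as strings, following the format above.

```json
{
	"useExtractor": "AlchemyAPI-metadata",
	"extractorOptions": {
		"app.alchemyapi-metadata.sentiment": "false"
	}
}
```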

Currently the following configuration options are available:

  • alchemyapi:
    • postproc: "1", "2", or "3", "3" by default. "1" does some post-processing of geographic entities (AlchemyAPI tends to prefer US results even when the context clearly indicates a non-US location), "2" does some post-processing of person entities (AlchemyAPI tends to prefer famous people even when the context does not support that), "3" does both.
    • sentiment: "true"/"false", "true" by default. If enabled, a sentiment metric is attached to each extracted entity. Note that this results in use of an extra AlchemyAPI credit per document.
    • concepts: "true"/"false", "false" by default. If enabled, a metadata field called "concepts" is tagged to the document containing Wiki titles that are related to the contents of the document. Note that this results in use of an extra AlchemyAPI credit per document.
  • alchemyapi-metadata:
    • sentiment: "true"/"false", "true" by default. If enabled, a sentiment metric is attached to each extracted entity. Note that this results in use of an extra AlchemyAPI credit per document.
    • concepts: "true"/"false", "true" by default. If enabled, a metadata field called "concepts" is tagged to the document containing Wiki titles that are related to the contents of the document. Note that this results in use of an extra AlchemyAPI credit per document.
    • batchSize: a string containing an integer, turned off by default. If turned on, the AlchemyAPI call goes out on a batch of documents (the specified number). This makes processing of small documents like tweets more economical (in return for a reduction in accuracy, eg the sentiment is calculated over the batch not each individual tweet).
    • numKeywords: a string containing an integer, uses the AlchemyAPI default (currently 50) if not specified. If specified, controls the number of keywords returned. If batching is enabled then the requested number is multiplied by the batch size.
    • strict: "true"/"false", "false" by default. If enabled, fewer but higher-quality keywords are extracted.
  • opencalais:
    • store_raw_events: "true"/"false", "false" by default. If enabled, a metadata field called "OpenCalaisEvents" is tagged to the document containing the raw JSON for events. This can be used to analyze new event definitions so they can be incorporated into the global OpenCalais configuration. It can also be used as a workaround via the structured analysis harvester where this is not possible. 
  • textrank: currently none
  • boilerpipe: currently none
  • tika: currently none
  • salience (note that these options are not prefixed by "app.", unlike those of the other extractors, eg just "salience.data_path"):
    • data_path: Defaults to "<BASEDIR>/data" (where BASEDIR is the environment variable "lxainstall", which will normally be set to "/opt/lexalytics/salience-5.x/"). It can be modified when running the harvest engine in a non-standard configuration (eg locally during development),
      • ...but also, more importantly, to switch between different language packs, eg "/opt/lexalytics/salience-5.x/spanish". If the parameter doesn't start with a "/" then it is assumed to be relative to BASEDIR, eg "spanish" is sufficient in the preceding example.
        • When running Salience 5.1.6867, Twitter data should use "data_path": "twitter_data". When running Salience 5.1.1.7298, a different parameter (short_form_content) is used to optimize for short-form messages.
    • license_path: Defaults to "<BASEDIR>/license.v5". Should only need to be changed when running the harvest engine locally. As above, paths that don't start with "/" are assumed to be relative to BASEDIR.
    • short_form_content: If "true" (default "false") then optimizes for short-form content such as Twitter.
    • generate_categories: If "true" (default: "false") then tries to extract named category topics. It is currently not possible to specify a user file for this topic type (unlike concepts and query topics).
    • generate_entities: If "true" (the default) then tries to extract named entities (people, places, organizations, dates, etc) from the text.
    • generate_keywords: If "true" (the default) then generates keywords (ie words or phrases in the document that are central to the meaning of the document). Note that Infinit.e keywords correspond to "themes" in Salience documentation.
    • kw_score_threshold: If set then keywords with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off. 
    • generate_keyword_associations: If "true" (note string not boolean; defaults to "false") then generates associations from entities and topics to keywords - this is off by default because it tends to generate quite a lot of low value associations.
    • query_topic_file: Points to the file that defines query-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
    • concept_topic_file: Points to the file that defines concept-based topics. By default, uses high-level categories. Set to "disable" to disable categories.
    • concept_topic_explain: If "true" (default: "false") then creates associations linking concept topics to the keywords that generated them. This can be used for better understanding which words should be used inside the concept definitions.
    • topics_to_tags: If "true" (the default) then topics (eg "Education", "Technology") are appended to the document tags. Note that the Salience documentation refers to topics as either "concepts" or "tags" depending on how they are generated.
    • topics_to_entities: If "true" (default: "false") then topics (eg "Education", "Technology") are appended to the document as entities with type "Topic", dimension "What".
    • geolocate_entities: If "true" (default) then will try to geo-locate any "Place" entities extracted by Salience. 
      • NOTE: this functionality is not currently very accurate. If false positives are worse than false negatives then set this to "false".
    • topic_score_threshold: If set then topics with a lower score than this threshold (between "0.0" and "1.0") are discarded - this allows a precision-recall (quality/quantity) trade-off. 
    • evidence_threshold: If set then entity sentiments generated on the basis of less evidence than this threshold (between "0" and "10") are discarded. This generates fewer sentiments but of higher quality.
    • doc_summary_size: The number of sentences used to fill in the document description. Defaults to "3". Set to "0" to disable summarization (the description is left as is).
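
The options above can be combined in a single "extractorOptions" object. The two sketches below are illustrative only: the numeric values are arbitrary rather than recommended settings, and the rest of each source definition is omitted.

The first sketch batches short documents such as tweets through "AlchemyAPI-metadata". With a batch size of "20" and "5" keywords requested, the number asked for per batched call would be 5 × 20 = 100, following the multiplication rule described for numKeywords.

```json
{
	"useExtractor": "AlchemyAPI-metadata",
	"extractorOptions": {
		"app.alchemyapi-metadata.batchSize": "20",
		"app.alchemyapi-metadata.numKeywords": "5",
		"app.alchemyapi-metadata.concepts": "false"
	}
}
```

The second sketch configures "salience" with a relative language pack path, a keyword score threshold, topics added as entities, and summarization disabled. Note that the keys carry no "app." prefix and that every value is a string.

```json
{
	"useExtractor": "salience",
	"extractorOptions": {
		"salience.data_path": "spanish",
		"salience.kw_score_threshold": "0.5",
		"salience.topics_to_entities": "true",
		"salience.doc_summary_size": "0"
	}
}
```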