Overview

This toolkit element passes the document text to an external (or embedded) extraction engine to return entities and associations (and optionally metadata).

Most feature extractors require the text to have been extracted by a "textEngine" or "text" element earlier in the pipeline, unless the data comes from a file (which automatically fills in the document's "fullText" field). AlchemyAPI is an exception for URLs because it can perform both steps. Other custom extractors may not require text, for example because they operate on existing metadata fields or entities.
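The ordering rule above can be sketched as a small pre-flight check. This is a hypothetical helper, not part of Infinit.e: the function name is invented, and the "file" element key is an assumption based on the description of file sources. Because AlchemyAPI (for URLs) and some custom extractors do not need prior text, a false result should be read as a warning rather than an error.

```javascript
// Hypothetical helper (not an Infinit.e API): returns true if every
// "featureEngine" element is preceded by an element that provides text.
function hasTextBeforeFeatures(pipeline) {
  let textAvailable = false;
  for (const element of pipeline) {
    // Assumption: "file" sources fill in fullText automatically, while
    // "textEngine" and "text" elements populate it explicitly.
    if ("file" in element || "textEngine" in element || "text" in element) {
      textAvailable = true;
    }
    if ("featureEngine" in element && !textAvailable) {
      return false;
    }
  }
  return true;
}

const pipeline = [
  { feed: { extraUrls: [{ url: "http://www.mayoclinic.com/rss/blog.xml" }] } },
  { textEngine: { engineName: "AlchemyAPI" } },
  { featureEngine: { engineName: "OpenCalais" } },
];
console.log(hasTextBeforeFeatures(pipeline));
```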

Format

{
	"display": string,
	"featureEngine": {
		"criteria": string, // A JavaScript expression that is passed the document as _doc; if it returns false, this pipeline element is bypassed
		"engineName": string, // The name of the feature engine to use: either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration
		"engineConfig": {"config_param_name": string, ...}, // The configuration object to be passed to the engine
		"entityFilter": string, // Regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
		"assocFilter": string // Regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only
	}
}
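As an illustration of the "criteria" field, the sketch below shows what such an expression might look like and how it could be evaluated against a document. The expression itself and the shouldRun function are hypothetical; the platform's actual evaluation is not shown in this page, and the sketch assumes that only a result of exactly false bypasses the element. (See also the legacy note at the end of this page regarding "criteria" support.)

```javascript
// Hypothetical criteria string: run the element only when the document has text.
const criteria = "_doc.fullText != null && _doc.fullText.length > 0";

// Local stand-in for the platform's evaluation (an assumption, not Infinit.e code).
// The expression receives the document under the name _doc.
function shouldRun(criteriaExpr, doc) {
  const evaluate = new Function("_doc", "return (" + criteriaExpr + ");");
  // Assumption: the element is bypassed only when the expression returns false.
  return evaluate(doc) !== false;
}

console.log(shouldRun(criteria, { fullText: "Some article text" })); // true
console.log(shouldRun(criteria, { title: "No text yet" }));          // false
```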

 

Description

Many of the automated text extraction tools can also create entities and associations.

For example, the following engines can both extract text and create entities and associations:

  • Alchemy API
  • boilerpipe
  • tika

For a description of supported engines, see Automated text extraction.

The following table describes the parameters of the feature extraction configuration.

Field | Description
criteria | A JavaScript expression that is passed the document as _doc; if it returns false, this pipeline element is bypassed.
engineName | The name of the feature engine to use: either fully qualified (eg "com.ikanow.infinit.e.harvest.boilerpipe"), or just the name (eg "boilerpipe") if the engine is registered in the Infinit.e system configuration.
engineConfig | The configuration object to be passed to the engine.
entityFilter | Regex applied to entity indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only.
assocFilter | Regex applied to new-line separated association indexes; starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only.
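The "+"/"-" prefix convention for entityFilter and assocFilter can be sketched as follows. This is one possible reading of the description, not the platform's implementation: parseFilter and keepIndex are hypothetical names, the "name/type" shape of the sample entity indexes is an assumption, and the sketch assumes the regex may match anywhere in the index rather than the whole string.

```javascript
// Hypothetical interpretation of a filter string (not Infinit.e code).
function parseFilter(filter) {
  let include = true; // no prefix: include-only, the documented default
  let pattern = filter;
  if (filter.startsWith("+")) {
    pattern = filter.slice(1);
  } else if (filter.startsWith("-")) {
    include = false;
    pattern = filter.slice(1);
  }
  return { include, regex: new RegExp(pattern) };
}

// Keep an index if it matches an inclusion filter, or fails to match an exclusion filter.
function keepIndex(filter, index) {
  const { include, regex } = parseFilter(filter);
  const matched = regex.test(index);
  return include ? matched : !matched;
}

// Sample indexes in an assumed "name/type" form.
const entityIndexes = ["mayo clinic/company", "rochester/city", "diabetes/medicalcondition"];
const withoutCities = entityIndexes.filter((i) => keepIndex("-/city$", i));
const onlyCompanies = entityIndexes.filter((i) => keepIndex("+/company$", i));
console.log(withoutCities);
console.log(onlyCompanies);
```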

Examples

Specifying the Feature Engine

The following example source uses Alchemy API as the text engine, and OpenCalais as the feature engine.  In both cases, the default configuration of these engines is used to output entities and associations for the ingested RSS data.

 

{
    "description": "Article on Medical Issues",
    "harvestBadSource": false,
    "isApproved": true,
    "isPublic": true,
    "key": "http.www.mayoclinic.com.rss.blog.xml",
    "mediaType": "News",
    "modified": "Oct 19, 2010 11:31:59 AM",
    "tags": [
        "topic:healthcare",
        "industry:healthcare",
        "mayo clinic",
        "health"
    ],
    "title": "MayoClinic: General Topics",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.mayoclinic.com/rss/blog.xml"
                    }
                ]
            }
        },
        {
            "textEngine": {
                "engineName": "AlchemyAPI"
            }
        },
        {
            "featureEngine": {
                "engineName": "OpenCalais"
            }
        }
    ]
}


engineConfig Example

You can use the engineConfig object to pass configuration parameters along to the feature engine.

In this example, the Alchemy API is configured to process documents in batches of 100 and to return a maximum of 5 keywords per document. The strict setting returns higher-quality keywords, but fewer keywords overall.

        {
            "featureEngine": {
                "engineName": "AlchemyAPI-metadata",
                "engineConfig": {
                    "app.alchemyapi-metadata.batchSize": 100,
                    "app.alchemyapi-metadata.numKeywords": 5,
                    "app.alchemyapi-metadata.strict": "true"
                }
            }
        }

 

For documentation of possible engineConfig parameters, see section Automated text extraction.
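Since the engineConfig keys in the example above share a common prefix, a source definition generated by a script could build them from short option names. The helper below is a hypothetical convenience, not an Infinit.e API; the prefix and option names are taken from the example, and other engines may use different naming.

```javascript
// Hypothetical helper: prefixes short option names with the engine's namespace
// and wraps the result in a "featureEngine" pipeline element.
function buildFeatureEngine(engineName, prefix, options) {
  const engineConfig = {};
  for (const [key, value] of Object.entries(options)) {
    engineConfig[prefix + "." + key] = value;
  }
  return { featureEngine: { engineName, engineConfig } };
}

const element = buildFeatureEngine("AlchemyAPI-metadata", "app.alchemyapi-metadata", {
  batchSize: 100,
  numKeywords: 5,
  strict: "true", // kept as a string, as in the example above
});
console.log(JSON.stringify(element, null, 2));
```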

 

Legacy documentation:

  • Replaces "useExtractor" in the Source object
  • (Note "criteria" above is not currently supported - coming soon!)
