Processing in IKANOW

There are two types of complex processing possible in IKANOW's Infinit.e platform:

IKANOW uses the popular Hadoop ecosystem to power its custom processing capabilities, integrating it with our output, management, monitoring, and security layers.

The following diagram shows how these two different activities are related:

The same JSON-based configuration language, with its associated UI, can be used to build and maintain both types of pipeline. Typically the elements don't mix (i.e. a pipeline consists entirely of elements from either the "standard" set or the "custom" set), though there are some exceptions described below.

Input Sources

Overview

Since the October 2013 release (alpha), harvesting and enrichment have followed a more logical model: a pipeline of processing elements is applied to the documents emanating from a source.

This is illustrated visually below:

Note that in practice there is considerable flexibility in the order of pipeline elements. The only constraints are that the pipeline must start with an extractor, any global elements must come next, and (currently) "Follow Web links", which is available for Web/Feed extractors only, must come immediately after those.

Aside from that, the pipeline elements can be in any order and have any cardinality.
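
The ordering constraints above can be sketched as a small validation routine. This is only an illustration, not part of the platform: it assumes a pipeline is represented as a list of single-key JSON objects, and the element names used here (`web`, `feed`, `file`, `database`, `globals`, `links`) are hypothetical stand-ins for whatever keys the actual configuration objects use.

```python
# Illustrative sketch only: the element names below are assumptions,
# not the platform's confirmed configuration keys.
EXTRACTORS = {"web", "feed", "file", "database"}
WEB_FEED_EXTRACTORS = {"web", "feed"}
GLOBAL_ELEMENTS = {"globals"}
FOLLOW_LINKS = "links"


def validate_pipeline_order(pipeline):
    """Return a list of ordering problems (empty if the order is acceptable).

    `pipeline` is a list of single-key dicts, e.g.
    [{"web": {...}}, {"globals": {...}}, {"links": {...}}].
    """
    # Each element is identified by its single top-level key.
    names = [next(iter(element)) for element in pipeline]
    problems = []

    # Rule 1: the pipeline must start with an extractor.
    if not names or names[0] not in EXTRACTORS:
        problems.append("pipeline must start with an extractor")
        return problems

    # Rule 2: global elements, if any, come directly after the extractor.
    position = 1
    while position < len(names) and names[position] in GLOBAL_ELEMENTS:
        position += 1

    for index, name in enumerate(names[1:], start=1):
        if name in GLOBAL_ELEMENTS and index >= position:
            problems.append(
                f"element {index}: global elements must directly follow the extractor"
            )
        elif name == FOLLOW_LINKS:
            # Rule 3: link following is for Web/Feed extractors only, and
            # must come immediately after the extractor and global elements.
            if names[0] not in WEB_FEED_EXTRACTORS:
                problems.append(
                    f"element {index}: link following requires a Web/Feed extractor"
                )
            elif index != position:
                problems.append(
                    f"element {index}: link following must come right after "
                    "the extractor and global elements"
                )

    # Everything else may appear in any order and with any cardinality.
    return problems
```

For example, `validate_pipeline_order([{"web": {}}, {"globals": {}}, {"links": {}}, {"text": {}}])` returns an empty list, while a pipeline that places `links` after a hypothetical `file` extractor reports a problem.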

As can be seen in the diagram above, the pipeline elements can be approximately grouped into the following categories:

The remainder of this page provides more details on each of the pipeline elements available under each of these categories.

IN PROGRESS: most of these links currently lead to the object used to configure the corresponding pipeline element (in either JSON or Java POJO format), and point to the legacy documentation for the functionality being replaced.

Document extractors

Global processing

Secondary extractors

Text processing

Metadata enrichment

Entities and associations

Storage and Indexing settings

Other topics

Custom Processing Sources

Overview

From the November 2014 release, the source editor can be used to build and maintain custom processing. (Prior to that, it was necessary to use the Plugin Manager UI.)

As described in the overview above, the custom processing engine is a highly customizable Hadoop-based workflow. In essence:

The remainder of this section categorizes the different elements that can be used to build up functionality:

Custom Inputs

Custom Control

Custom Processing

Custom Outputs