Source Pipeline Documentation

Processing in IKANOW

There are two types of complex processing possible in IKANOW's Infinit.e platform:

  • Input: transformation and storage of data of many different types into documents or records
    • Documents are larger, more complex objects, typically generated from richer XML/JSON or from natural-language-heavy sources such as web sites and reports
      • The Infinit.e platform provides a powerful pipeline of templated operations to transform these data types into our "generic document model"
    • Records are smaller objects such as single-line log entries, simple JSON objects, SQL records, etc
      • The Infinit.e platform places almost no restrictions on the format of the JSON or on how it is imported into the system, though we integrate particularly well with the popular community-driven platform "logstash"
  • Applying custom logic to existing documents and records in order to enrich the system with new data and functionality, such as:
    • "Reports" - eg spreadsheet-like or statistical data containing directly actionable information
    • New records and documents - typically alerts, or aggregate "events" made up of multiple documents/records
    • Lookup tables, which can then be used to enrich new and existing documents (eg local asset information) or to generate alerts (eg malicious domains)

IKANOW uses the popular Hadoop ecosystem to power its custom processing capabilities, integrating it with our output, management, monitoring and security layers.

The following diagram shows how these two different activities are related:

The same JSON-based configuration language, with an associated UI, can be used to build and maintain both types of pipeline. Typically the elements don't mix (ie a pipeline consists entirely of elements from either the "standard" set or the "custom" set), though there are some exceptions described below.
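As a brief orientation, the sketch below shows the general shape of a source configuration: a few descriptive fields plus a pipeline array of processing elements. This is an illustrative skeleton only - the exact field and element names should be checked against the element-specific pages linked below.

    {
        "key": "com.example.source",
        "title": "Example source",
        "description": "Skeleton showing the overall shape of a source configuration",
        "processingPipeline": [
            { "feed": { "extraUrls": [ { "url": "http://www.example.com/rss" } ] } }
        ]
    }

A standard input source would continue the pipeline array with text, metadata, entity and storage elements; a custom source would instead use elements from the "custom" set described later on this page.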

Input Sources

Overview

From the October 2013 release (alpha), harvesting and enrichment are based around the more logical concept of applying a pipeline of processing elements to documents emanating from a source.

This is illustrated visually below:

Note that in practice there is considerable flexibility in the ordering of pipeline elements. A pipeline must start with an extractor, any global elements must come next, and (currently) "Follow Web links" must come immediately after that, and only for Web/Feed extractors.

Aside from those constraints, the pipeline elements can appear in any order and with any cardinality.

  • For example, you could create metadata from the raw HTML (using xpath), then run an automated text extractor, then pull out more metadata using regex/javascript, then return to the original raw text, and then run a different automated extractor before creating entities.
  • A very useful scenario involves running the data through several entity extractors, potentially using the "criteria" field to choose which one to run based on the content and metadata extracted so far (a sketch of such a pipeline follows this list).
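The sketch below shows a pipeline that follows these rules: a web extractor first, then globals, then "Follow Web links", then text/metadata elements in an arbitrary order, and finally two entity extractors selected via "criteria". The element and field names are indicative only - see the element-specific pages linked below for the exact formats.

    "processingPipeline": [
        { "web": { "extraUrls": [ { "url": "http://www.example.com/reports" } ] } },
        { "globals": { "scripts": [ "function sharedHelper(text) { return text.trim(); }" ] } },
        { "links": { } },
        { "contentMetadata": [ { "fieldName": "report_type", "scriptlang": "xpath", "script": "//meta[@name='type']/@content" } ] },
        { "textEngine": { "engineName": "default" } },
        { "criteria": "<javascript expression: true for security-related documents>", "featureEngine": { "engineName": "<entity extractor A>" } },
        { "criteria": "<javascript expression: true otherwise>", "featureEngine": { "engineName": "<entity extractor B>" } }
    ]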

As can be seen in the diagram above, the pipeline elements can be approximately grouped into the following categories:

  • Extractors: generate mostly empty Infinit.e documents from external data sources
  • Globals: generate javascript artifacts that can be used by subsequent pipeline elements
  • Secondary extractors: enable new documents to be spawned from the existing document metadata
  • Text extraction: manipulation of the raw document content
  • Metadata: generate document metadata such as title, description, date; and arbitrary content metadata using xpath, regex, and javascript
  • Entities and associations: create entities and associations out of the text
  • Storage and indexing: decide which documents to keep, what fields to keep, and what to full text index (for searching using the GUI/API)

The remainder of this page provides more details on each of the pipeline elements available under each of these categories.

IN PROGRESS: most of these links currently show the object used to configure the corresponding pipeline element (in either JSON or Java POJO format), and point to the legacy documentation for the functionality being replaced.

Document extractors

Global processing

Secondary extractors

Text processing

Metadata enrichment

Entities and associations

Storage and Indexing settings

Other topics

Custom Processing Sources

Overview

From the November 2014 release, the source editor can be used to build and maintain custom processing. (Prior to that, it was necessary to use the Plugin Manager UI.)

As described in the overview above, the custom processing engine is a highly customizable Hadoop-based workflow. In essence:

  • It primarily takes in records and documents
    • It can also take in external files and the results of already-run custom jobs
  • It applies different types of processing to the data:
    • Generic templated business logic using Java or scripting languages (currently JavaScript; Python is on the roadmap)
    • A growing set of built-in configurable functions
      • Generic high speed aggregation
      • Generic high speed filtering and joining on other data sources
      • Data format exploration
    • Roadmap: real-time analytics using Storm-over-Hadoop
  • The output of the processing is (with a few exceptions) objects representing things like:
    • New documents or records (alerts, groups of documents eg tweets)
    • Lookup tables
    • Flat reports
    • (exceptions: export to file, modify existing documents)

The remainder of this section categorizes the different elements that can be used to build up functionality:

  • Custom Inputs: Which data to bring in, what filtering and transforms to apply
  • Custom Control: Scheduling and other controls
  • Custom Processing: Generic scripting or customizable templates, single or chained (roadmap)
  • Custom Outputs: Where the results of the data end up (documents, records, custom tables)
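Conceptually, a custom processing source composes elements from each of these categories into the same pipeline structure used for input sources. The sketch below is purely structural - the element names are placeholders rather than the exact configuration keys, which are documented on the category pages linked below.

    "processingPipeline": [
        { "<custom input element>": { "note": "which documents/records/tables to read, plus any filters or transforms" } },
        { "<custom control element>": { "note": "scheduling, eg run once per day" } },
        { "<custom processing element>": { "note": "templated logic, or scripted logic in JS/Java" } },
        { "<custom output element>": { "note": "write the results out as a table, records or documents" } }
    ]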

Custom Inputs

Custom Control

Custom Processing

Custom Outputs

  • Table Output
  • Record Output (ROADMAP)
  • Document Output (ROADMAP)