Source Pipeline Documentation
Processing in IKANOW
There are two types of complex processing possible in IKANOW's Infinit.e platform:
- Input, transformation and storage of data of many different types - into documents or records (a sketch contrasting the two follows this list)
  - Documents are larger and more complex objects, typically generated from richer XML/JSON and from natural-language-heavy sources such as web sites and reports
    - The Infinit.e platform provides a powerful pipeline of templated operations to transform these data types into our "generic document model"
  - Records are smaller objects such as single-line log records, simple JSON objects, SQL records, etc
    - The Infinit.e platform places almost no restrictions on the format of the JSON or how it is imported into the system, though it integrates particularly well with the popular community-driven platform "logstash"
- Applying custom logic to existing documents and records in order to enrich the system with new data and functionality such as:
  - "Reports" - eg spreadsheet-like or statistical data containing directly actionable information
  - New records and documents - typically alerts, or aggregate "events" made up of multiple documents/records
  - Lookup tables, which can then be used to enrich new and existing documents (eg local asset information) or to generate alerts (eg malicious domains)
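To make the distinction concrete, below is a minimal hypothetical sketch of the two shapes of data. The field names are purely illustrative (the record is just arbitrary JSON; the document fields loosely follow the spirit of the generic document model) - consult the document model reference for the exact schema.

```
// A "record": a small, flat JSON object, eg one parsed log line (hypothetical example)
{
  "@timestamp": "2014-11-05T13:02:11Z",
  "host": "web01",
  "message": "Failed password for admin from 10.0.0.5 port 22 ssh2"
}

// A "document": a larger object produced by a source pipeline
// (field names are illustrative only - see the document model reference for the real schema)
{
  "title": "Example quarterly threat report",
  "url": "http://example.com/reports/q3.html",
  "publishedDate": "2014-10-01T00:00:00Z",
  "fullText": "...automatically extracted text...",
  "entities": [ { "disambiguated_name": "Example Corp", "type": "Company" } ],
  "metadata": { "reportType": "quarterly" }
}
```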
IKANOW uses the popular Hadoop ecosystem to power its custom processing capabilities, integrating it into our output, management, monitoring and security layers.
The following diagram shows how these 2 different activities are related:
The same JSON-based configuration language, with its associated UI, can be used to build and maintain both types of pipeline. Typically the elements don't mix (ie a pipeline consists entirely of elements from either the "standard" set or the "custom" set), though there are some exceptions described below.
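As a rough sketch (assuming the usual JSON source format, where a "processingPipeline" array holds the pipeline elements in order), both kinds of pipeline share the same overall shape and differ only in which elements appear in the array. The element names below are placeholders for the real elements documented in the rest of this page:

```
// Illustrative skeleton only - see the individual element pages for the exact element and field names
{
  "title": "Example source",
  "description": "A pipeline is just an ordered array of elements",
  "processingPipeline": [
    { "web": { /* a "standard" extractor element */ } },
    { "textEngine": { /* automated text extraction */ } },
    { "searchIndex": { /* storage and indexing settings */ } }
  ]
}
```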
Input Sources
Overview
From the October 2013 release (alpha), harvesting and enrichment follow a more logical process, based around the concept of applying a pipeline of processing elements to the documents emanating from a source.
This is illustrated visually below:
Note that in practice there is considerable flexibility in the ordering of pipeline elements. A pipeline must start with an extractor; any global elements must come next; and (currently) "Follow Web links" must come immediately after that, and applies to Web/Feed extractors only.
Aside from that, the pipeline elements can appear in any order and with any cardinality.
- For example, you could create metadata from the raw HTML (using xpath), then run an automated text extractor, then pull out more metadata using regex/javascript, then return to the original raw text, and then run a different automated extractor before creating entities (this scenario is sketched after this list)
- A very useful scenario involves running the data through several entity extractors, potentially using the "criteria" field to choose which one to run based on the content and metadata extracted so far
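The first scenario above might look roughly like the sketch below. The element names ("contentMetadata", "textEngine", "text", "featureEngine") and all of their fields are assumptions standing in for the real elements documented later on this page; the final element also hints at the "criteria" usage from the second scenario.

```
// Sketch only - element and field names are assumptions, see the element pages below for the real ones
"processingPipeline": [
  { "web": { /* extractor */ } },
  { "contentMetadata": [ { "fieldName": "author", "scriptlang": "xpath", "script": "//meta[@name='author']/@content" } ] },
  { "textEngine": { /* first automated text extractor */ } },
  { "contentMetadata": [ { "fieldName": "cve", "scriptlang": "regex", "script": "CVE-\\d{4}-\\d+" } ] },
  { "text": [ { /* return to the original raw text */ } ] },
  { "textEngine": { /* a different automated text extractor */ } },
  { "featureEngine": { /* entity extractor, with a "criteria" script deciding whether to run */ } }
]
```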
As can be seen in the diagram above, the pipeline elements can be approximately grouped into the following categories:
- Extractors: generate mostly empty Infinit.e documents from external data sources
- Globals: generate javascript artifacts that can be used by subsequent pipeline elements
- Secondary extractors: enable new documents to be spawned from the existing document metadata
- Text extraction: manipulation of the raw document content
- Metadata: generate document metadata such as title, description, date; and arbitrary content metadata using xpath, regex, and javascript
- Entities and associations: create entities and associations out of the text
- Storage and indexing: decide which documents to keep, what fields to keep, and what to full text index (for searching using the GUI/API)
The remainder of this page provides more details on each of the pipeline elements available under each of these categories.
IN PROGRESS: most of these links currently provide the object used to configure the corresponding pipeline element (in either JSON or Java POJO format), and point to the legacy documentation for the functionality being replaced.
Document extractors
- Feed extractor
- Web extractor
- File extractor
- Database extractor
- Logstash extractor
- Federated Query Source
- Post Processing (enterprise only) (ROADMAP)
Global processing
- Harvest control settings
- Javascript globals
- Lookup Tables
- Aliasing (not currently supported) (ROADMAP)
Secondary extractors
Text processing
Metadata enrichment
Entities and associations
Storage and Indexing settings
Other topics
Custom Processing Sources
Overview
From the November 2014 release, the source editor can be used to build and maintain custom processing. (Prior to that it was necessary to use the Plugin Manager UI).
As described in the overview above, the custom processing engine is a highly customizable Hadoop-based workflow. In essence:
- It takes in primarily records and documents
  - It can also take in external files and the results of already-run custom jobs
- It applies different types of processing to the data:
  - Generic templated business logic using Java or scripting languages (currently JavaScript; Python is on the roadmap)
  - A growing set of built-in configurable functions
  - Generic high speed aggregation
  - Generic high speed filtering and joining against other data sources
  - Data format exploration
  - Roadmap: real-time analytics using Storm-over-Hadoop
- The output of the processing is (with a few exceptions) objects representing things like the following (see the examples after this list):
  - New documents or records (alerts, or groups of documents, eg tweets)
  - Lookup tables
  - Flat reports
  - (exceptions: export to file, modification of existing documents)
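For example, the output of a custom job might be nothing more exotic than JSON objects along these lines (entirely hypothetical, shown only to indicate the kind of object each output type represents):

```
// A new "alert" record produced by a custom job (hypothetical)
{ "alertType": "bruteForce", "host": "web01", "failedLogins": 214, "window": "5 minutes" }

// One entry of a lookup table mapping assets to owners (hypothetical)
{ "key": "10.0.0.5", "value": { "assetOwner": "ops-team", "criticality": "high" } }

// One row of a flat report (hypothetical)
{ "day": "2014-11-04", "community": "soc", "documentCount": 18234 }
```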
The remainder of this section categorizes the different elements that can be used to build up this functionality (a shape sketch follows the list):
- Custom Inputs: Which data to bring in, what filtering and transforms to apply
- Custom Control: Scheduling and other controls
- Custom Processing: Generic scripting or customizable templates, single or chained (roadmap)
- Custom Outputs: Where the results of the data end up (documents, records, custom tables)
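Putting the four categories together: a custom processing source uses the same JSON pipeline structure as a standard source (here assumed to be the "processingPipeline" array), just built from custom elements. The sketch below shows only the overall input/control/processing/output shape - the comments are placeholders, not real configuration keys (see the element pages that follow):

```
// Shape sketch only - placeholders, not real configuration keys
{
  "title": "Example custom job",
  "processingPipeline": [
    { /* custom input, eg "Process Existing Docs, simple query" */ },
    { /* custom control, eg scheduling */ },
    { /* custom processing, eg the distributed scripting engine */ },
    { /* custom output, eg table output */ }
  ]
}
```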
Custom Inputs
- Distributed File Input
- Process Existing Docs, simple query
- Process Existing Docs, complex (Infinit.e) query
- Process Existing Records
- Process Custom Results
- Process Entity and Association Features
Custom Control
Custom Processing
- Run Built-in/Custom Hadoop Module
- Run Distributed Scripting Engine
- Run Custom Hadoop Mapper/Combiner/Reducer
Custom Outputs
- Table Output
- Record Output (ROADMAP)
- Document Output (ROADMAP)