IKANOW Flow Builder

Overview

The IKANOW platform provides a JSON-based configuration language for building sources, a UI to help with configuring and connecting the JSON elements, and various utility applications that help configure specific input types (RSS, datasift).

The philosophy was to have a small number of logically different elements, each containing large amounts of functionality and flexibility. This is considered optimal for "power users".

More recently we have started to look at ways of improving the platform's usability for a wider base of users, who may not have scripting or programming experience. For these users it is preferable to provide a large number of inflexible components - inflexible meaning the parameters that need to be entered to configure a component should be simple and small in number.

In keeping with our design principle of providing "plug-in points" rather than mandating how the platform should be used, we are not going to try to pre-define what that large number of components should be (indeed, it will change with time and vary widely across different domains, e.g. social media vs cyber intelligence). Instead we have provided a simple method for "power users" to build and upload their own components, which then map back to the built-in IKANOW source pipeline.

Users can then drag, drop, connect and configure the components relevant to them to build up complex "ETL flows" in an intuitive way. 

The next section describes how to use the flow builder, given a set of components, and the section after that describes how to build components.

Using the flow builder to create sources

New sources

From the Source Editor, create a "New Source" (top right button), and select the "Empty Source Template" (one of the "user/shared templates", if created using the flow builder).

Enter a title/description/tags etc as per normal, and save.

Once saved, use the "FLOW UI" button in the top left of the JSON editor to launch the UI:

(NOTE: the source editor sometimes loads more quickly than the flow builder, in which case pressing this button will give an error telling you to wait a few seconds and try again)

This brings up a blank canvas with a search bar/options button in the top left corner. Pressing the button brings up another floating menu; pressing the "+" button on that menu brings up the list of available components:

Each of these components can be dragged into the canvas by the blue square next to their names (apart from dataflow-input and dataflow-output - these are used in the subgraph functionality described below).

Once on the canvas, components can be connected from their "Output" connection to another component's "Input" connection - this means that documents will flow from one component to the next:

 

Notes:

  • After dragging a component onto the canvas, the list disappears - use the above two button presses to bring it back.
  • Components that start with "1/input" do not have an "Input" connection and should start the flow. Only one of those per flow is currently allowed.
  • No "end component" is necessary - when a document reaches the end of a flow it is placed in the IKANOW datastore.
Editing components

To edit a component, select it in the canvas, and a dialog box appears to the left:

There are a few different operations that can be performed:

  • Delete the component: press the "scissors" icon, bottom left
  • Set the title: click on the current title (immediately below the "Search" menu, e.g. "6/entities/Regex to entity" in the example image above) and type
  • Set the parameters for that component: simply edit the text boxes etc. below the title and type (Note: ignore the Input/Output fields - they are reserved)
Editing connections

Select the connection and a different dialog box appears. Allowed operations:

  • Press the "x" button to remove the connection
  • Set the title: click on the text immediately below the "x" button ("Edge") in the above example and type
  • Change the color: this is just for display purposes, there is no difference in functionality.
Conditionals and branches

Components starting with "0/conditional/" are a bit different in that they have two outputs: a "true" and a "false".

Documents that match the particular criteria (component-defined - see section below on component building) will flow down the "true" path, and documents that don't will flow down the "false" path. The paths can be re-combined simply by creating connections from multiple branches back down to a single component. Paths do not need to be combined explicitly, however: paths that are left open are implicitly recombined at their ends.
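The routing described above can be pictured with a small conceptual model. This is only a sketch under stated assumptions, not how the harvester is implemented: `routeDocuments` and the two path functions are hypothetical names, and the "urgent" tag is invented for illustration. Each document goes down exactly one path, and all documents end up in the same output regardless of the path taken (the implicit recombination).

```javascript
// Conceptual model only: send each document down the "true" or "false" path,
// then collect the results of both paths into a single output list.
function routeDocuments(docs, criteria, truePath, falsePath) {
    var out = [];
    docs.forEach(function (doc) {
        var path = criteria(doc) ? truePath : falsePath;
        out.push(path(doc));
    });
    return out;
}

var docs = [{ title: "Breaking: storm" }, { title: "Weather report" }];
var stored = routeDocuments(
    docs,
    function (doc) { return /^Breaking/.test(doc.title); },          // the conditional
    function (doc) { return { title: doc.title, tags: ["urgent"] }; }, // "true" path
    function (doc) { return { title: doc.title, tags: [] }; }          // "false" path
);
console.log(JSON.stringify(stored));
```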

 or 

Paths can be nested:

Certain graph types are not supported:

  • Merging from one sub-branch to a different branch, e.g.:

Subgraphs

Subgraphs (use the "dataflow-subgraph" component) allow groups of components to appear as a single element in the main view.

They are currently not supported but are coming soon.

Saving the source

To exit the flow builder, simply click on the border surrounding it (showing "greyed out" bits of the source editor).

The source is then saved, tested, and published as normal using the source editor - consult the linked documentation for more details.

Editing existing sources

Simply select the source to edit in the source editor, and press the "FLOW UI" button.

Building components

Component JSON definition

A component is defined by a set of JSON fields; some are mandatory and others are optional (see below).

 

{
	"name": "string", // e.g. "pdf_extract"
	"type": "string", // e.g. "input", so will generate the component path "input/pdf_extract"
	"fields": [ // will map to input ports
		{
			"fieldname": "string", // the field name of the input			
			"type": "string" // one of "string", "boolean", "int","float"					
		},
		//..
	],
	sourceBuilder: function(flowElement, source, pipeline, lastSourceElement) {
		// user callback that will take the JSON "flowElement", and use it to create a new source pipeline element in "pipeline" (==source.processingPipeline, source is provided so you can set tags and things)
		// "lastSourceElement" element just points to the last element returned from "sourceBuilder", or pipeline[pipeline.length-1] if null			
	},
	sourceValidator: function(flowElement, source, pipeline, lastSourceElement) {
		// same params as above, called before "sourceBuilder" - if returns a non-null string then source building is interrupted and the string is returned to the user
	}
}
Supported types
The following type values are supported:

  • "conditional" - creates an if-then-else element supporting splits in the flow logic.
  • "input", "globals", "extractors", "text", "metadata", "entities", "storage" - mostly used to order components in a hierarchy and to validate the overall flow.

 

  • The "fields" describe the input parameters by specifying a "fieldname" and a "type" for each input.


Please note that outputs are created based on the component's type: conditional elements have two outputs ("true" and "false"), whereas all other components have one output.
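As a hedged illustration of the definition format, the sketch below builds a minimal definition, derives the "type/name" component path mentioned in the comments of the JSON outline, and checks the type values against the lists given in this section. The helpers `componentPath` and `checkDefinition` are hypothetical and not part of the flow builder; the numeric prefix shown in the UI (e.g. "6/entities/...") is not modelled, and the lists are assumed to be complete.

```javascript
// Type values taken from this section (assumption: the lists are complete).
var SUPPORTED_TYPES = ["conditional", "input", "globals", "extractors",
                       "text", "metadata", "entities", "storage"];
var FIELD_TYPES = ["string", "boolean", "int", "float"];

// Hypothetical helper: builds the "type/name" path, e.g. "input/pdf_extract".
function componentPath(component) {
    return component.type + "/" + component.name;
}

// Hypothetical helper: returns an error string, or null if the types look valid.
function checkDefinition(component) {
    if (SUPPORTED_TYPES.indexOf(component.type) < 0) {
        return "Unsupported type: " + component.type;
    }
    var fields = component.fields || [];
    for (var i = 0; i < fields.length; i++) {
        if (FIELD_TYPES.indexOf(fields[i].type) < 0) {
            return "Unsupported field type: " + fields[i].type;
        }
    }
    return null;
}

var example = {
    name: "pdf_extract",
    type: "input",
    fields: [{ fieldname: "Path", type: "string" }]
};
console.log(componentPath(example));   // input/pdf_extract
console.log(checkDefinition(example)); // null
```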

Conditional Element

Conditional elements can be used to create a split component that handles if-then-else logic.

Each conditional element should return a "criteria" script attribute that is evaluated by the harvest control logic, e.g.:

"sourceBuilder": function(flowElement, source, pipeline, lastSourceElement) {
    if (null == flowElement.state) flowElement.state = {};
    var critString = "_doc[" + flowElement.state['Fieldname to test'] + "].matches(/" + flowElement.state['Regex'] + "/)";
    var element = {
        display: "Check a document field",
        criteria: "$SCRIPT( return " + critString + ";)"
    };
    pipeline.push(element);
},

Please note that the flow builder evaluates conditional components and logical splits and adds "$SETPATH()", "$PATH()" and "$SCRIPT()" to the criteria. These are used internally to keep track of branches in the logic.
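To see what the builder above produces, it can be run outside the flow builder with mock arguments. The sketch below copies the builder into a standalone function (`conditionalBuilder`, a name chosen here) and prints the generated criteria. Note that the field name is inserted unquoted; whether that is intended depends on how users are expected to fill in the field. The second function, `quotedCriteria`, is a hypothetical variant that quotes the field name with `JSON.stringify`, shown only as one possible alternative.

```javascript
// Standalone copy of the conditional sourceBuilder shown above.
function conditionalBuilder(flowElement, source, pipeline, lastSourceElement) {
    if (null == flowElement.state) flowElement.state = {};
    var critString = "_doc[" + flowElement.state['Fieldname to test'] +
        "].matches(/" + flowElement.state['Regex'] + "/)";
    pipeline.push({
        display: "Check a document field",
        criteria: "$SCRIPT( return " + critString + ";)"
    });
}

// Hypothetical variant: quotes the field name so it is used as a string key.
function quotedCriteria(state) {
    return "$SCRIPT( return _doc[" + JSON.stringify(state['Fieldname to test']) +
        "].matches(/" + state['Regex'] + "/);)";
}

var mockState = { 'Fieldname to test': 'title', 'Regex': '^Breaking' };
var mockPipeline = [];
conditionalBuilder({ state: mockState }, {}, mockPipeline, null);

console.log(mockPipeline[0].criteria);
// $SCRIPT( return _doc[title].matches(/^Breaking/);)
console.log(quotedCriteria(mockState));
// $SCRIPT( return _doc["title"].matches(/^Breaking/);)
```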

The sourceBuilder function and the sourceValidator functions

In order to add custom components to the Source Pipeline two functions need to be defined, the sourceBuilder function and the sourceValidator function (see below for details).

The sourceBuilder function uses input parameters from the flowElement and lastSourceElement and creates one or more source-element objects, which are stored in the pipeline.

Each source-element can be validated by the sourceValidator function. If a validation error occurs, the sourceValidator function is expected to return a string containing the validation error. If all went well, the function returns null.

During the component build process the following parameters are passed to the two functions:

  • flowElement - contains the input parameter values in its "state" attribute.
  • source - the JavaScript object representing the source, i.e. the parent object containing the processingPipeline. This can be used to set global variables, tags etc. in the source.
  • pipeline - this array will contain the output (pushed elements) that the sourceBuilder function creates.
  • lastSourceElement - the source element created by the preceding flow component.
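When developing a component it can help to exercise both callbacks outside the UI. The harness below is a minimal sketch: `runComponent` is a hypothetical helper, the cut-down `pathComponent` and its path value are invented for illustration, and the way the flow builder itself invokes the callbacks may differ. It follows the order described above: the validator runs first, and a non-null string interrupts the build.

```javascript
// Hypothetical harness: run sourceValidator, then sourceBuilder if validation passes.
function runComponent(component, state, source, lastSourceElement) {
    var flowElement = { state: state };
    var pipeline = [];
    var error = component.sourceValidator(flowElement, source, pipeline, lastSourceElement);
    if (null != error) {
        return { error: error, pipeline: [] }; // building is interrupted
    }
    component.sourceBuilder(flowElement, source, pipeline, lastSourceElement);
    return { error: null, pipeline: pipeline };
}

// Cut-down illustrative component (not one of the shipped examples).
var pathComponent = {
    name: "Example path input",
    type: "input",
    fields: [{ fieldname: "Path", type: "string" }],
    sourceValidator: function (flowElement, source, pipeline, lastSourceElement) {
        if (null == flowElement.state || null == flowElement.state['Path']) {
            return 'No path specified';
        }
        return null;
    },
    sourceBuilder: function (flowElement, source, pipeline, lastSourceElement) {
        pipeline.push({ display: "Local files", file: { url: flowElement.state['Path'] } });
    }
};

var okResult = runComponent(pathComponent, { 'Path': 'file:///tmp/data/' }, {}, null);
var badResult = runComponent(pathComponent, {}, {}, null);
console.log(JSON.stringify(okResult.pipeline));
console.log(badResult.error); // No path specified
```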

 

Typical outline of the sourceBuilder function:

The following example describes the steps taken to build a source element for the JSON file extractor component:

// preparation - split the comma-separated 'Root Objects' input into an array
var rootObjects = [];
if (null != flowElement.state['Root Objects']) {
    rootObjects = flowElement.state['Root Objects'].split(',');
}

// step 1 - create an element to return to the source's processingPipeline
var element = {
    display: "Local JSON files",
    file: {
        // step 2 - use flowElement's state (map) variable to get access to the input values from the UI
        XmlPrimaryKey: flowElement.state['Primary Key'],
        XmlSourceName: flowElement.state['URL Prefix'],
        XmlRootLevelValues: rootObjects,
        pathInclude: flowElement.state['Include Filter'],
        pathExclude: flowElement.state['Exclude Filter'],
        url: flowElement.state['Path']
    }
};

// step 3 - do some more custom processing
if (flowElement.state['Delete Once Done']) {
    element.file.renameAfterParse = ".";
}

// step 4 - push the element into the pipeline
pipeline.push(element);

 

The source parameter can be used to insert additional attributes into the source e.g.:

source.tags="news";

 

Sometimes it might be necessary to modify the previous source element, e.g. if you have a featureEngine that runs different blocks of functionality. In that case the lastSourceElement parameter can be used to make  modifications to the previous element:

lastSourceElement.featureEngine.engineConfig[newId++] = nextConfigOption
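A slightly fuller sketch of this pattern is shown below. It is an illustration under assumptions rather than a shipped component: the builder name `extendFeatureEngine`, the field names 'Option Name' and 'Option Value', and the option key "python.extra" are all hypothetical. The builder adds an option to the previous element when that element is a featureEngine, and otherwise pushes a new featureEngine element of its own.

```javascript
// Hypothetical builder: extend the previous featureEngine element if there is one,
// otherwise push a new featureEngine element.
function extendFeatureEngine(flowElement, source, pipeline, lastSourceElement) {
    // Fall back to the last pipeline entry when lastSourceElement is null.
    var previous = (null != lastSourceElement) ? lastSourceElement : pipeline[pipeline.length - 1];
    var key = flowElement.state['Option Name'];
    var value = flowElement.state['Option Value'];
    if (previous && previous.featureEngine) {
        if (null == previous.featureEngine.engineConfig) {
            previous.featureEngine.engineConfig = {};
        }
        previous.featureEngine.engineConfig[key] = value;
    } else {
        var engineConfig = {};
        engineConfig[key] = value;
        pipeline.push({
            display: "Feature engine",
            featureEngine: { engineName: "python", engineConfig: engineConfig }
        });
    }
}

var enginePipeline = [];
// First call: nothing to extend, so a new element is pushed.
extendFeatureEngine({ state: { 'Option Name': 'python.fileList', 'Option Value': 'a.txt' } },
    {}, enginePipeline, null);
// Second call: the previous element is a featureEngine, so it is modified in place.
extendFeatureEngine({ state: { 'Option Name': 'python.extra', 'Option Value': 'x' } },
    {}, enginePipeline, enginePipeline[0]);
console.log(JSON.stringify(enginePipeline));
```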

 

Uploading and editing component files using the File Uploader

One can use the File Uploader from the main Manager window  to create and share components:

 

Within the 'File Uploader' the 'Type' needs to be set as 'infinite_flow_component'. 

Please note that, for the components to be available within communities, one or more communities must be selected by highlighting them (hold down the CTRL key to select several).

Clicking on the edit button brings up an editor window where the component's JSON can be entered.

Function encoding

Generally, functions in JSON need to be encoded in a special format, e.g.:

{ "$fn": "<<function.toString()>>" } 

However, the editor takes care of this internally, so the functions can be edited 'as is' in place. They are encoded in the correct format when the editor content is saved (submit button).
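For readers who want to produce or inspect the encoded form outside the editor, the sketch below shows one way to convert functions to and from the { "$fn": ... } shape using the standard `JSON.stringify` replacer and `JSON.parse` reviver hooks. This is an assumption-laden illustration: `encodeFunctions` and `decodeFunctions` are hypothetical helpers, and the editor's own encoding may differ in detail from what `toString()` produces here. Decoding evaluates the stored text as code, so it should only be applied to trusted component files.

```javascript
// Hypothetical helper: replace every function value with { "$fn": "<source text>" }.
function encodeFunctions(component) {
    return JSON.stringify(component, function (key, value) {
        return (typeof value === "function") ? { "$fn": value.toString() } : value;
    }, 3);
}

// Hypothetical helper: turn { "$fn": "<source text>" } objects back into functions.
function decodeFunctions(json) {
    return JSON.parse(json, function (key, value) {
        if (value && typeof value === "object" && typeof value["$fn"] === "string") {
            return new Function("return (" + value["$fn"] + ");")();
        }
        return value;
    });
}

var encoded = encodeFunctions({
    name: "Example",
    type: "text",
    sourceValidator: function (flowElement) {
        return (null == flowElement.state) ? 'No parameters specified' : null;
    }
});
var decoded = decodeFunctions(encoded);

console.log(encoded);
console.log(decoded.sourceValidator({})); // No parameters specified
```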

The following example shows  the functions within the source editor:

 

Some example components
[
   {
      "name": "JSON Local File Extractor",
      "type": "input",
      "description": "JSON Local File Extractor",
      "fields": [
         {
            "fieldname": "Path",
            "type": "string"
         },
         {
            "fieldname": "Delete Once Done",
            "type": "boolean"
         },
         {
            "fieldname": "Include Filter",
            "type": "string"
         },
         {
            "fieldname": "Exclude Filter",
            "type": "string"
         },
         {
            "fieldname": "Root Objects",
            "type": "string"
         },
         {
            "fieldname": "Primary Key",
            "type": "string"
         },
         {
            "fieldname": "URL Prefix",
            "type": "string"
         }
      ],
      "sourceBuilder": function (flowElement, source, pipeline, lastSourceElement) {
         var rootObjects = [];
         if (null != flowElement.state['Root Objects']) {
             rootObjects = flowElement.state['Root Objects'].split(',');
         }                
    
         var element = {
                 display: "Local JSON files",
                 file: {
                        XmlPrimaryKey: flowElement.state['Primary Key'],
                        XmlSourceName: flowElement.state['URL Prefix'],
                        XmlRootLevelValues: rootObjects,
                        pathInclude: flowElement.state['Include Filter'],
                        pathExclude: flowElement.state['Exclude Filter'],
                        url: flowElement.state['Path']
                  }
         } ;  
         if (flowElement.state['Delete Once Done']) {
             element.file.renameAfterParse = ".";
         } 
         pipeline.push(element);       
    },
      "sourceValidator": function (flowElement, source, pipeline, lastSourceElement) {
        if (null == flowElement.state) {
            return 'No parameters specified';
        }
        if (null == flowElement.state['Path']) {
            return 'No path specified';
        }
        console.log('sourceValidator function'+flowElement+','+source+','+pipeline+','+lastSourceElement);
        return null;
    }
   },
   {
      "name": "Python Gazeteer Lookup",
      "type": "entities",
      "description": "Python Gazeteer Lookup",
      "fields": [
         {
            "fieldname": "List of Files",
            "type": "string"
         }
      ],
      "sourceBuilder": function (flowElement, source, pipeline, lastSourceElement) {
          var element = {
                display: "Python Gazeteer Lookup function.",                
                featureEngine: {
                      "engineName":"python",
                      "engineConfig": {
                        "python.fileList": flowElement.state['List of Files']
                      }
                }
        }   
        pipeline.push(element);        
      },
      "sourceValidator": function (flowElement, source, pipeline, lastSourceElement) {
          if ((null == flowElement.state) || (null == flowElement.state['List of Files'])) {
              return "Need to specify 'List of Files'"; 
          }
           console.log('sourceValidator function'+flowElement+','+source+','+pipeline+','+lastSourceElement);
        return null;
    }
   },
   {
      "name": "Split on Regex Match",
      "type": "conditional",
      "description": "Split on Regex Match",
      "fields": [
         {
            "fieldname": "Fieldname to test",
            "type": "string"
         },
         {
            "fieldname": "Regex",
            "type": "string"
         }
      ],
      "sourceBuilder": function (flowElement, source, pipeline, lastSourceElement) {
          if (null == flowElement.state) flowElement.state = {};
          var critString = "_doc[" + flowElement.state['Fieldname to test'] + "].matches(/" + flowElement.state['Regex'] + "/)";
          var element = {
              display: "Check a document field",
              criteria: "$SCRIPT( return " + critString + ";)"
          };
        pipeline.push(element);        
    },
      "sourceValidator": function (flowElement, source, pipeline, lastSourceElement) {
        if (null == flowElement.state) {
            return 'No parameters specified';
        }    
        console.log('sourceValidator function'+flowElement+','+source+','+pipeline+','+lastSourceElement);
        return null;
    }
   },
   {
      "name": "Regex to Entity",
      "type": "entities",
      "description": "Regex to Entity",
      "fields": [
         {
            "fieldname": "Regex",
            "type": "string"
         },
         {
            "fieldname": "Entity Type",
            "type": "string"
         },
         {
            "fieldname": "Entity Dimension",
            "type": "string"
         }
      ],
      "sourceBuilder": function (flowElement, source, pipeline, lastSourceElement) {
      if (null == flowElement.state) flowElement.state = {};    
      var regexElement = { // https://ikanow.jira.com/wiki/display/INFAPI/Content+metadata
            fieldName: Math.random().toString(36).substring(7),
            store:false,
            script: flowElement.state['Regex'],
            scriptlang:"regex"          
        };
        pipeline.push({ contentMetadata: [ regexElement ] });
        var entityElement = { // https://ikanow.jira.com/wiki/display/INFAPI/Manual+entities
            iterateOver: regexElement.fieldName,
            disambiguated_name: '$VALUE',
            type: flowElement.state['Entity Type'],
            dimension: flowElement.state['Entity Dimension']
        };
        pipeline.push({ entities: [ entityElement ] });
    },
      "sourceValidator": function (flowElement, source, pipeline, lastSourceElement) {
        if (null == flowElement.state) {
            return 'No parameters specified';
        }
        if (null == flowElement.state['Regex']) {
            return 'No regex specified';
        }    
        if (null == flowElement.state['Entity Type']) {
            return 'No entity type specified';
        }    
        console.log('sourceValidator function'+flowElement+','+source+','+pipeline+','+lastSourceElement);
        return null;
    }
   }
]

 

Tips for developing components

It is helpful to watch the output of the browser's JavaScript console when developing components (the example validators above log their arguments there).

Using a debugger, e.g. Firebug, is also recommended.