Overview

This method provides source builders who need to add JavaScript to the enrichment process with a single global set of variables and functions that can be used in the individual "scriptlets" that other elements provide. The scripts provided (and likewise the inputs) are executed once per source per harvest cycle.

For general information about using javascript with IKANOWS, see Using Javascript.

Format

Code Block
{
	"display": string,
	"globals": {
		"imports": [ string ], // An optional list of URLs that get imported before the scripts below are run
		"scripts": [ string ], // A list of (java)script code blocks that are evaluated in order (normally only need to specify one)
		"scriptlang": string // Currently only "javascript" is supported
	}
}

Description

The Infinit.e platform supports scripting the transformation of source data using JavaScript via Rhino, Mozilla's open-source JavaScript implementation (http://www.mozilla.org/rhino/). The following document provides an introduction to specifying JavaScript based data transformation via the Structured Analysis Harvester object.

Info

Note that unless turned off from the configuration files (via the "harvest.security" property), Javascript is prevented by the Java security manager from doing the following:

  • "Internal" network access (i.e. to addresses 127.*.*.*, 10.*.*.* or 192.168.*.*)
  • File access.

Importing

The Infinit.e Structured Analysis Harvester supports importing of JavaScript functions in two ways currently:

  • specifying a javascript code block
  • specifying a list of URLs of JavaScript locations that can be imported.

 

Info

JavaScript functions imported via either of the two means described above are passed to the Rhino script engine via the ScriptEngine.eval method which allows the functions to be called within the scope of the current document being harvested. Examples of imported functions can be found below.

Note: the "script" context does not have access to any of the objects described below (like "_doc"), it can only be used for declaring functions to be used in the entity/association/docGeo scriptlets.

 

...

Field Level Transformations

Individual fields from a data source can be transformed using JavaScript by either calling an imported function (described above) or by using inline JavaScript.

Inline JavaScript

Inline JavaScript is enclosed within $SCRIPT( ). During the harvesting process the system:

  1. Extracts the JavaScript code contained within the $SCRIPT() block
  2. Wraps the script with the following generic script block:

    Code Block
    function getValue() { ... }
  3. Passes the function to the ScriptEngine using the ScriptEngine.eval() method
  4. Calls the getValue() method using the .invokeFunction() method.
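
The four steps above can be sketched in plain JavaScript. This is only an illustration, not the harvester's code: the real system runs inside Rhino and uses ScriptEngine.eval() and .invokeFunction(), whereas here the Function constructor stands in for both. The runInlineScript helper and the sample postedTime value are assumptions made for the example.

```javascript
// Hypothetical stand-in for the harvester's handling of a $SCRIPT( ) field value.
function runInlineScript(fieldValue, doc) {
    // 1. Extract the code contained within the $SCRIPT( ) block
    var match = /^\$SCRIPT\(([\s\S]*)\)$/.exec(fieldValue.trim());
    if (match === null) {
        return fieldValue; // not an inline script: keep the literal value
    }
    // 2. Wrap the code in the generic getValue() function
    var wrapped = "function getValue() { " + match[1] + " }";
    // 3. Evaluate the wrapped function with _doc in scope, then
    // 4. call getValue() to obtain the field value
    var evaluate = new Function("_doc", wrapped + "; return getValue();");
    return evaluate(doc);
}

// Made-up document containing only the field the script reads
var sampleDoc = { metadata: { json: [ { postedTime: "2012-11-08T18:02:02.000Z" } ] } };

var publishedDate = runInlineScript(
    "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
    sampleDoc);
console.log(publishedDate);
```

A value that is not wrapped in $SCRIPT( ) is returned unchanged by this sketch, which is one simple way to let literal and scripted field values share a single code path.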

Calling Imported JavaScript Functions

Calling functions previously imported into the ScriptEngine is done by enclosing the function name to be called within the $FUNC() block as shown above in the "title" field. During the harvesting process the harvester:

  1. Extracts the name of the function to execute from the $FUNC() block
  2. Calls the specified function using the .invokeFunction() method.

 

Info

$FUNC only has meaning when it encloses the entirety of the string. For calling functions inside $SCRIPT blocks (described above), just invoke the function normally.
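
A minimal sketch of the $FUNC rule described above, assuming the block contains just the bare function name. The importedFunctions registry, the getTitle function and the resolveField helper are all hypothetical; in the harvester the lookup and call are done by the ScriptEngine's .invokeFunction() method.

```javascript
// Hypothetical registry standing in for functions already imported into the ScriptEngine
var importedFunctions = {
    getTitle: function () { return "title from imported function"; }
};

function resolveField(fieldValue) {
    // $FUNC is only recognised when it encloses the entire string
    var match = /^\$FUNC\((.*)\)$/.exec(fieldValue);
    if (match === null) {
        return fieldValue; // treated as a literal value
    }
    var name = match[1].trim();
    var fn = importedFunctions[name];
    if (typeof fn !== "function") {
        throw new Error("No imported function named " + name);
    }
    return fn();
}

console.log(resolveField("$FUNC(getTitle)"));
console.log(resolveField("See $FUNC(getTitle)")); // not the whole string, so left as-is
```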

 

As Document Metadata iterates over the documents to be harvested, it passes each document to the Rhino ScriptEngine. The JSON-based document passed into the ScriptEngine is then converted into an object via JavaScript's eval() method (i.e. var _doc = eval('(' + document + ')')). Fields within the document are then available to functions and inline scripts using the JavaScript dot or subscript operators, as shown below.

 

Code Block
var description = _doc.metadata.description[0];
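
As a slightly fuller sketch of the same idea, the snippet below builds _doc from a JSON string and reads fields with both operators. The JSON content is made up for illustration, and JSON.parse is used here in place of the eval() call the harvester is described as using.

```javascript
// Made-up document JSON; only the _doc name comes from the text above
var documentJson = '{"title": "Example", "metadata": {"description": ["first", "second"]}}';

// Convert the JSON text into an object (JSON.parse rather than eval in this sketch)
var _doc = JSON.parse(documentJson);

var viaDot = _doc.metadata.description[0];              // dot operator
var viaSubscript = _doc["metadata"]["description"][1];  // subscript operator

console.log(viaDot, viaSubscript);
```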

Setting Field Values Example

In the following example, _doc.metadata.json is used in JavaScript to set some metadata values for the ingested Twitter data.

 

...

 

The globals element declares functions that the other elements that use JavaScript can access.

Field        Description
imports      An optional list of URLs that get imported before the scripts below are run
scripts      A list of (java)script code blocks that are evaluated in order (normally only need to specify one)
scriptlang   Currently only "javascript" is supported

 

Examples

In the example below globals is used to declare some javascript functions.  They are accessed by the docMetadata elements below to set some metadata values for the documents.

Code Block
{
    "globals": {
        "scripts": [
            "function getAddressVal( addressStr, number) { try { var addressArray = addressStr.split(/ *, */); if (addressArray != null && addressArray.length > 0) { if (addressArray[number].toLowerCase()=='ny') { return 'new york'; } else if (addressArray[number].toLowerCase()=='long island' || addressArray[number].toLowerCase()=='li') { return 'medford'; } else { return addressArray[number]; } } else { return ''; } } catch (err) { return ''; } } function getRegion( code ) { if (code.toLowerCase()=='ny') {return 'New York';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else if (code.toLowerCase()=='ct') {return 'Connecticut';} else if (code.toLowerCase()=='md') {return 'Maryland';} else if (code.toLowerCase()=='va') {return 'Virginia';} else if (code.toLowerCase()=='pa') {return 'Pennsylvania';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else {return 'New York';} }"
        ]
    }
},
{
    "docMetadata": {
        "title": "$metadata.json.body",
        "description": "$metadata.json.body",
        "fullText": "$metadata.json.body",
        "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
        "geotag": {
            "lat": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[0];} catch (err) {return '';})",
            "lon": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[1];} catch (err) {return '';})"
        }
    }
}
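
Because the two functions are packed into a single JSON string, they are hard to read in the configuration itself. The sketch below lays the same logic out on separate lines (the unreachable second 'nj' branch is dropped) and calls it with a few inputs. The sample address strings are invented for illustration.

```javascript
// Returns the number-th comma-separated part of an address, with a few special cases
function getAddressVal(addressStr, number) {
    try {
        var addressArray = addressStr.split(/ *, */);
        if (addressArray != null && addressArray.length > 0) {
            var part = addressArray[number].toLowerCase();
            if (part == 'ny') { return 'new york'; }
            else if (part == 'long island' || part == 'li') { return 'medford'; }
            else { return addressArray[number]; }
        } else {
            return '';
        }
    } catch (err) {
        return ''; // e.g. null input or an index past the end of the array
    }
}

// Maps a state code to a region name, defaulting to 'New York'
function getRegion(code) {
    var c = code.toLowerCase();
    if (c == 'ny') { return 'New York'; }
    else if (c == 'nj') { return 'New Jersey'; }
    else if (c == 'ct') { return 'Connecticut'; }
    else if (c == 'md') { return 'Maryland'; }
    else if (c == 'va') { return 'Virginia'; }
    else if (c == 'pa') { return 'Pennsylvania'; }
    else { return 'New York'; }
}

console.log(getAddressVal("Medford, NY", 1));
console.log(getAddressVal("Medford, NY", 5));
console.log(getRegion("nj"));
```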

 

Sample Output:

The following sample output shows the resultant metadata about the Twitter feed.

Code Block
    }]},
    "modified": "Nov 8, 2012 06:02:44 PM UTC",
    "publishedDate": "Nov 8, 2012 06:02:02 PM UTC",
    "source": ["gnip test"],
    "sourceKey": [".mnt.fileshare.datasift.gnip."],
    "sourceUrl": "file:/mnt/fileshare/datasift/gnip/gnip.json",
    "tags": [
        "twitter",
        "gnip"
    ],
    "title": "Amex Teams With Halo 4 on Master Chief Incentives http://t.co/IvwmjJyV #crm",
    "url": "http://twitter.com/FocalCRM/statuses/266601489475186688"
}

Complex Arrays

When document metadata iterates over a JSON array, each item in the array is passed into the ScriptEngine and made accessible via an object named _iterator.

Example With Associations

The following associations block shows iterateOver being used to specify the array "json.twitter_entities.hashtags". This enables the data to be extracted from the JSON array to create metadata.

Code Block
            {
                 "assoc_type": "Event",
                 "entity1_index": "$SCRIPT( return _doc.metadata.json[0].actor.preferredUsername + '/twitterhandle';)",
                 "entity2_index": "$SCRIPT( return _iterator.text + '/hashtag';)",
                 "iterateOver": "json.twitter_entities.hashtags",
                 "verb": "tweets_about",
                 "verb_category": "tweets_about"
             }
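
The effect of iterateOver can be sketched as a loop that binds each array item to _iterator and builds one association per item. Everything here is illustrative: the tweet document is made up, the resolveIterateOver helper is hypothetical, and the rule of stepping into the first element when the path crosses an array (as "json" does) is an assumption rather than documented harvester behavior.

```javascript
// Made-up tweet document shaped like the fields the associations block references
var _doc = { metadata: { json: [ {
    actor: { preferredUsername: "FocalCRM" },
    twitter_entities: { hashtags: [ { text: "crm" }, { text: "halo4" } ] }
} ] } };

// Hypothetical path resolver: walks a dotted path under metadata,
// stepping into the first element whenever it meets an array mid-path
function resolveIterateOver(metadata, path) {
    var current = metadata;
    var parts = path.split(".");
    for (var i = 0; i < parts.length; i++) {
        if (Array.isArray(current)) { current = current[0]; }
        if (current == null) { return []; }
        current = current[parts[i]];
    }
    return Array.isArray(current) ? current : [];
}

// One association per hashtag, with the current item exposed as _iterator
var associations = resolveIterateOver(_doc.metadata, "json.twitter_entities.hashtags")
    .map(function (_iterator) {
        return {
            assoc_type: "Event",
            entity1_index: _doc.metadata.json[0].actor.preferredUsername + "/twitterhandle",
            entity2_index: _iterator.text + "/hashtag",
            verb: "tweets_about",
            verb_category: "tweets_about"
        };
    });

console.log(associations);
```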

About Arrays

The code block below demonstrates how a field within a document's metadata might hold items in a typical JSON array object.

 

Code Block
{
    ...
    metadata : {
        cars : [
            {"make" : "Ferrari", "model" : "599", "year" : "2011"},
            {"make" : "Ferrari", "model" : "California", "year" : "2011"},
            {"make" : "Ferrari", "model" : "458 Italia", "year" : "2011"}
        ]
    }
    ...
}

 

The sample JavaScript below demonstrates how to access each field within an array item as the harvester iterates over the array:

Code Block
var make = _iterator.make;
var model = _iterator.model;
var year = _iterator.year;

Passing Values to Scripts via _value:

In the above case, if the script engine is iterating over an array of primitives (e.g. "_doc.metadata.cars == ['Ferrari', 'Alfa Romeo', 'Fiat' ]"), then the values are passed into _value instead of _iterator.

Code Block
var s = _value;
Info

Each time a value is passed into the ScriptEngine, the content of _value is overwritten.
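
The overwriting behavior can be seen in a small sketch where a plain loop plays the role of the harvester. The loop and the seen array are assumptions for the example; only _value and the array contents come from the text above.

```javascript
var cars = ["Ferrari", "Alfa Romeo", "Fiat"];
var seen = [];
var _value;

for (var i = 0; i < cars.length; i++) {
    _value = cars[i];  // each pass replaces the previous content of _value
    var s = _value;    // what a script would read on this pass
    seen.push(s);
}

console.log(seen);
console.log(_value); // only the last item remains once iteration ends
```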

 

Extracting Data from Arrays with _index:

The harvester supports passing an index value into the ScriptEngine that can be used to access a specific item in an array by its index. An example of how the _index variable can be used is shown below:

Code Block
var make = _doc.metadata.cars[_index].make;
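
A short sketch of _index in use, with a loop standing in for the harvester supplying the index. The _doc contents reuse the cars array from the example above; the labels array is only there to collect results for display.

```javascript
var _doc = { metadata: { cars: [
    { make: "Ferrari", model: "599", year: "2011" },
    { make: "Ferrari", model: "California", year: "2011" },
    { make: "Ferrari", model: "458 Italia", year: "2011" }
] } };

var labels = [];
// In the harvester _index is supplied by the system; here the loop sets it
for (var _index = 0; _index < _doc.metadata.cars.length; _index++) {
    var make = _doc.metadata.cars[_index].make;
    var model = _doc.metadata.cars[_index].model;
    labels.push(_index + ": " + make + " " + model);
}

console.log(labels);
```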

 

 

Footnotes:

Legacy documentation:

 

...