Overview

Gives source builders who need to add JavaScript to the enrichment process a single global set of variables and functions that can be used in the individual "scriptlets" that other elements provide. The scripts (and likewise the imports) are executed once per source per harvest cycle.

Format

{
	"display": string,
	"globals": {
		"imports": [ string ], // An optional list of URLs that get imported before the scripts below are run
		"scripts": [ string ], // A list of (java)script code blocks that are evaluated in order (normally only need to specify one)
		"scriptlang": string // Currently only "javascript" is supported
	}
}
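
As a rough illustration of the format above, the sketch below builds a globals element as a JavaScript object and serialises it to JSON. The display text and the function cleanTitle are hypothetical, and "scripts" is written as an array on the assumption that the "list" wording in the comments means a JSON array.

```javascript
// Hypothetical globals element; names and values are illustrative only.
var globalsElement = {
  display: "Shared helper functions",
  globals: {
    // Assumption: "scripts" is an array of code blocks, evaluated in order.
    scripts: [
      "function cleanTitle(title) { return title.replace(/\\s+/g, ' '); }"
    ],
    scriptlang: "javascript" // the only value the format lists as supported
  }
};

// Serialise to the JSON form that would be pasted into a source definition.
var globalsJson = JSON.stringify(globalsElement, null, 2);
console.log(globalsJson);
```

Keeping all shared helpers in a single script block matches the note that normally only one needs to be specified.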

Description

The Infinit.e platform supports scripting the transformation of source data using JavaScript via Rhino, Mozilla's open-source JavaScript implementation (http://www.mozilla.org/rhino/). This document provides an introduction to specifying JavaScript-based data transformation via the Structured Analysis Harvester object.

Note that unless this is turned off in the configuration files (via the "harvest.security" property), the Java security manager prevents JavaScript from doing the following:

"Internal" network access (i.e. to addresses 127.*.*.*, 10.*.*.* or 192.168.*.*)
File access.
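
To make the address patterns above concrete, here is a minimal sketch of a check that matches them. It only mirrors the three patterns listed; how the security manager actually makes its decision is not described here, and the function name isInternalAddress is hypothetical.

```javascript
// Sketch only: mirrors the address patterns listed above, not the platform's actual check.
function isInternalAddress(ip) {
  var parts = ip.split('.');
  if (parts.length !== 4) return false; // only dotted IPv4 is considered here
  return parts[0] === '127' ||
         parts[0] === '10' ||
         (parts[0] === '192' && parts[1] === '168');
}

console.log(isInternalAddress('192.168.1.20'));  // matches 192.168.*.*
console.log(isInternalAddress('93.184.216.34')); // matches none of the patterns
```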

Importing

The Infinit.e Structured Analysis Harvester currently supports importing JavaScript functions in two ways:

specifying a JavaScript code block
a list of URLs of JavaScript locations that can be imported.

JavaScript functions imported via either of the two means described above are passed to the Rhino script engine via the ScriptEngine.eval method, which allows the functions to be called within the scope of the current document being harvested. Examples of imported functions can be found below.

Note: the "script" context does not have access to any of the objects described below (like "_doc"). It can only be used for declaring functions to be used in the entity/association/docGeo scriptlets.
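
Because the "script" context cannot see objects such as "_doc", globally declared functions are easiest to write so that they take everything they need as arguments. The sketch below shows that pattern; the helper names toEntityIndex and stripMillis are hypothetical.

```javascript
// Would be supplied as a code block in the globals "scripts" (hypothetical helper names).
// These functions receive their inputs as arguments and never touch _doc directly.
function toEntityIndex(name, type) {
  return String(name).toLowerCase() + '/' + type;
}

function stripMillis(timestamp) {
  return timestamp.replace(/\.[0-9]{3}Z$/, 'Z');
}

// A scriptlet elsewhere in the source could then pass document values in, e.g.
//   "$SCRIPT( return toEntityIndex(_doc.metadata.json[0].actor.preferredUsername, 'twitterhandle'); )"
console.log(toEntityIndex('FocalCRM', 'twitterhandle'));
```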

Examples

Field Level Transformations

Individual fields from a data source can be transformed using JavaScript by either calling an imported function (described above) or by using inline JavaScript.

Inline JavaScript

Inline JavaScript is enclosed within $SCRIPT( ). During the harvesting process the system:

Extracts the JavaScript code contained within the $SCRIPT() block
Wraps the script with the following generic script block:
```
function getValue() { ... }
```
Passes the function to the ScriptEngine using the ScriptEngine.eval() method
Calls the getValue() method using the .invokeFunction() method.
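
The steps above can be sketched in plain JavaScript as follows. This is a simulation under stated assumptions rather than the harvester's code: new Function stands in for ScriptEngine.eval() followed by invokeFunction(), and the helper names (extractScriptBody, wrapAsGetValue, runScriptlet) and the sample document are hypothetical.

```javascript
// Step 1: pull the code out of a "$SCRIPT( ... )" field value.
function extractScriptBody(fieldValue) {
  var match = /^\$SCRIPT\(([\s\S]*)\)$/.exec(fieldValue.trim());
  return match ? match[1] : null;
}

// Step 2: wrap the extracted code in the generic getValue() function.
function wrapAsGetValue(body) {
  return 'function getValue() { ' + body + ' }';
}

// Steps 3 and 4: evaluate the wrapper, then call getValue().
// new Function is a stand-in for the script engine; _doc is made visible to the scriptlet.
function runScriptlet(fieldValue, _doc) {
  var body = extractScriptBody(fieldValue);
  if (body === null) return fieldValue; // not a $SCRIPT field: left unchanged in this sketch
  var run = new Function('_doc', wrapAsGetValue(body) + '; return getValue();');
  return run(_doc);
}

var sampleDoc = { metadata: { description: ['first', 'second'] } };
var result = runScriptlet('$SCRIPT( return _doc.metadata.description[0]; )', sampleDoc);
console.log(result);
```

Since the script body becomes the body of getValue(), a scriptlet needs an explicit return statement to produce a value.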

Calling Imported JavaScript Functions

Calling functions previously imported into the ScriptEngine is done by enclosing the name of the function to be called within a $FUNC() block, for example in the "title" field. During the harvesting process the harvester:

Extracts the name of the function to execute from the $FUNC() block
Calls the specified function using the .invokeFunction() method.

$FUNC only has meaning when it encloses the entirety of the string. For calling functions inside $SCRIPT blocks (described above), just invoke the function normally.
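
The whole-string rule can be illustrated with a small simulation. It assumes the block holds just a function name, since the exact syntax the harvester accepts is not shown here; getTitle, importedFunctions and resolveFuncField are hypothetical names, and the direct call stands in for invokeFunction().

```javascript
// Hypothetical function that would have been declared via the globals "scripts".
function getTitle() {
  return 'Example title';
}

var importedFunctions = { getTitle: getTitle };

// Assumption: a field is a $FUNC reference only when the block spans the entire string.
function resolveFuncField(fieldValue, functions) {
  var match = /^\$FUNC\(\s*([A-Za-z_$][A-Za-z0-9_$]*)\s*\)$/.exec(fieldValue);
  if (!match || typeof functions[match[1]] !== 'function') {
    return fieldValue; // not a whole-string $FUNC reference: treated as plain text here
  }
  return functions[match[1]](); // stands in for invokeFunction(name)
}

console.log(resolveFuncField('$FUNC(getTitle)', importedFunctions));
console.log(resolveFuncField('Title: $FUNC(getTitle)', importedFunctions));
```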

As Document Metadata iterates over the documents to be harvested, it passes each document to the Rhino ScriptEngine. The JSON-based document passed into the ScriptEngine is then converted into an object via JavaScript's eval() method (i.e. var _doc = eval('document')). Fields within the document are then available to functions and inline scripts using the JavaScript dot or subscript operators as shown below.

var description = _doc.metadata.description[0];
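
For comparison, here is a short sketch of the same field read with both operators. The document content is hypothetical, and JSON.parse is used as a stand-in for the conversion step described above.

```javascript
// Hypothetical document; JSON.parse stands in for the harvester's conversion step.
var documentJson = '{"metadata": {"description": ["A short summary"], "content-type": ["text/plain"]}}';
var _doc = JSON.parse(documentJson);

var viaDot = _doc.metadata.description[0];             // dot operator
var viaSubscript = _doc['metadata']['description'][0]; // subscript operator, same value
var hyphenated = _doc.metadata['content-type'][0];     // subscript is required: the key contains a hyphen

console.log(viaDot, viaSubscript, hyphenated);
```

The subscript form is the only option when a field name is not a valid JavaScript identifier.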

Setting Field Values Example

In the following example, _doc.metadata.json is used in JavaScript in order to set some metadata values for the ingested Twitter data.

        },
        {
            "docMetadata": {
                "title": "$metadata.json.body",
                "description": "$metadata.json.body",
                "fullText": "$metadata.json.body",
                "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
                "geotag": {
                    "lat": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[0];} catch (err) {return '';})",
                    "lon": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[1];} catch (err) {return '';})"
                }
            }
        },
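
To see what the publishedDate and geotag scriptlets above evaluate to, the sketch below runs the same expressions against a hypothetical record. The record content is made up and deliberately has no "geo" field, so the try/catch fallback is exercised.

```javascript
// Hypothetical stand-in for one ingested record; field values are made up.
var _doc = {
  metadata: {
    json: [ { postedTime: '2012-11-08T18:02:02.000Z' } ] // note: no "geo" field
  }
};

// Same body as the publishedDate scriptlet: drops the millisecond part of the timestamp.
function publishedDate() {
  return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/, 'Z');
}

// Same body as the geotag.lat scriptlet: falls back to '' when geo data is missing.
function lat() {
  try { return _doc.metadata.json[0].geo.coordinates[0]; } catch (err) { return ''; }
}

console.log(publishedDate()); // 2012-11-08T18:02:02Z
console.log(lat() === '');    // true
```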

Sample Output:

The following sample output shows the resultant metadata about the Twitter feed.

  }]},
    "modified": "Nov 8, 2012 06:02:44 PM UTC",
    "publishedDate": "Nov 8, 2012 06:02:02 PM UTC",
    "source": ["gnip test"],
    "sourceKey": [".mnt.fileshare.datasift.gnip."],
    "sourceUrl": "file:/mnt/fileshare/datasift/gnip/gnip.json",
    "tags": [
        "twitter",
        "gnip"
    ],
    "title": "Amex Teams With Halo 4 on Master Chief Incentives http://t.co/IvwmjJyV #crm",
    "url": "http://twitter.com/FocalCRM/statuses/266601489475186688"
}

Complex Arrays

When document metadata iterates over a JSON array, each item in the array is passed into the ScriptEngine and is made accessible via an object named _iterator.

Example With Associations

The following associations block shows iterateOver being used to specify the array "json.twitter_entities.hashtags". This enables data to be extracted from each item of the JSON array to create metadata.

             },
             {
                 "assoc_type": "Event",
                 "entity1_index": "$SCRIPT( return _doc.metadata.json[0].actor.preferredUsername + '/twitterhandle';)",
                 "entity2_index": "$SCRIPT( return _iterator.text + '/hashtag'; )",
                 "iterateOver": "json.twitter_entities.hashtags",
                 "verb": "tweets_about",
                 "verb_category": "tweets_about"
             },
             {
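
The effect of the association block above can be simulated with an ordinary loop. This is a sketch under assumptions, not the harvester's implementation: the record content is made up, and "json.twitter_entities.hashtags" is assumed to resolve to the hashtags array of the first json record.

```javascript
// Hypothetical record with two hashtags.
var _doc = {
  metadata: {
    json: [ {
      actor: { preferredUsername: 'FocalCRM' },
      twitter_entities: { hashtags: [ { text: 'crm' }, { text: 'halo4' } ] }
    } ]
  }
};

// Assumption: this is the array that the iterateOver path points at.
var hashtags = _doc.metadata.json[0].twitter_entities.hashtags;

var associations = [];
for (var i = 0; i < hashtags.length; i++) {
  var _iterator = hashtags[i]; // the current array item, as exposed to the scriptlets
  associations.push({
    assoc_type: 'Event',
    entity1_index: _doc.metadata.json[0].actor.preferredUsername + '/twitterhandle',
    entity2_index: _iterator.text + '/hashtag',
    verb: 'tweets_about',
    verb_category: 'tweets_about'
  });
}

console.log(JSON.stringify(associations, null, 2));
```

One association is produced per array item, with _doc still available for values that are the same for every item.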

About Arrays

The code block below demonstrates how a field within a document's metadata might hold items in a typical JSON array object.

{
    ...
    "metadata" : {
        "cars" : [
            {"make" : "Ferrari", "model" : "599", "year" : "2011"},
            {"make" : "Ferrari", "model" : "California", "year" : "2011"},
            {"make" : "Ferrari", "model" : "458 Italia", "year" : "2011"}
        ]
    }
    ...
}

The sample JavaScript below demonstrates how to access each field within an array item as the harvester iterates over the array:

var make = _iterator.make;
var model = _iterator.model;
var year = _iterator.year;

Passing Values to Scripts via _value:

In the above case, if the script engine is iterating over an array of primitives (e.g. "_doc.metadata.cars == ['Ferrari', 'Alfa Romeo', 'Fiat' ]"), then the values are passed into _value instead of _iterator.

var s = _value;

Each time a value is passed into the ScriptEngine, the content of _value is overwritten.
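
A small simulation of this behaviour, using the array of primitives from the example above. The loop stands in for the harvester's iteration, and the variable seen is hypothetical.

```javascript
var cars = ['Ferrari', 'Alfa Romeo', 'Fiat'];

var _value;    // reassigned on every pass, as described above
var seen = [];
for (var i = 0; i < cars.length; i++) {
  _value = cars[i];           // the previous content is replaced
  var s = _value;             // what a scriptlet would read on this pass
  seen.push(s.toUpperCase());
}

console.log(seen.join(', ')); // FERRARI, ALFA ROMEO, FIAT
console.log(_value);          // Fiat (only the last value remains)
```

A scriptlet therefore has to use _value during the pass in which it is set, since earlier values are not retained.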

Extracting Data from Arrays with _index:

The harvester supports passing an index value into the ScriptEngine that can be used to access a specific item in an array by its index. An example of how the _index variable can be used is shown below:

var make = _doc.metadata.cars[_index].make;
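
As a sketch of where _index is useful, the loop below uses it both to read the current item and to record the item's position. The loop stands in for the harvester's iteration; the assumption that _index is zero-based and the variable labels are not taken from the platform.

```javascript
var _doc = {
  metadata: {
    cars: [
      { make: 'Ferrari', model: '599', year: '2011' },
      { make: 'Ferrari', model: 'California', year: '2011' },
      { make: 'Ferrari', model: '458 Italia', year: '2011' }
    ]
  }
};

var labels = [];
// Assumption: _index is zero-based and advances once per array item.
for (var _index = 0; _index < _doc.metadata.cars.length; _index++) {
  var car = _doc.metadata.cars[_index];
  labels.push((_index + 1) + ' of ' + _doc.metadata.cars.length + ': ' + car.make + ' ' + car.model);
}

console.log(labels.join('\n'));
```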

Footnotes:

Legacy documentation, replaces the following:

StructuredAnalysis object (replaces script and scriptFiles)
UnstructuredAnalysis object (replaces script)
Feed object (replaces searchConfig.globals)

Legacy documentation:

Transforming data with JavaScript

Javascript globals