Transforming data with JavaScript

Overview

The Infinit.e platform supports scripting the transformation of source data using JavaScript via Rhino, Mozilla's open-source JavaScript implementation (http://www.mozilla.org/rhino/). This document provides an introduction to specifying JavaScript-based data transformation via the Structured Analysis Harvester object.

Note that unless turned off in the configuration files (via the "harvest.security" property), JavaScript is prevented by the Java security manager from doing the following:

  • "Internal" network access (i.e. to addresses 127.*.*.*, 10.*.*.* or 192.168.*.*)
  • File access.

Specifying the ScriptEngine

As noted above, the beta version of Infinit.e currently supports data transformation via JavaScript functions. To enable the use of JavaScript, include the scriptEngine key/value pair within the Structured Analysis object:

Source.structuredAnalysis object
"structuredAnalysis" : {
    ...
    "scriptEngine" : "JavaScript",
    ...
}

Note: Although scripting support within Infinit.e is currently limited to JavaScript, support for other languages may be added in the future.

Importing JavaScript Functions

The Infinit.e Structured Analysis Harvester currently supports two ways of importing JavaScript functions:

  • Including the functions in the script key/value pair (see below for an example).
  • Referencing external JavaScript files that the Infinit.e Structured Analysis Harvester can access via URL (see the scriptFiles key/value pair below for an example).

    Source.structuredAnalysis object
    "structuredAnalysis" : {
        ...
        "script" : "function getEventType() { var s = _value; ...; return s; }",
        "scriptFiles" : [ "http://localhost/script1.js", "http://localhost/script2.js" ],
        ...
    }

JavaScript functions imported via either of the two means described above are passed to the Rhino script engine via the ScriptEngine.eval method, which allows the functions to be called within the scope of the current document being harvested. Examples of imported functions can be found below.

Note: the "script" context does not have access to any of the objects described below (like "_doc"). It can only be used for declaring functions to be used in the entity/association/docGeo scriptlets.
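
As a rough illustration, a file such as script1.js might contain nothing but function declarations like the ones below. The function bodies and the "title" field are assumptions made for this sketch, and the stand-in _doc and _value variables exist only so the snippet runs outside the harvester, which would normally supply them when the functions are called.

```javascript
// Stand-ins for the objects the harvester would supply at call time
// (assumed shapes, for illustration only).
var _doc = { title: "  Ferrari unveils new model ", metadata: { description: ["A short description"] } };
var _value = "product launch";

// Hypothetical imported function: returns the document title without
// leading or trailing whitespace.
function getDocumentTitle() {
    return String(_doc.title).trim();
}

// Hypothetical imported function: capitalizes the current value so it can
// be used as an event type.
function getEventType() {
    var s = String(_value);
    return s.charAt(0).toUpperCase() + s.slice(1);
}
```

Because the functions only read _doc and _value when they are called, declaring them in the "script" context is compatible with the restriction in the note above.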

Accessing Document Data via the _doc Object

As the Structured Analysis Harvester iterates over the documents to be harvested, it passes each document to the ScriptEngine via the ScriptEngine.put method (provided a ScriptEngine has been instantiated; see Specifying the ScriptEngine above). The JSON-based document passed into the ScriptEngine is then converted into an object via JavaScript's eval() function (i.e. var _doc = eval('(' + document + ')')). Fields within the document are then available to functions and inline scripts using the JavaScript dot or subscript operators, as shown below:

var description = _doc.metadata.description[0];

Note that _doc is persistent across all entity/association script calls, and can therefore be used to store temporary variables, for example for deduplication:

"$SCRIPT( if (null == _doc.dedupMap) _doc.dedupMap = new Object(); /* ... */ if (_doc.dedupMap[val] == null) { _doc.dedupMap[val] = 1; return val; } else return null; )"

Note that "_doc.metadata.FIELDNAME[0].*" is equivalent to "$metadata.FIELDNAME.*". The [0] disappears because the $ method treats an array as an object equal to its first element; to access other elements, the "$SCRIPT" technique must be used.
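
The deduplication scriptlet above can be easier to read when written out as an imported function. The following is a minimal sketch under that assumption: the name dedup is hypothetical, and the stand-in _doc only exists so the snippet can run outside the harvester.

```javascript
// Stand-in for the harvester-supplied document (assumed shape).
var _doc = { metadata: {} };

// Hypothetical helper: returns val the first time it is seen for the
// current document, and null on every later call with the same val.
function dedup(val) {
    if (null == _doc.dedupMap) {
        _doc.dedupMap = {};   // created once, then reused because _doc persists
    }
    // hasOwnProperty avoids false matches on inherited names like "constructor"
    if (!Object.prototype.hasOwnProperty.call(_doc.dedupMap, val)) {
        _doc.dedupMap[val] = 1;
        return val;
    }
    return null;
}
```

A scriptlet could then call it in the usual way, for instance "$SCRIPT( return dedup(_iterator.make); )".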

Basic Field Level Transformations

Individual fields from a data source can be transformed using JavaScript by either calling an imported function (described above) or by using inline JavaScript.

Source.structuredAnalysis object
"structuredAnalysis" : {
    ...
    "title" : "$FUNC( getDocumentTitle(); )",
    "description" : "$SCRIPT( return 'Description: ' + _doc.metadata.description[0];)",
    ...
}

Inline JavaScript

In the example JSON above the "description" field contains inline JavaScript enclosed within $SCRIPT( ). During the harvesting process the Structured Analysis Harvester:

  1. Extracts the JavaScript code contained within the $SCRIPT() block
  2. Wraps the script with the following generic script block:

    function getValue() { ... }
    
  3. Passes the function to the ScriptEngine using the ScriptEngine.eval() method
  4. Calls the getValue() method using the .invokeFunction() method.
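
The four steps above can be imitated in plain JavaScript, which may help when debugging a scriptlet outside the platform. This is only a sketch: the harvester performs these steps from Java through the ScriptEngine, whereas here JavaScript's own eval stands in for it, and the string handling used to extract the script body is an assumption.

```javascript
// Stand-in for the harvester-supplied document (assumed shape).
var _doc = { metadata: { description: ["Three new cars announced"] } };

// The field value as it would appear in the source configuration.
var fieldValue = "$SCRIPT( return 'Description: ' + _doc.metadata.description[0];)";

// Step 1: extract the code between "$SCRIPT(" and the final ")".
var body = fieldValue.substring("$SCRIPT(".length, fieldValue.lastIndexOf(")"));

// Step 2: wrap it in the generic getValue() function.
var wrapped = "function getValue() { " + body + " }";

// Steps 3 and 4: evaluate the wrapper, then call the resulting function.
// The extra parentheses make eval return the function itself.
var getValue = eval("(" + wrapped + ")");
var description = getValue();
```

One practical consequence of the wrapping is that an inline script needs an explicit return statement, since its text becomes the body of a function.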

Calling Imported JavaScript Functions

Calling functions previously imported into the ScriptEngine is done by enclosing the function name to be called within the $FUNC() block as shown above in the "title" field. During the harvesting process the harvester:

  1. Extracts the name of the function to execute from the $FUNC() block
  2. Calls the specified function using the .invokeFunction() method.

Note: $FUNC only has meaning when it encloses the entirety of the string. For calling functions inside $SCRIPT blocks (described above), just invoke the function normally.
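
To illustrate the note above, the sketch below shows what a scriptlet such as "$SCRIPT( return 'Title: ' + getDocumentTitle(); )" amounts to once it has been wrapped in getValue(). The imported function's body and the "title" field are assumptions made for the example.

```javascript
// Stand-in for the harvester-supplied document (assumed shape).
var _doc = { title: "Ferrari unveils new model" };

// Hypothetical imported function, declared via "script" or "scriptFiles".
function getDocumentTitle() {
    return _doc.title;
}

// What the wrapped $SCRIPT block boils down to: the imported function is
// invoked like any other JavaScript function, with no $FUNC involved.
function getValue() {
    return 'Title: ' + getDocumentTitle();
}

var title = getValue();
```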

Using the _iterator Object

When specifying entities or events to create from source data, it is possible to specify that data be extracted from JSON arrays within the metadata field using the iterateOver field (see Specifying Entities and Specifying Events for more information). When the harvester iterates over a JSON array, each item in the array is passed into the ScriptEngine and made accessible via an object named _iterator.

The code block below demonstrates how a field within a document's metadata might hold items in a typical JSON array object.

{
    ...
    metadata : {
        cars : [
            {"make" : "Ferrari", "model" : "599", "year" : "2011"},
            {"make" : "Ferrari", "model" : "California", "year" : "2011"},
            {"make" : "Ferrari", "model" : "458 Italia", "year" : "2011"}
        ]
    }
    ...
}

The sample JavaScript below demonstrates how to access each field within an array item as the harvester iterates over the array:

var make = _iterator.make;
var model = _iterator.model;
var year = _iterator.year;
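
The iteration can be imitated outside the harvester to test a function that relies on _iterator. In the sketch below the for loop stands in for the harvester, and getCarName is a hypothetical imported function.

```javascript
// Sample array, as it might appear in _doc.metadata.cars.
var cars = [
    { make: "Ferrari", model: "599", year: "2011" },
    { make: "Ferrari", model: "California", year: "2011" },
    { make: "Ferrari", model: "458 Italia", year: "2011" }
];

var _iterator;   // set by the harvester for each array item

// Hypothetical imported function: builds a display name for the current item.
function getCarName() {
    return _iterator.year + " " + _iterator.make + " " + _iterator.model;
}

// Stand-in for the harvester loop: one call per array item.
var names = [];
for (var i = 0; i < cars.length; i++) {
    _iterator = cars[i];
    names.push(getCarName());
}
```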

Passing Values to Scripts via _value

In the above case, if the script engine is iterating over an array of primitives (e.g. "_doc.metadata.cars == ['Ferrari', 'Alfa Romeo', 'Fiat' ]"), then each value is passed into _value instead of _iterator.

var s = _value;

Note: Each time a value is passed into the ScriptEngine, the content of _value is overwritten.
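
The overwriting behaviour matters if a script tries to remember earlier values. The following sketch imitates the harvester with a plain loop (an assumption made so the snippet can run on its own); getMakeLabel is a hypothetical function.

```javascript
// Array of primitives, as in the example above.
var makes = ["Ferrari", "Alfa Romeo", "Fiat"];

var _value;   // set by the harvester for each primitive

// Hypothetical function that uses the current value.
function getMakeLabel() {
    return "Make: " + _value;
}

// Stand-in for the harvester loop.
var labels = [];
for (var i = 0; i < makes.length; i++) {
    _value = makes[i];          // the previous content of _value is lost here
    labels.push(getMakeLabel());
}
// After the loop, _value holds only the last item that was passed in.
```

A script that needs earlier values therefore has to store them itself, for example on _doc.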

Extracting Data from Arrays with _index

The harvester supports passing an index value into the ScriptEngine that can be used to access a specific item in an array by its index. An example of how the _index variable can be used is shown below:

var make = _doc.metadata.cars[_index].make;
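
One possible use of _index is to read from a second array that runs parallel to the one being iterated over. The sketch below assumes that _index holds the position of the current item; the "colors" field and the function name are hypothetical, and the loop stands in for the harvester.

```javascript
// Stand-in document with two parallel arrays (the "colors" field is
// invented for this example).
var _doc = {
    metadata: {
        cars: [
            { make: "Ferrari", model: "599" },
            { make: "Ferrari", model: "California" }
        ],
        colors: ["red", "white"]
    }
};

var _index;   // set by the harvester

// Hypothetical function: combines the current car with the color stored
// at the same position in the parallel array.
function getCarWithColor() {
    var car = _doc.metadata.cars[_index];
    return car.model + " (" + _doc.metadata.colors[_index] + ")";
}

// Stand-in for the harvester loop.
var results = [];
for (var i = 0; i < _doc.metadata.cars.length; i++) {
    _index = i;
    results.push(getCarWithColor());
}
```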

JavaScript Error Messages

If the ScriptEngine encounters errors when executing a script, the Structured Analysis Harvester traps each error and stores it in memory until all of the documents for a given source have been harvested. When a source has been harvested, the harvester updates the Harvest object within the Source document (harvest.harvested, harvest.harvest_status, and harvest.harvest_message fields). If the harvester encountered JavaScript errors, it writes the top five errors to the harvest.harvest_message field, including the number of times each error was encountered.

Note: Field level JavaScript errors will not prevent a document or source from being harvested.
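
To give a feel for what ends up in harvest.harvest_message, the sketch below counts repeated error messages and keeps the most frequent ones. It is not the harvester's implementation: the function name, the output format, and the reading of "top" as "most frequent" are all assumptions.

```javascript
// Hypothetical summary: counts identical messages and returns the `limit`
// most frequent ones, each followed by its count.
function summarizeErrors(messages, limit) {
    var counts = Object.create(null);   // no inherited keys to collide with
    messages.forEach(function (m) {
        counts[m] = (counts[m] || 0) + 1;
    });
    return Object.keys(counts)
        .sort(function (a, b) { return counts[b] - counts[a]; })   // most frequent first
        .slice(0, limit)
        .map(function (m) { return m + " (" + counts[m] + ")"; })
        .join("; ");
}

// Example input: errors trapped while harvesting one source.
var summary = summarizeErrors([
    "TypeError: description is undefined",
    "ReferenceError: foo is not defined",
    "TypeError: description is undefined"
], 5);
```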

Creation Criteria Scripts

Both entity and association specification objects provide a field called "creationCriteriaScript". This must be JavaScript (though you still need to set the engine and enclose in either $SCRIPT or $FUNC), and you can return one of two things from it:

  • A boolean, in which case the entity object is only created if the returned value is true.
  • A string, in which case any non-null string is treated like a boolean false, and in addition the string is logged as an error that can be accessed from the "harvest.harvest_message" field of sources.

The creation criteria script is executed before any other scripts in the specification object.
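
A criteria function might look like the sketch below, which could be referenced as "creationCriteriaScript": "$FUNC( carCreationCriteria(); )". The function names and the filtering rule are hypothetical, and shouldCreate is only an illustration of how the two return types described above are interpreted (assuming a boolean true means "create"); it is not the harvester's code.

```javascript
// Stand-in for the current array item (assumed shape).
var _iterator = { make: "Ferrari", model: "599", year: "2011" };

// Hypothetical criteria function for a car entity.
function carCreationCriteria() {
    if (!_iterator.make) {
        return "car entry has no make";   // string: skip the entity and log the message
    }
    return _iterator.year === "2011";     // boolean: create only when true
}

// Illustration of the interpretation rules: a string counts as false,
// a boolean is used as-is.
function shouldCreate(result) {
    if (typeof result === "string") {
        return false;
    }
    return result === true;
}

var created = shouldCreate(carCreationCriteria());
```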

Lookup Tables in the Structured Analysis Handler

It is possible to add lookup tables from JSON shares that can be used in all the JavaScript scripts in the structured analysis handler (and also the unstructured analysis handler).

These lookup tables provide a limited form of aliasing at harvest time (also check out the full query-time aliasing capability), and are useful in many other cases where a potentially large and dynamic lookup table is needed.

Using the lookup technology is easy:

  • At the top level of the "structuredAnalysis" object, create a "caches" object that consists of the following:

"structuredAnalysis": {
	"caches": {
		"myLookupTable": "4e0c7e99eb5af0fbdcfbf697"
	}
}
  • Then within any script in the "structuredAnalysis", you can access the JSON object by indexing the global variable "_cache" with the local name specified as above. For example, say the following JSON object has been uploaded:

{
	//...
	"US": "United States", "USA": "United States", "United States of America": "United States",
	"UK": "United Kingdom", "Great Britain": "United Kingdom", "GB": "United Kingdom",
	//...
}

Then the lookup table could be used as follows:

{
	"structuredAnalysis": {
		// (caches object specified as above)
		//...
		"entities": [
			//...
			{
				"iterateOver": "geo.countries",
				"disambiguatedName": "$SCRIPT( return _cache['myLookupTable'][ _value ];)",
				"type": "Country"
			}
		],
		//...
	}
}
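
When a value has no entry in the table, the scriptlet above returns undefined. The sketch below shows one way a function could fall back to the raw value instead. The table contents, the function name, and the fallback rule are assumptions, and the stand-in _cache and _value variables only exist so the snippet can run outside the harvester.

```javascript
// Stand-in for the harvester-supplied cache (assumed contents).
var _cache = {
    myLookupTable: {
        "US": "United States",
        "USA": "United States",
        "GB": "United Kingdom"
    }
};

var _value;   // set by the harvester for each item in geo.countries

// Hypothetical function: returns the table entry for the current value,
// or the value itself when the table has no entry for it.
function lookupCountry() {
    var table = _cache["myLookupTable"];
    if (Object.prototype.hasOwnProperty.call(table, _value)) {
        return table[_value];
    }
    return _value;
}
```

It could then be used as "disambiguatedName": "$FUNC( lookupCountry(); )".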

Further Reading