Javascript Engine for Plugin Manager
Javascript Prototype Engine - basic operation
The Plugin Manager includes a built-in processing module, the "Javascript Prototype Engine". This module runs the javascript entered into the "user arguments" text field.
To start off, select the "HadoopJavascriptTemplate", then "Copy Current Plugin" and change the title.
The basic format of the javascript is a standard map/combine/reduce template, and must be as follows:
function map(key, val) {
    var key = // write code to obtain a key JSON object (note: must be an object, not just a string)
    var val = // write code to obtain a value JSON object
    emit(key, val); // (can be called multiple times with different keys)
}

function combine(key, vals) {
    var aggregated_object = null;
    for (x in vals) {
        var val = vals[x];
        // combine val with aggregated_object
    }
    emit(key, aggregated_object); // (can be called multiple times with different keys - normally just called once with the same key though)
}

function reduce(key, vals) {
    var aggregated_object = null;
    for (x in vals) {
        var val = vals[x];
        // combine val with aggregated_object
    }
    emit(key, aggregated_object); // (can be called just once, the key must match the input)
}
The map function takes the following parameters:
- key is the _id of the object described below
- val is an object from the input collection (eg a document if from "document metadata", an entity if from "aggregated entity", an association if from "aggregated association" etc)
The (optional) combine function takes the following parameters:
- key: the key emitted from the map functions
- vals: an unordered list of "val" objects emitted from the map functions with the same key
The (optional) reduce function takes the following parameters:
- key: the key emitted from the map/combine functions
- vals: an unordered list of "val" objects emitted from the map/combine functions with the same key
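Putting these together, the following is a minimal sketch that fills in the template - it assumes the input collection is document metadata and simply counts entity mentions by entity type (the "entities" array and its "type" field are the same fields used in the examples at the bottom of this page):
function map(key, val) {
    if (null != val.entities) {
        for (x in val.entities) {
            // one count per entity mention, keyed by the entity type
            emit({ type: val.entities[x].type }, { count: 1 });
        }
    }
}
function reduce(key, vals) {
    var total = { count: 0 };
    for (x in vals) {
        total.count += vals[x].count;
    }
    emit(key, total);
}
combine = reduce;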
This code is distributed using Hadoop in a standard way, ie as follows:
- Based on the query and input collection a set of mappers are distributed across the Hadoop cluster, with each mapper being assigned a slice of the records matching the query
- The map function is invoked for each record; the val objects output via the emit function are grouped by key.
- At the end of the map, the combine function is called on the (key, group-of-val objects) pairs ... the output from the emit function invocations is sent to the available reducers (normally just one)
- The reduce function is invoked for each received (key, group-of-val objects) pair; the output from emit is written to the output collection.
The combine and reduce functions are normally the same, allowing the following shortcut:
// delete the combine function block and...
combine = reduce;
Note that the combine function is normally present only for performance and can be removed without loss of functionality. If the reduce function is removed then the output of the map is written directly to the output collection (/HDFS directory) - if multiple objects with the same key are emitted from the same map function, only one is preserved. Removing either the combine or reduce function must be accompanied by commenting out the corresponding Combiner/Reducer class in the Plugin Manager GUI.
Javascript Prototype Engine - advanced
Accessing the Hadoop context object
This is not currently possible. The user query can, however, be accessed as a string called "_query", and can therefore be converted to JSON with eg "queryJson = eval('(' + _query + ')');".
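For example, a minimal sketch of pulling the query into a JSON object (the "qt" field below is purely illustrative of accessing a field in the parsed object):
var queryJson = eval('(' + _query + ')');
// fields of the parsed query can then be accessed as normal javascript, eg (field name illustrative only):
var queryTerms = queryJson.qt;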
Calling java from javascript
Note that many java functions can be called from within javascript to extend the limited built-in functionality. Examples are provided below. The following relevant JARs are available:
- The platform data model library.
- commons-logging, commons-httpclient
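As a minimal illustration of the Rhino syntax for calling into Java (the classes below are just standard JDK classes, not from the libraries above):
// Fully qualified Java classes can be instantiated and called directly:
var sdf = new java.text.SimpleDateFormat("yyyy-MM-dd");
var formatted = sdf.format(new java.util.Date()); // returns a Java String
var jsString = String(formatted); // convert to a javascript string if needed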
Date parsing
Due to the slightly non-standard date storage in MongoDB, together with the limitations of the built-in Date function in Rhino (the javascript engine built into Java), Community Edition/MongoDB formatted dates must be parsed in a slightly convoluted way:
// For just the date
var date = new java.text.SimpleDateFormat('yyyy-MM-dd').parse(val.publishedDate["$date"]);
// (or val.created or val.modified, or val.associations[x].time_start, etc)

// For date time:
var datetime = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss").parse(val.publishedDate["$date"]);
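For example (this is essentially what the temporal aggregation example at the bottom of this page does), the parsed date can then be used to build a per-day key:
var date = new java.text.SimpleDateFormat('yyyy-MM-dd').parse(val.publishedDate["$date"]);
emit({ 'day': date.getTime() }, { count: 1 }); // getTime() returns epoch milliseconds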
Writing javascript Date type to Mongo as Mongo ISODate
Due to the slightly non-standard date storage in MongoDB, together with the limitations of the built-in Date function in Rhino (the javascript engine built into Java), javascript dates must be prepared for being written into Community Edition/MongoDB in a slightly convoluted way:
// current datetime
var dateString = new Date().toISOString();
var mongoDate = {};
mongoDate['$date'] = dateString;
// now it will be stored as a proper Mongo ISODate when emitted:
emit(key, {..., 'publishedDate': mongoDate, ...});
Logging
Messages can be logged and are stored in the status message as follows (note that unless run in debug mode, only ERROR category messages are retrieved).
// Put this outside the map/combine/reduce functions, ie in global space
var logger = org.apache.log4j.Logger.getLogger("path.to.package"); // eg "com.ikanow.infinit.e.utility.hadoop.HadoopPrototypingTool"

// Within the map/combine/reduce functions it can be called like:
logger.info("Info message!"); // (discarded unless in debug mode)
logger.error("Error message!!"); // (retrieved in local or debug mode)
Note that even error messages cannot currently be retrieved on full Hadoop clusters - debug mode must be used in that case even for ERROR messages.
Note also that logged messages are written to the internal Hadoop log files - it is therefore recommended to have a debug flag and turn them off for operational deployments, eg:
// Put this outside the map/combine/reduce functions, ie in global space
var logger = org.apache.log4j.Logger.getLogger("path.to.package"); // eg "com.ikanow.infinit.e.utility.hadoop.HadoopPrototypingTool"
var debug = false;

// Within the map/combine/reduce functions it can be called like:
if (debug) logger.info("Info message!"); // (discarded unless in debug mode)
Converting JSON to strings
For logging, it is often useful to output an entire JSON object as a string. Unfortunately the standard "JSON.stringify" function is not available in the version of Rhino currently used, so this is in general not currently possible. However, the string representation of the "val" parameter can be accessed via the globals "_map_input_value", "_combine_input_values" or "_reduce_input_values", and the string representations of the key objects in the mapper/combiner/reducer can be accessed via "_map_input_key", "_combine_input_key" and "_reduce_input_key".
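For example, the map function can log its raw input using the logger described above - a minimal sketch:
// Inside the map function (ERROR level so it is retrieved outside debug mode):
logger.error("map key: " + _map_input_key + ", map value: " + _map_input_value);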
Accessing the results of other custom jobs
For any jobs declared in the query using the "$caches" modifier (described elsewhere in the documentation), records from those other jobs can be accessed via:
var firstRecord = _custom['JOB_TITLE_OR_ID'].get(); // grabs the first element in the job
var record = _custom['JOB_TITLE_OR_ID'].get('KEY'); // grabs the element with key matching the string parameter ('KEY' here for example)
There are a couple of limitations on the get function described above:
- the key must be of string type, or
- an object containing a single (consistent) string value, eg {"key":"value"} is fine, {"key":"value","other_key":2} is not.
USEFUL HINT: if "_custom['JOB_TITLE_OR_ID'].get('KEY')" is returning null unexpectedly (eg for an element you know exists), then try emitting ".get()" (ie the first element, as above) to check that you are looking at the right table.
NOTE: the lookup may not work if the key object can contain multiple possible fields, even if only one is set at a time, eg if the first element is { "key": { "field1":"value"}, /*...*/ } and the second element is { "key": { "field2":"value2"}, /*...*/ } then get("value2") will not work.
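Putting the hint above into practice, a minimal sketch of a defensive lookup inside the map function ('JOB_TITLE_OR_ID' and 'KEY' are placeholders):
var rec = _custom['JOB_TITLE_OR_ID'].get('KEY');
if (null == rec) {
    // Per the hint above, emit the first element to check the right table is being read
    emit(key, _custom['JOB_TITLE_OR_ID'].get());
} else {
    // ... use rec as normal
}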
Aliasing
Currently document/entity objects accessed are not aliased. (Since the data model JAR is available, in theory the driver's getAliases call is available; in practice this would be a fair bit of work to integrate.)
It is on the roadmap to support aliasing within the Plugin framework, at which point it will be available in the javascript engine also.
Performance mode
If the global Javascript variable "_memoryOptimization" is set to true, eg:
// Start of JS code
_memoryOptimization = true; // default false
// Rest of JS code
//...
Then the data is written to and read from the Hadoop framework record-by-record instead of in bulk. This is slightly slower but much more memory efficient (the default Hadoop set-up only allows about 1.5GB of data, and the BSON->JSON multiplier is about x10), and should therefore be used whenever large numbers of records are being processed.
Note that if using "_memoryOptimization" the "vals" parameter passed into the reducer is a collection, not an array; this means that code constructs like:
for (var i in vals) {
    var val = vals[i];
    //etc
}
should be replaced by:
while (vals.hasNext()) {
    var val = vals.next();
    //etc
}
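For example, a minimal reduce function written for "_memoryOptimization" mode (the "count" field is just an assumed output of the corresponding map function):
function reduce(key, vals) {
    var total = { count: 0 };
    while (vals.hasNext()) {
        var val = vals.next();
        total.count += val.count; // assumes the map function emitted objects with a 'count' field
    }
    emit(key, total);
}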
Security considerations
Administrators have no security constraints on what operations can be performed within the javascript. Non-administrators (if the system-wide configuration parameter "harvest.security" is set) cannot access local files or ports, and are subject to other standard sandboxing restrictions.
Examples
Geo aggregation example
The following is a simple example of using the Hadoop JS interface to aggregate the sentiment associated with geo-tagged documents:
function map(key, val) {
    var label_lat = Math.round(val.docGeo.lat/10)*10;
    var label_lon = Math.round(val.docGeo.lon/10)*10;
    var label = label_lat.toString() + ':' + label_lon.toString();
    for (ent_i in val.entities) {
        var ent = val.entities[ent_i];
        if ((null != ent.sentiment) && (ent.type == "Keyword")) {
            emit({ label: label, label_lat: label_lat, label_lon: label_lon },
                 { sentiment: ent.sentiment, count: 1 });
        }
    }
}

function reduce(key, vals) {
    var retval = { sentiment: 0.0, count: 0 };
    for (x in vals) {
        retval.sentiment += parseFloat(vals[x].sentiment);
        retval.count += parseInt(vals[x].count);
    }
    retval.geo = {};
    retval.geo.lat = key.label_lat;
    retval.geo.lon = key.label_lon;
    emit(key, retval);
}

combine = reduce;
Temporal aggregation example
The following is a simple example of using the Hadoop JS interface to aggregate the sentiment across time:
function map(key, val) {
    // Get sentiment:
    var plusSentiment = 0.0;
    var negSentiment = 0.0;
    var sentimentCnt = 0;
    if (null != val.entities) {
        for (x in val.entities) {
            var entity = val.entities[x];
            if (null != entity.sentiment) {
                if (entity.sentiment > 0) {
                    sentimentCnt++;
                    plusSentiment += entity.sentiment;
                }
                if (entity.sentiment < 0) {
                    sentimentCnt++;
                    negSentiment += entity.sentiment;
                }
            }
        }
    }
    if (sentimentCnt > 0) {
        var date = new java.text.SimpleDateFormat('yyyy-MM-dd').parse(val.publishedDate["$date"]);
        emit( {'day': date.getTime()}, { plus: plusSentiment, minus: negSentiment, cnt: sentimentCnt } );
    }
}

function reduce(key, vals) {
    var initVal = { plus: 0.0, minus: 0.0, cnt: 0, net: 0.0, netavg: 0.0 };
    for (x in vals) {
        initVal.plus += vals[x].plus;
        initVal.minus += vals[x].minus;
        initVal.cnt += vals[x].cnt;
    }
    initVal.net = initVal.plus + initVal.minus;
    initVal.netavg = initVal.net / initVal.cnt;
    initVal.date = new Date(key.day).toString();
    emit( key, initVal );
}

combine = reduce;