Using Javascript
Overview
Some of the pipeline processing elements let you use JavaScript to obtain metadata, create entities and associations, and perform various other operations.
For example, the following elements can make use of JavaScript:
- docMetadata
- contentMetadata
- text extraction
- criteria
- Follow Web Links/split
For the most part, examples involving JavaScript are described on the individual pages that define these elements.
This page sets out some general guidelines and best practices for using JavaScript.
About The Javascript Framework
The Infinit.e platform supports scripting the transformation of source data using JavaScript via Rhino, Mozilla's open-source JavaScript implementation (http://www.mozilla.org/rhino/).
Note that unless this is turned off in the configuration files (via the "harvest.security" property), the Java security manager prevents JavaScript from doing the following:
- "Internal" network access (i.e. to addresses 127.*.*.*, 10.*.*.* or 192.168.*.*)
- File access.
Importing
The Infinit.e Structured Analysis Harvester currently supports importing JavaScript functions in two ways:
- specifying a JavaScript code block
- specifying a list of URLs of JavaScript files to import.
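As a rough illustration of the first option, a code block of the following kind could define a helper that individual fields reuse. This is only a sketch: the function name getCity is hypothetical, the metadata path is borrowed from the _doc.metadata example on this page, and the _doc object is stubbed so that the snippet can run outside the harvester.

```javascript
// Stand-in for the document object that the harvester would normally supply.
// (Assumption: the real _doc exposes metadata in this shape, as in the
// _doc.metadata example on this page.)
var _doc = {
    metadata: {
        location: [ { citystateprovince: { city: "Boston" } } ]
    }
};

// Hypothetical importable helper: returns the first city found in the
// document metadata, or null if any part of the path is missing.
function getCity() {
    var locations = _doc.metadata && _doc.metadata.location;
    if (!locations || locations.length === 0) {
        return null;
    }
    var place = locations[0].citystateprovince;
    return (place && place.city) ? place.city : null;
}

var city = getCity();
```

Guarding each step of the path keeps a single document with missing metadata from raising a script error. Once imported, a helper like this would presumably be called through the $FUNC convention rather than repeating the lookup inline.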
Javascript Use Cases
In general terms, the use of JavaScript in Infinit.e falls into several major categories:
- obtaining metadata
- creating entities and associations
- using JavaScript for criteria and other miscellaneous scenarios
Obtaining Metadata
When data is ingested into Infinit.e it is converted into documents. The various elements of the processing pipeline can then act on these documents to get metadata. Metadata objects can then be made available to functions and inline scripts.
You can configure how you get the metadata out of the text or metadata by setting script flags. For example, you can receive the metadata as _doc, as _metadata, or as full text.
Also, when iterating over a JSON array, each item in the array is passed into the ScriptEngine and made accessible via an object named _iterator.
Examples
_doc.metadata
"city": "$SCRIPT( return _doc.metadata.location[0].citystateprovince.city; )",
_metadata
"contentMetadata": [ { "fieldName": "email_meta", "script": "var x=_metadata._FILE_METADATA_[0].metadata;x;", "scriptlang": "javascript", "flags": "m"
_iterator
var make = _iterator.make;
var model = _iterator.model;
var year = _iterator.year;
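The iteration behaviour can be pictured with the following sketch, which imitates the harvester by assigning each array item to _iterator in turn before calling a script function. The loop, the sample data and the describeVehicle name are assumptions made for illustration; in the platform, _iterator is set for you.

```javascript
// Sample JSON array standing in for metadata the harvester iterates over.
var vehicles = [
    { make: "Honda", model: "Civic", year: 2010 },
    { make: "Ford", model: "Focus", year: 2012 }
];

// Set by the platform for each array item; declared here so the sketch runs.
var _iterator = null;

// Hypothetical script function that reads the current item via _iterator.
function describeVehicle() {
    var make = _iterator.make;
    var model = _iterator.model;
    var year = _iterator.year;
    return year + " " + make + " " + model;
}

// Imitate the harvester: expose each item as _iterator, then run the script.
var descriptions = [];
for (var i = 0; i < vehicles.length; i++) {
    _iterator = vehicles[i];
    descriptions.push(describeVehicle());
}
```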
For more information about using JavaScript to get metadata, and for detailed examples and descriptions, see Content metadata and Manual text transformation.
Creating Entities and Associations
You can use JavaScript to create entities and associations by accessing the metadata through the $SCRIPT and $FUNC scripting conventions.
For more information about using JavaScript to create entities and associations, see Manual entities and Manual association of entities.
Criteria
Criteria is a common field shared by all of the pipeline elements.
It can be used to specify a JavaScript expression that controls the order in which entity extractors are applied to the ingested documents. The expression can be set up to choose entity extractors based on the content and metadata extracted so far in the pipeline.
For example, Infinit.e supports the Open Calais and Salience extraction engines.
The criteria field is of most use with the following pipeline processing elements:
- automated text extraction
- feature extraction
For more information about the use of criteria, and for detailed examples, see Automated text extraction and Feature extraction.
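A criteria check of this kind might look like the following sketch. It assumes, hypothetically, that the document exposes a fullText string and a metadata object, and that the expression should evaluate to true when the element is to be applied. The function name and the 500-character threshold are illustrative only.

```javascript
// Hypothetical criteria check: apply a text-based extractor only when the
// document has enough text and has not already been given location metadata.
function shouldRunExtractor(doc) {
    var text = doc.fullText || "";
    var hasLocation = !!(doc.metadata && doc.metadata.location &&
                         doc.metadata.location.length > 0);
    return text.length >= 500 && !hasLocation;
}

// Stand-in documents; in the platform the document would be supplied as _doc.
var longDoc = { fullText: new Array(601).join("x"), metadata: {} };
var shortDoc = { fullText: "too short", metadata: {} };

var runOnLong = shouldRunExtractor(longDoc);   // true
var runOnShort = shouldRunExtractor(shortDoc); // false
```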
$PATH, $SETPATH and $SCRIPT
In addition to criteria conditions that represent logical conditions for extraction, some criteria values are generated automatically if the pipeline contains conditional elements:
- Each conditional element creates a $SETPATH(<branchA>,<branchB>) statement. For example, a conditional element with node id 3 would create $SETPATH(3_True,3_False).
- Subsequent elements have a $PATH(<branch>) statement as part of the criteria value. A node in the True branch placed after the conditional node (id 3) would have $PATH(3_True) as part of its criteria statement.
- Logical conditions that control the order in which entity extractors are applied are still placed within a $SCRIPT() statement.
These $PATH, $SETPATH and $SCRIPT statements are assembled internally by the flow-builder and become part of the criteria fields of the elements.
If the sourceBuilder() function creates more than one source element per node, the criteria script will be generated for all elements. However, there is one exception:
Only the last element created by a conditional node will contain the criteria value.
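The naming pattern of the generated statements can be reproduced with a short sketch. The two helper functions below are hypothetical and are not the flow-builder's own code; they only rebuild the strings described above from a node id, and they do not attempt to show how the statements are combined with a $SCRIPT() condition.

```javascript
// Build the statement generated for a conditional element.
function setPathStatement(nodeId) {
    return "$SETPATH(" + nodeId + "_True," + nodeId + "_False)";
}

// Build the statement added to an element on one branch of a conditional.
function pathStatement(nodeId, branchTaken) {
    return "$PATH(" + nodeId + "_" + (branchTaken ? "True" : "False") + ")";
}

var conditional = setPathStatement(3);   // "$SETPATH(3_True,3_False)"
var trueBranch = pathStatement(3, true); // "$PATH(3_True)"
```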
Creation Criteria Scripts
Both entity and association specification objects provide a field called "creationCriteriaScript". This must be JavaScript (though you still need to set the engine and enclose it in either $SCRIPT or $FUNC), and you can return one of two things from it:
- A boolean, in which case the entity object is only created if the value is true.
- A string, in which case any non-null string is treated like a boolean false, and in addition the string is logged as an error that can be accessed from the "harvest.harvest_message" field of sources.
The creation criteria script is executed before any other scripts in the specification object.
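A creation criteria function might be sketched as follows. The function name checkVehicle and the validation rules are hypothetical, and the item fields are borrowed from the _iterator example on this page. The sketch assumes that returning true allows creation, while a returned string blocks creation and is logged as described above.

```javascript
// Hypothetical creation criteria check for an entity built from an array item.
// Returns true to allow creation, or a string explaining the rejection.
function checkVehicle(item) {
    if (!item || !item.make) {
        return "skipping item: no make specified";
    }
    if (typeof item.year !== "number" || item.year < 1900) {
        return "skipping " + item.make + ": implausible year";
    }
    return true;
}

var accepted = checkVehicle({ make: "Honda", model: "Civic", year: 2010 });
var rejected = checkVehicle({ model: "Civic", year: 2010 });
```

Returning a descriptive string instead of a plain false means that the reason for each skipped entity shows up in the source's harvest message, which can make misbehaving sources easier to diagnose.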