Structured Analysis - Overview
There is also a reference page for the Structured Analysis configuration object.
The Infinit.e Structured Analysis Harvester is designed to take data ingested from structured data sources (database tables, XML documents, etc.) and enrich the data via the assignment of geospatial information, entities and events. The Structured Analysis Harvester is also capable of transforming source data via basic string concatenation (using simple regular expression support) and more complex transformations using JavaScript. The example Source.structuredAnalysis object below demonstrates the basic features of specifying how to enrich harvested structured data.
source : {
    ...
    structuredAnalysis : {
        docGeo : {"lat":"$metadata.latitude","lon":"$metadata.longitude"},
        description : "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress",
        //other document level fields, see reference
        entities : [
            {disambiguous_name:"$metadata.offense,$metadata.method", dimension:"What", type:"CriminalActivity"},
            {disambiguous_name:"$metadata.blocksiteaddress,$metadata.city,$metadata.state", dimension:"Where",type:"Place",
                geotag: {latitude:"$metadata.latitude", longitude:"$metadata.longitude"}}
        ],
        "associations" : [
            {entity1:"$metadata.offense,$metadata.method",verb:"reported",verb_category:"crime",
                time_start:"$metadata.reportdatetime","geo_index" : "Location",
                geotag: {lat:"$metadata.latitude",lon:"$metadata.longitude"} }
        ]
    }
    ...
}
Display URL
"displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the way that it is used in the Infinit.e GUI is as follows:
- If it starts with "http://" then it is treated as a web link
- Otherwise, it is assumed to be a relative file path to the fileshare specified in the source url field. (eg you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
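The two rules above can be sketched in JavaScript as follows. This is only an illustration of the described behaviour, not the GUI's actual code: the function name "resolveDisplayUrl" and the shape of its return value are hypothetical, and the file case simply concatenates "sourceKey" and "displayUrl" as the text describes.

```javascript
// Hypothetical helper (not part of Infinit.e): classifies a displayUrl
// according to the two rules described above.
function resolveDisplayUrl(displayUrl, sourceKey) {
  // Rule 1: anything starting with "http://" is treated as a web link.
  // (The text only mentions "http://", so nothing else is assumed here.)
  if (displayUrl.startsWith("http://")) {
    return { kind: "web", target: displayUrl };
  }
  // Rule 2: otherwise it is a path relative to the source's fileshare;
  // the sourceKey concatenated with the displayUrl identifies the file.
  return { kind: "file", target: sourceKey + displayUrl };
}

console.log(resolveDisplayUrl("http://example.com/report.html", "my.source.key"));
console.log(resolveDisplayUrl("/reports/2012/report.pdf", "my.source.key"));
```

The "target" of a "file" result is what would be passed to the "Document - File - Get" call; the source key and paths shown are made-up values.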
Using the $ Operator to Extract Document Data
When structured data is extracted from a source (via the File, Database, or other harvester), each extracted field is captured in the Feed.metadata object. Within the Structured Analysis Harvester, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is to be retrieved from a field in the document. For example, in the document above you can extract the Offense field using the following syntax:
$metadata.offense or ${metadata.offense}
Other fields at the document top level ("$title", "$description", etc) can also be referenced this way.
Note: When data is extracted and added to the metadata object, all field names are converted to lowercase.
Note: If the metadata field is an array, the above syntax grabs the first element only. To go deeper into arrays, JavaScript must be used.
Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator (eg the metadata object being looped over), not the document. However, when iterating over metadata fields that are strings, the above document-level referencing is still valid, and "$value"/"${value}" can be used to reference the value itself.
Note: The $ sign can be escaped as ${$}.
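The substitution rules above can be illustrated with a short JavaScript sketch. It is not the harvester's implementation: the functions "lookup" and "substitute" are hypothetical, the sample document is made up, and replacing a missing field with an empty string is an assumption. It covers the "$field" and "${field}" forms, first-element handling for arrays, and the "${$}" escape, but not iterator-relative references.

```javascript
// Hypothetical helper: follows a dotted path such as "metadata.offense".
function lookup(doc, path) {
  let value = doc;
  for (const part of path.split(".")) {
    if (value === null || value === undefined) return "";
    value = value[part];
    // Array-valued fields yield their first element only.
    if (Array.isArray(value)) value = value[0];
  }
  // Assumption: missing fields are replaced by an empty string.
  return value === null || value === undefined ? "" : String(value);
}

// Hypothetical helper: expands "$a.b", "${a.b}" and the "${$}" escape.
function substitute(template, doc) {
  // Alternatives are tried in order: escape, braced form, bare form.
  const pattern = /\$\{\$\}|\$\{([\w.]+)\}|\$([A-Za-z_]\w*(?:\.\w+)*)/g;
  return template.replace(pattern, (match, braced, bare) => {
    if (match === "${$}") return "$";
    return lookup(doc, braced || bare);
  });
}

// Made-up document; field names are lowercase, as the harvester stores them.
const sampleDoc = {
  title: "Incident 42",
  metadata: { offense: ["THEFT"], method: ["GUN"] }
};

console.log(substitute("$title: $metadata.offense,${metadata.method} cost ${$}5", sampleDoc));
```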
Document updates and metadata
Existing documents can be updated in a number of different cases:
- Files can be updated (changing their "modified time")
- For RSS feeds/URLs, the source parameter "updateCycle_secs" will periodically update the file.
- Database sources can be updated as the result of a SQL call.
When a document is updated, it is essentially equivalent to deleting and then re-creating it, except that its "_id" field is preserved. The Structured Analysis Harvester provides a mechanism to do the following:
- Preserve metadata from the old document (eg so the entities/associations can be recreated)
- Generate new metadata (and thence entities/associations) based on the differences between successive documents.
A script can be placed into the "onUpdateScript" field (note that the "$SCRIPT" convention used in entity/association scriptlets is not required here). This script has access to the following JavaScript objects:
- "_old_doc": The document object that is about to be deleted
- "_doc": The newly created document object after all metadata/entity/association creation.
The value of the last evaluated expression in the script is placed in a metadata field called "_PERSISTENT_" (ie you don't "return val;", you just end the script with "val;"). The value can be a string, an object, or an array of objects. For example, the following code just saves the entirety of the old document's metadata:
// SOURCE CONFIG:
"structuredAnalysis": {
    "scriptEngine": "javascript",
    "onUpdateScript": "var retVal = _old_doc.metadata; retVal;"
}

// RESULT (IN THE CASE OF A DOCUMENT THAT DOESN'T CHANGE):
{
    // Usual document fields
    "metadata": {
        "test1": "test",
        "test2": { "field": "value" },
        "_PERSISTENT_": [{
            "test1": "test",
            "test2": { "field": "value" }
        }]
    }
}
And the following script shows a very simple example of comparing the old and new documents:
"structuredAnalysis": {
    "scriptEngine": "javascript",
    "onUpdateScript": "var delta = _old_doc.metadata.length - _doc.metadata.length; var retVal = { 'delta': delta }; retVal;"
}
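To experiment with an update script outside the platform, the following JavaScript sketch imitates the update step with mock documents. It rests on several assumptions: "runOnUpdateScript" is a hypothetical stand-in for the harvester, the result is wrapped in an array under "_PERSISTENT_" (as the earlier result listing suggests), and the script engine is assumed to support "Object.keys". The script counts metadata fields with "Object.keys" because, if "metadata" is exposed to the script as a plain object, its "length" property would be undefined.

```javascript
// Hypothetical stand-in for the harvester's update step (not platform code).
function runOnUpdateScript(script, _old_doc, _doc) {
  // Direct eval can see _old_doc and _doc, and returns the value of the
  // last evaluated expression, matching the "end the script with val;" rule.
  const result = eval(script);
  // Assumption: the result is stored as an array under "_PERSISTENT_".
  _doc.metadata._PERSISTENT_ = Array.isArray(result) ? result : [result];
  return _doc;
}

// Script text as it would appear in the "onUpdateScript" field.
const onUpdateScript =
  "var oldKeys = Object.keys(_old_doc.metadata); " +
  "var newKeys = Object.keys(_doc.metadata); " +
  "var removed = oldKeys.filter(function (k) { return newKeys.indexOf(k) < 0; }); " +
  "var retVal = { 'delta': oldKeys.length - newKeys.length, 'removed': removed }; " +
  "retVal;";

// Mock documents: the new version has lost the "test2" field.
const oldDoc = { metadata: { test1: "test", test2: { field: "value" } } };
const newDoc = { metadata: { test1: "test" } };

const updated = runOnUpdateScript(onUpdateScript, oldDoc, newDoc);
console.log(JSON.stringify(updated.metadata._PERSISTENT_));
```

Only the string assigned to "onUpdateScript" would go into the source configuration; everything else is test scaffolding.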