Overview

This toolkit element allows you to use regex or JavaScript to set the document metadata fields (e.g. title, description, publishedDate).

Format

TODO convert to JSON

{
	"display": string,
	"docMetadata": {} // see DocumentSpecPojo below
}
//////////////////////////////////
 
	public static class DocumentSpecPojo {
		public String title; // The string expression or $SCRIPT(...) specifying the document title
		public String description; // The string expression or $SCRIPT(...) specifying the document description
		public String publishedDate; // The string expression or $SCRIPT(...) specifying the document publishedDate
		public String fullText; // The string expression or $SCRIPT(...) specifying the document fullText
		public String displayUrl; // The string expression or $SCRIPT(...) specifying the document displayUrl
		public Boolean appendTagsToDocs; // if true (*NOT* default) source tags are appended to the document 
		public StructuredAnalysisConfigPojo.GeoSpecPojo geotag; // Specify a document level geo-tag
	}

Legacy documentation:

StructuredAnalysis object

TODO

Description

You can use the document-level fields of the structuredAnalysis object to set specific values for a document's metadata (title, description, publishedDate, and so on).

Setting Metadata Values

When structured data is extracted from a source (via the File, Database, or other harvester), each extracted field is captured in the Feed.metadata object. Within the Structured Analysis Harvester, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.

Examples

title, description

The following example demonstrates how to extract the title, description, and other parameters from ingested data.

source : {
   ... 
   structuredAnalysis : {
        docGeo : {"lat":"$metadata.latitude","lon":"$metadata.longitude"},
        description : "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress",
        // other document-level fields, see reference
        entities : [
            {disambiguous_name:"$metadata.offense,$metadata.method", dimension:"What", 
                type:"CriminalActivity"},
            {disambiguous_name:"$metadata.blocksiteaddress,$metadata.city,$metadata.state",
                dimension:"Where",type:"Place", geotag: {latitude:"$metadata.latitude",
                longitude:"$metadata.longitude"}}],
        "associations" : [ 
            {entity1:"$metadata.offense,$metadata.method",verb:"reported",verb_category:"crime",
                time_start:"$metadata.reportdatetime","geo_index" : "Location", 
                geotag: {lat:"$metadata.latitude",lon:"$metadata.longitude"} }]
   }
   ...
}

In the document above you can extract the Offense field using the following syntax:

$metadata.offense or ${metadata.offense}

Other fields at the document top level ("$title", "$description", etc.) can also be referenced this way.

Note: When data is extracted and added to the metadata object, all field names are converted to lowercase.

Note: If the metadata field is an array, the above syntax grabs the first element only. To go deeper into arrays, JavaScript must be used.

Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator (e.g. the metadata object being looped over), not the document. However, when iterating over metadata fields that are strings, the above document-level referencing is still valid, or "$value"/"${value}" can be used to reference the value itself.
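
The referencing rules above can be illustrated with a small JavaScript sketch. This is not the platform's own substitution code: the helper name resolveFieldRefs is hypothetical, and replacing a missing field with an empty string is an assumption made only for this example. It handles the "$field" and "${field}" forms and takes the first element of array-valued fields.

```javascript
// Illustrative sketch only, NOT the platform's actual implementation.
// resolveFieldRefs is a hypothetical helper name.
function resolveFieldRefs(template, doc) {
    // Matches "${metadata.offense}" (braced) or "$metadata.offense" (bare)
    var pattern = /\$\{([A-Za-z_][\w.]*)\}|\$([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)/g;
    return template.replace(pattern, function (match, braced, bare) {
        var path = (braced || bare).split(".");
        var value = doc;
        for (var i = 0; i < path.length; i++) {
            if (value === null || value === undefined) {
                break;
            }
            value = value[path[i]];
            // Array-valued fields resolve to their first element only
            if (Array.isArray(value)) {
                value = value[0];
            }
        }
        // Assumption: a missing field becomes an empty string
        return (value === null || value === undefined) ? "" : String(value);
    });
}

// Field names are lowercase, as they would be in the metadata object
var sampleDoc = {
    title: "Incident report",
    metadata: { offense: ["THEFT", "OTHER"], method: "GUN" }
};
console.log(resolveFieldRefs("$metadata.offense,${metadata.method}", sampleDoc));
```

Going deeper into an array than its first element is not covered by this syntax; as noted above, that requires JavaScript in the source itself.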

published date

The following date formats are recognized when extracting the publishedDate:

		if (null == _allowedDatesArray_startsWithLetter) 
		{
			_allowedDatesArray_startsWithLetter = new String[] {
					DateFormatUtils.SMTP_DATETIME_FORMAT.getPattern(),
					
					"MMM d, yyyy hh:mm a",
					"MMM d, yyyy HH:mm",
					"MMM d, yyyy hh:mm:ss a",
					"MMM d, yyyy HH:mm:ss",
					"MMM d, yyyy hh:mm:ss.SS a",
					"MMM d, yyyy HH:mm:ss.SS",
					
					"EEE MMM dd HH:mm:ss zzz yyyy",
					"EEE MMM dd yyyy HH:mm:ss zzz",
					"EEE MMM dd yyyy HH:mm:ss 'GMT'Z (zzz)",					
			};					
			_allowedDatesArray_numeric_1 = new String[] {
					"yyyy-MM-dd'T'HH:mm:ss'Z'",
					DateFormatUtils.ISO_DATE_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATE_TIME_ZONE_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATETIME_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATETIME_TIME_ZONE_FORMAT.getPattern()
			};
			_allowedDatesArray_numeric_2 = new String[] {					
					"yyyyMMdd",
					"yyyyMMdd hh:mm a",
					"yyyyMMdd HH:mm",
					"yyyyMMdd hh:mm:ss a",
					"yyyyMMdd HH:mm:ss",
					"yyyyMMdd hh:mm:ss.SS a",
					"yyyyMMdd HH:mm:ss.SS",
					// Julian, these are unlikely
					"yyyyDDD",
					"yyyyDDD hh:mm a",
					"yyyyDDD HH:mm",
					"yyyyDDD hh:mm:ss a",
					"yyyyDDD HH:mm:ss",
					"yyyyDDD hh:mm:ss.SS a",
					"yyyyDDD HH:mm:ss.SS",
				};
			_allowedDatesArray_stringMonth = new String[] {
					"dd MMM yy",
					"dd MMM yy hh:mm a",
					"dd MMM yy HH:mm",
					"dd MMM yy hh:mm:ss a",
					"dd MMM yy HH:mm:ss",
					"dd MMM yy hh:mm:ss.SS a",
					"dd MMM yy HH:mm:ss.SS",
				};
			_allowedDatesArray_numericMonth = new String[] {
					"MM dd yy",
					"MM dd yy hh:mm a",
					"MM dd yy HH:mm",
					"MM dd yy hh:mm:ss a",
					"MM dd yy HH:mm:ss",
					"MM dd yy hh:mm:ss.SS a",
					"MM dd yy HH:mm:ss.SS",
			};
		}

If the date doesn't match one of these formats, add a function along the following lines in the globals script:

// substitute YOUR.DATE.FIELD and the date format
function createPubDate(metadata) {
    var date = metadata.YOUR.DATE.FIELD;
    var parsedDate = new java.text.SimpleDateFormat('MM/dd/yyyy hh:mm:ss a (zzz)').parse(date);
    return '' + parsedDate.toString();
}

You can then call it from the docMetadata.publishedDate field as follows:

{
	"docMetadata": {
		//...	
		"publishedDate": "$SCRIPT( createPubDate(_doc.metadata) )",
		//...
	}
}

displayUrl

"displayUrl" sets the corresponding document JSON field. The field is guaranteed not to be used by the Infinit.e platform, which makes it useful for linking documents to external content. For reference, the Infinit.e GUI interprets it as follows:

If it starts with "http://", it is treated as a web link.
Otherwise, it is assumed to be a relative file path on the fileshare specified in the source url field (e.g. you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
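
The two cases above can be sketched in JavaScript as follows. This is only an illustration of the convention, not GUI code: the function name describeDisplayUrl and the returned object shape are hypothetical, and joining "sourceKey" and "displayUrl" by plain concatenation with no separator is an assumption.

```javascript
// Illustrative sketch of the convention described above; not actual GUI code.
// describeDisplayUrl and the { kind, target } shape are hypothetical.
function describeDisplayUrl(displayUrl, sourceKey) {
    if (displayUrl.indexOf("http://") === 0) {
        // Case 1: treated as a web link
        return { kind: "web", target: displayUrl };
    }
    // Case 2: relative path on the source's fileshare.
    // Assumption: sourceKey and displayUrl are joined by plain concatenation.
    return { kind: "file", target: sourceKey + displayUrl };
}

console.log(describeDisplayUrl("http://example.com/report.html", "my.source.key"));
console.log(describeDisplayUrl("/reports/report.pdf", "my.source.key"));
```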

IN PROGRESS