This toolkit element allows you to use regex or JavaScript to set the document metadata fields (e.g. title, description, publishedDate).

TODO

Format

TODO convert to JSON

...

TODO

Description

Legacy documentation:

TODO

The following formats are currently supported:

Use the document metadata element to set specific values for a document's metadata fields.

Setting Metadata Values

When structured data is extracted from a source (via the File, Database, or other harvester), each extracted field is captured in the Feed.metadata object. Within the Structured Analysis Harvester, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.

Examples

title, description

The following example demonstrates how to extract the title, description, and other parameters from ingested data.

source : {
   ... 
   structuredAnalysis : {
        docGeo : {"lat":"$metadata.latitude","lon":"$metadata.longitude"},
        description : "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress",
        // other document-level fields, see reference
        entities : [
            {disambiguous_name:"$metadata.offense,$metadata.method", dimension:"What", 
                type:"CriminalActivity"},
            {disambiguous_name:"$metadata.blocksiteaddress,$metadata.city,$metadata.state",
                dimension:"Where",type:"Place", geotag: {latitude:"$metadata.latitude",
                longitude:"$metadata.longitude"}}],
        "associations" : [ 
            {entity1:"$metadata.offense,$metadata.method",verb:"reported",verb_category:"crime",
                time_start:"$metadata.reportdatetime","geo_index" : "Location", 
                geotag: {lat:"$metadata.latitude",lon:"$metadata.longitude"} }]
   }
   ...
}

 

In the example above, the offense field can be extracted using either of the following syntaxes:

Code Block
$metadata.offense or ${metadata.offense}

Other fields at the document top level ("$title", "$description", etc.) can also be referenced this way.

Info

Note: When data is extracted and added to the metadata object, all field names are converted to lowercase.

Info

Note: If the metadata field is an array, the above syntax grabs the first element only. To go deeper into arrays, JavaScript must be used.

Info

Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator (e.g. the metadata object being looped over), not the document. However, when iterating over metadata fields that are strings, the document-level referencing above is still valid, and "$value"/"${value}" can be used to reference the value itself.
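
The substitution rules described above (both "$field" and "${field}" forms, lowercase field names, and first-element-only handling of arrays) can be illustrated with a short JavaScript sketch. This is not the Infinit.e implementation: the helper names lowercaseKeys, lookup and resolveReferences are hypothetical, the sample document is invented for illustration, and resolving a missing field to an empty string is an assumption.

```javascript
// Illustrative sketch only: NOT the Infinit.e implementation.
// All function names below are hypothetical.

// Lowercase every key of a metadata object (field names are stored lowercase).
function lowercaseKeys(metadata) {
  const out = {};
  for (const key of Object.keys(metadata)) {
    out[key.toLowerCase()] = metadata[key];
  }
  return out;
}

// Look up a dotted path such as "metadata.offense" in the document.
// Arrays yield their first element only, mirroring the note above.
// Assumption: a missing field resolves to an empty string.
function lookup(doc, path) {
  let current = doc;
  for (const part of path.split(".")) {
    if (current === null || current === undefined) return "";
    current = current[part];
    if (Array.isArray(current)) current = current[0];
  }
  return current === null || current === undefined ? "" : String(current);
}

// Replace every "$a.b" or "${a.b}" reference in the template string.
function resolveReferences(template, doc) {
  const pattern = /\$\{([A-Za-z_][\w.]*)\}|\$([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)/g;
  return template.replace(pattern, (match, braced, bare) => lookup(doc, braced || bare));
}

// Invented sample document for illustration.
const doc = {
  title: "Incident report",
  metadata: lowercaseKeys({ Offense: ["THEFT", "ROBBERY"], Method: "GUN" }),
};

console.log(resolveReferences("$metadata.offense,$metadata.method", doc)); // prints "THEFT,GUN"
console.log(resolveReferences("${metadata.offense} in $title", doc)); // prints "THEFT in Incident report"
```

Only "THEFT" is returned for the offense field because the array is reduced to its first element; reaching "ROBBERY" would require a script.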

published date

The publishedDate can be extracted from values in any of the following example formats:

Code Block
// Date patterns, grouped by the shape of the input string.
// DateFormatUtils is from Apache Commons Lang.
if (null == _allowedDatesArray_startsWithLetter)
{
    _allowedDatesArray_startsWithLetter = new String[] {
        DateFormatUtils.SMTP_DATETIME_FORMAT.getPattern(),

        "MMM d, yyyy hh:mm a",
        "MMM d, yyyy HH:mm",
        "MMM d, yyyy hh:mm:ss a",
        "MMM d, yyyy HH:mm:ss",
        "MMM d, yyyy hh:mm:ss.SS a",
        "MMM d, yyyy HH:mm:ss.SS",

        "EEE MMM dd HH:mm:ss zzz yyyy",
        "EEE MMM dd yyyy HH:mm:ss zzz",
        "EEE MMM dd yyyy HH:mm:ss 'GMT'Z (zzz)",
    };
    _allowedDatesArray_numeric_1 = new String[] {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        DateFormatUtils.ISO_DATE_FORMAT.getPattern(),
        DateFormatUtils.ISO_DATE_TIME_ZONE_FORMAT.getPattern(),
        DateFormatUtils.ISO_DATETIME_FORMAT.getPattern(),
        DateFormatUtils.ISO_DATETIME_TIME_ZONE_FORMAT.getPattern()
    };
    _allowedDatesArray_numeric_2 = new String[] {
        "yyyyMMdd",
        "yyyyMMdd hh:mm a",
        "yyyyMMdd HH:mm",
        "yyyyMMdd hh:mm:ss a",
        "yyyyMMdd HH:mm:ss",
        "yyyyMMdd hh:mm:ss.SS a",
        "yyyyMMdd HH:mm:ss.SS",
        // Julian, these are unlikely
        "yyyyDDD",
        "yyyyDDD hh:mm a",
        "yyyyDDD HH:mm",
        "yyyyDDD hh:mm:ss a",
        "yyyyDDD HH:mm:ss",
        "yyyyDDD hh:mm:ss.SS a",
        "yyyyDDD HH:mm:ss.SS",
    };
    _allowedDatesArray_stringMonth = new String[] {
        "dd MMM yy",
        "dd MMM yy hh:mm a",
        "dd MMM yy HH:mm",
        "dd MMM yy hh:mm:ss a",
        "dd MMM yy HH:mm:ss",
        "dd MMM yy hh:mm:ss.SS a",
        "dd MMM yy HH:mm:ss.SS",
    };
    _allowedDatesArray_numericMonth = new String[] {
        "MM dd yy",
        "MM dd yy hh:mm a",
        "MM dd yy HH:mm",
        "MM dd yy hh:mm:ss a",
        "MM dd yy HH:mm:ss",
        "MM dd yy hh:mm:ss.SS a",
        "MM dd yy HH:mm:ss.SS",
    };
}

...

Code Block
{
    "docMetadata": {
        //...
        "publishedDate": "$SCRIPT( createPubDate(_doc.metadata); )"
        //...
    }
}
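
The createPubDate function referenced in the script above is not shown on this page. As a purely hypothetical sketch of what such a function might look like, the JavaScript below assumes the date lives in a metadata field named reportdate (an invented name) as a "yyyyMMdd" string, possibly wrapped in an array, and returns it in the "yyyy-MM-dd'T'HH:mm:ss'Z'" pattern from the list of formats. Using midnight as the time and returning null when no date is found are also assumptions.

```javascript
// Hypothetical sketch: the real createPubDate is not documented here.
// Assumes metadata.reportdate (invented field name) holds a "yyyyMMdd"
// string, or an array whose first element does.
function createPubDate(metadata) {
  let raw = metadata.reportdate;
  if (Array.isArray(raw)) raw = raw[0];
  const match = /^(\d{4})(\d{2})(\d{2})$/.exec(String(raw || "").trim());
  if (!match) return null; // No usable date found.
  const [, year, month, day] = match;
  // Emit the "yyyy-MM-dd'T'HH:mm:ss'Z'" pattern, using midnight as the time.
  return `${year}-${month}-${day}T00:00:00Z`;
}

console.log(createPubDate({ reportdate: ["20120315"] })); // prints "2012-03-15T00:00:00Z"
```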

Examples

...

 

displayUrl

"displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the way that it is used in the Infinit.e GUI is as follows:

  • If it starts with "http://", it is treated as a web link.
  • Otherwise, it is assumed to be a relative file path to the fileshare specified in the source URL field (e.g. you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
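
The two rules above amount to a simple prefix check, sketched below in JavaScript. This is an illustration of the stated rule, not the Infinit.e GUI code; the function name classifyDisplayUrl and the returned labels are hypothetical.

```javascript
// Illustrative sketch of the GUI rule described above; not the Infinit.e GUI code.
// classifyDisplayUrl and its return labels are hypothetical.
function classifyDisplayUrl(displayUrl) {
  // Only the "http://" prefix is named in the rule, so only that prefix is checked.
  return displayUrl.startsWith("http://") ? "web link" : "fileshare path";
}

console.log(classifyDisplayUrl("http://example.com/report.html")); // prints "web link"
console.log(classifyDisplayUrl("reports/2012/report.pdf")); // prints "fileshare path"
```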

 

IN PROGRESS

Legacy documentation:

TODO