Overview
This toolkit element allows you to use regex or JavaScript to set the document metadata fields (e.g. title, description, publishedDate).
Format
TODO convert to JSON
    {
        "display": string,
        "docMetadata": {} // see DocumentSpecPojo below
    }
    //////////////////////////////////
    public static class DocumentSpecPojo {
        public String title; // The string expression or $SCRIPT(...) specifying the document title
        public String description; // The string expression or $SCRIPT(...) specifying the document description
        public String publishedDate; // The string expression or $SCRIPT(...) specifying the document publishedDate
        public String fullText; // The string expression or $SCRIPT(...) specifying the document fullText
        public String displayUrl; // The string expression or $SCRIPT(...) specifying the document displayUrl
        public Boolean appendTagsToDocs; // if true (*NOT* default) source tags are appended to the document
        public StructuredAnalysisConfigPojo.GeoSpecPojo geotag; // Specify a document level geo-tag
    }
Description
You can use docMetadata to set specific values for a document's metadata. docMetadata has the following parameters:
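Parameter | Description | Note |
---|---|---|
title | The string expression or $SCRIPT(...) specifying the document title | |
description | The string expression or $SCRIPT(...) specifying the document description | |
publishedDate | The string expression or $SCRIPT(...) specifying the document publishedDate | |
fullText | The string expression or $SCRIPT(...) specifying the document fullText | |
displayUrl | The string expression or $SCRIPT(...) specifying the document displayUrl. "displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the way that it is used in the Infinit.e GUI is as follows: | |
appendTagsToDocs | If true (not the default), source tags are appended to the document | |
geotag | Specifies a document-level geo-tag | See example below. |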
Setting Metadata Values
When document metadata is extracted from a source (via the File, Database, or other technique), each extracted field is captured in the Feed.metadata object. When using document metadata, data stored in the metadata object can be accessed with the $ operator, which signifies that we are retrieving data from a field in the document.
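To make the $ notation concrete, here is a minimal JavaScript sketch of how a reference such as "$metadata.json.body" could be resolved against a document. It is not the platform's implementation: `resolveFieldRef` and the sample document are hypothetical, and stepping into the first element whenever an array is encountered is an assumption, made only because the script examples on this page address the same data as `_doc.metadata.json[0]`.

```javascript
// Hypothetical helper (not part of Infinit.e): resolves "$a.b.c" against a document.
// Assumption: when a path segment lands on an array, its first element is used.
function resolveFieldRef(doc, expression) {
    if (typeof expression !== 'string' || expression.charAt(0) !== '$') {
        return expression; // not a field reference, so treat it as a literal
    }
    var segments = expression.substring(1).split('.');
    var current = doc;
    for (var i = 0; i < segments.length; i++) {
        if (Array.isArray(current)) {
            current = current[0];
        }
        if (current === null || current === undefined) {
            return ''; // the path does not exist in this document
        }
        current = current[segments[i]];
    }
    if (Array.isArray(current)) {
        current = current[0];
    }
    return (current === null || current === undefined) ? '' : current;
}

// Sample document, invented for illustration only.
var sampleDoc = {
    metadata: {
        json: [ { body: 'Example tweet text', link: 'http://example.com/status/1' } ]
    }
};

console.log(resolveFieldRef(sampleDoc, '$metadata.json.body')); // Example tweet text
console.log(resolveFieldRef(sampleDoc, '$metadata.json.missing') === ''); // true
```

Returning an empty string for a missing path mirrors the try/catch fallback used by the scripts in the examples; whether the platform does the same for plain $ references is not stated here.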
Web Feed Example (Twitter)
In the example, docMetadata is an object whose parameters are set either from the field reference "$metadata.json.body" or from a $SCRIPT(...) expression. Together they define the following parameters for the document metadata:
- title
- description
- fullText
- publishedDate
- geotag
    },
    {
        "docMetadata": {
            "title": "$metadata.json.body",
            "description": "$metadata.json.body",
            "fullText": "$metadata.json.body",
            "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
            "geotag": {
                "lat": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[0];} catch (err) {return '';})",
                "lon": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[1];} catch (err) {return '';})"
            }
        }
    },
In this example, the scripts parse the data from the Twitter feed in order to set the metadata values.
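The two kinds of script used in this example can be tried outside the platform. The sketch below runs the same expressions against a hand-made document. The sample values (timestamp, coordinates) are invented for illustration, and `publishedDate` and `coordinate` are hypothetical wrapper functions standing in for the $SCRIPT(...) bodies.

```javascript
// Stand-in for the platform's _doc object; all values are made up.
var tweetDoc = {
    metadata: {
        json: [ {
            postedTime: '2012-09-06T18:42:01.000Z',
            geo: { coordinates: [ 40.7, -74.0 ] }
        } ]
    }
};

// Same expression as the publishedDate script: drops the millisecond part.
function publishedDate(_doc) {
    return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/, 'Z');
}

// Same pattern as the lat/lon scripts: fall back to '' when geo data is absent.
function coordinate(_doc, index) {
    try {
        return _doc.metadata.json[0].geo.coordinates[index];
    } catch (err) {
        return '';
    }
}

console.log(publishedDate(tweetDoc));   // 2012-09-06T18:42:01Z
console.log(coordinate(tweetDoc, 0));   // 40.7
console.log(coordinate(tweetDoc, 1));   // -74

// A tweet without a geo block yields the empty-string fallback.
var untaggedDoc = { metadata: { json: [ { postedTime: '2012-09-06T18:42:01.000Z' } ] } };
console.log(coordinate(untaggedDoc, 0) === ''); // true
```

Note that the `.` in the regular expression is unescaped, so it matches any single character before the three digits; writing `\.` would restrict it to a literal dot.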
    ],
    "fullText": "$metadata.json.body",
    "script": "function getAddressVal( addressStr, number) { try { var addressArray = addressStr.split(/ *, */); if (addressArray != null && addressArray.length > 0) { if (addressArray[number].toLowerCase()=='ny') { return 'new york'; } else if (addressArray[number].toLowerCase()=='long island' || addressArray[number].toLowerCase()=='li') { return 'medford'; } else { return addressArray[number]; } } else { return ''; } } catch (err) { return ''; } } function getRegion( code ) { if (code.toLowerCase()=='ny') {return 'New York';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else if (code.toLowerCase()=='ct') {return 'Connecticut';} else if (code.toLowerCase()=='md') {return 'Maryland';} else if (code.toLowerCase()=='va') {return 'Virginia';} else if (code.toLowerCase()=='pa') {return 'Pennsylvania';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else {return 'New York';} }",
    "scriptEngine": "javascript",
    "title": "$metadata.json.body",
    "url": "$metadata.json.link",
    "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)"
    },
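Because the "script" value has to fit inside a single JSON string, the helper functions are hard to read. Unpacked, getAddressVal looks like the sketch below. The logic is the same as in the source; only the layout, the comments, and the local variable `part` are new, and the sample addresses are invented for illustration.

```javascript
// getAddressVal from the "script" field, unpacked for readability.
function getAddressVal(addressStr, number) {
    try {
        // Split on commas, ignoring any spaces around them.
        var addressArray = addressStr.split(/ *, */);
        if (addressArray != null && addressArray.length > 0) {
            var part = addressArray[number].toLowerCase();
            if (part == 'ny') {
                return 'new york';
            } else if (part == 'long island' || part == 'li') {
                return 'medford';
            } else {
                return addressArray[number];
            }
        } else {
            return '';
        }
    } catch (err) {
        // A missing address string or an out-of-range index ends up here.
        return '';
    }
}

console.log(getAddressVal('Albany, NY', 0));       // Albany
console.log(getAddressVal('Albany, NY', 1));       // new york
console.log(getAddressVal('12 Main St , LI', 1));  // medford
console.log(getAddressVal('Albany', 3) === '');    // true
```

The try/catch does double duty: it covers a null or undefined address string as well as an index past the end of the split array, so callers always receive a string.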
Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator (e.g. the metadata object being looped over), not the document. However, when iterating over metadata fields that are strings, the above document-level referencing is still valid, or "$value"/"${value}" can be used to reference the value itself.
"Office" documents Example
In this example, the subject line of an email is extracted using document metadata and set as the title of the resulting document.
    },
    {
        "docMetadata": {
            "title": "$SCRIPT( return _doc.metadata._FILE_METADATA_[0].metadata.subject[0];)"
        }
    },
In the sample output we can see that title has been set using the docMetadata script. The remaining document fields follow, ending with the entities array of JSON objects.
    {
        "_id": "5048efb0e4b01fd6455420ee",
        "title": "RE: Testing Preschedule workspace",
        "url": "smb://modus:139/enron/testing/semperger-c/deleted_items/37QTKE~3",
        "created": "Sep 6, 2012 06:42:01 PM UTC",
        "modified": "Jul 24, 2012 01:13:02 AM UTC",
        "publishedDate": "Jul 9, 2001 06:33:32 PM UTC",
        "source": [ "Enron Emails (TextRank)" ],
        "sourceKey": [ "modus.139.enron.testing.." ],
        "mediaType": [ "Email" ],
        "description": "I am trying to pull it up now, it's taking a long time\r\n\r\n \r\nFrom: \tSmith, Will \r\nSent:\tMonday, July 09, 2001 11:28 AM\r\nTo:\tSemperger, Cara\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nYes, but Vish made the changes in Table Edit. : - )\r\n\r\nWill\r\n\r\n \r\nFrom: \tSemperger, Cara \r\nSent:\tMonday, July 09, 2001 1:20 PM\r\nTo:\tSmith, Will\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nSo, this table edit that Brett is asking me to test is really from ",
        "entities": [
            {
                "disambiguated_name": "on- june 18-paloverde-day",
                "index": "on- june 18-paloverde-day/keyword",
                "actual_name": "on- june 18-paloverde-day",
                "type": "Keyword",
                "relevance": 0.10585404743253149,
                "frequency": 1,
                "totalfrequency": 12,
                "doccount": 12,
                "dimension": "What"
            },
            {
                "disambiguated_name": "mulitple times additional data",
                "index": "mulitple times additional data/keyword",
                "actual_name": "mulitple times additional data",
                "type": "Keyword",
                "relevance": 0.18088061045762382,
                "frequency": 1,
                "totalfrequency": 12,
                "doccount": 12,
                "dimension": "What"
            },
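The title script in this example walks a fairly deep path, so it can help to try it against a hand-made document first. The sketch below is an illustration under assumptions: `emailDoc` fills in only the path the script reads (its shape is inferred from the script itself, with the subject taken from the sample output), and `titleFromSubject` is a hypothetical wrapper. The try/catch fallback is a variation borrowed from the Twitter example; the original title script has no such guard.

```javascript
// Stand-in for _doc; only the path used by the title script is populated.
var emailDoc = {
    metadata: {
        _FILE_METADATA_: [
            { metadata: { subject: [ 'RE: Testing Preschedule workspace' ] } }
        ]
    }
};

// Same expression as the title script, wrapped with a fallback (an addition).
function titleFromSubject(_doc) {
    try {
        return _doc.metadata._FILE_METADATA_[0].metadata.subject[0];
    } catch (err) {
        return ''; // file metadata or subject array is missing
    }
}

console.log(titleFromSubject(emailDoc)); // RE: Testing Preschedule workspace

// A document without file metadata yields the empty-string fallback.
console.log(titleFromSubject({ metadata: {} }) === ''); // true
```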
Setting Metadata Values for Location
You can use document metadata to set location values using geotag. In the example source, the docMetadata block has been configured to use JavaScript to set the city, country and stateProvince. In this example, the JavaScript functions and variables were already defined using the globals.
    },
    {
        "docMetadata": {
            "title": "$metadata.subject",
            "description": "$metadata.summary",
            "publishedDate": "$metadata.incidentdate",
            "geotag": {
                "city": "$SCRIPT( return _doc.metadata.location[0].citystateprovince.city; )",
                "country": "$SCRIPT( return _doc.metadata.location[0].country; )",
                "stateProvince": "$SCRIPT( return _doc.metadata.location[0].citystateprovince.stateprovince; )"
            }
        }
    },
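The three geotag scripts can be exercised together outside the platform. In the sketch below, `incidentDoc` is a hand-made stand-in for `_doc` whose location values are taken from the example output on this page; placing that record under `metadata.location` is an assumption based on the paths the scripts use, and `geotagFor` is a hypothetical helper that simply evaluates the three expressions.

```javascript
// Stand-in for _doc, populated with the location record from the example output.
var incidentDoc = {
    metadata: {
        location: [ {
            citystateprovince: { city: 'Manugay', stateprovince: 'Kunar' },
            country: 'Afghanistan',
            region: 'South Asia'
        } ]
    }
};

// Evaluates the same three expressions as the geotag scripts.
function geotagFor(_doc) {
    return {
        city: _doc.metadata.location[0].citystateprovince.city,
        country: _doc.metadata.location[0].country,
        stateProvince: _doc.metadata.location[0].citystateprovince.stateprovince
    };
}

console.log(JSON.stringify(geotagFor(incidentDoc)));
// {"city":"Manugay","country":"Afghanistan","stateProvince":"Kunar"}
```

Note the difference in case between the source field (stateprovince) and the geotag parameter (stateProvince).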
Output:
The output of the example source contains the location information from the source data.
    "location": [{
        "citystateprovince": {
            "city": "Manugay",
            "stateprovince": "Kunar"
        },
        "country": "Afghanistan",
        "region": "South Asia"