Format
Code Block |
---|
{
"display": string,
"docMetadata": {
"title": string, // The string expression or $SCRIPT(...) specifying the document title
"description": string, // The string expression or $SCRIPT(...) specifying the document description
"publishedDate": string, // The string expression or $SCRIPT(...) specifying the document publishedDate
"mediaType": string, // The string expression or $SCRIPT(...) specifying the document mediaType (otherwise taken from top-level source field)
"tags": string, // A ,-separated list of string expressions or $SCRIPT(...) - returning a ,-separated list, the result of each will be added to the tags
"fullText": string, // The string expression or $SCRIPT(...) specifying the document fullText
"displayUrl": string, // The string expression or $SCRIPT(...) specifying the document displayUrl
"appendTagsToDocs": Boolean, // If true, source tags are appended to the document. Default value is false.
"geotag": {config_param_name} // Specify a document-level geo-tag
}
} |
Description
You can use docMetadata to set specific values for a document's metadata.
Setting Metadata Values
When structured data is extracted from a source (via the File, Database, or other harvester), each extracted field is captured in the Feed.metadata object. Within the Structured Analysis Harvester, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.
Examples
title, description
The following example demonstrates how to extract the title, description, and other parameters from ingested data.
source : {
...
structuredAnalysis : {
docGeo : {"lat":"$metadata.latitude","lon":"$metadata.longitude"},
description : "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress",
//other document level fields, see reference
entities : [
{disambiguous_name:"$metadata.offense,$metadata.method", dimension:"What",
type:"CriminalActivity"},
{disambiguous_name:"$metadata.blocksiteaddress,$metadata.city,$metadata.state",
dimension:"Where",type:"Place", geotag: {latitude:"$metadata.latitude",
longitude:"$metadata.longitude"}}],
"associations" : [
{entity1:"$metadata.offense,$metadata.method",verb:"reported",verb_category:"crime",
time_start:"$metadata.reportdatetime","geo_index" : "Location",
geotag: {lat:"$metadata.latitude",lon:"$metadata.longitude"} }]
}
...
}
In the document above you can extract the Offense field using the following syntax:
Code Block |
---|
$metadata.offense or ${metadata.offense}
|
Other fields at the document top level ("$title", "$description", etc.) can also be referenced this way.
Info |
---|
Note: When data is extracted and added to the metadata object, all field names are converted to lowercase. |
Info |
---|
Note: If the metadata field is an array, the above syntax grabs the first element only. To go deeper into arrays, JavaScript must be used. |
Info |
---|
Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator, not the document (eg the metadata object being looped over). However when iterating over metadata fields that are strings, then the above document-level referencing is still valid, or "$value"/"${value}" can be used to reference the value itself. |
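The substitution rules described above can be illustrated with a small sketch. It is not the platform's own code: the function name resolveExpression, the sample document, and the choice to return an empty string for a missing field are all assumptions made for illustration. It only mirrors the two documented behaviors, namely that "$field" and "${field}" are equivalent and that an array-valued field yields its first element.

```javascript
// Hypothetical helper (not part of Infinit.e): resolves "$field" and
// "${field}" references in a string expression against a document object.
function resolveExpression(expression, doc) {
  // Matches either ${a.b} or $a.b, where a and b are word characters
  var pattern = /\$\{([\w.]+)\}|\$(\w+(?:\.\w+)*)/g;
  return expression.replace(pattern, function (match, braced, bare) {
    var path = (braced || bare).split('.');
    var value = doc;
    for (var i = 0; i < path.length; i++) {
      if (value == null) return ''; // assumption: missing fields resolve to ''
      value = value[path[i]];
      // An array-valued field yields its first element only
      if (Array.isArray(value)) value = value[0];
    }
    return value == null ? '' : String(value);
  });
}

// Hypothetical sample document with lowercase metadata field names
var sampleDoc = {
  title: 'Report',
  metadata: { offense: ['THEFT'], method: ['GUN', 'KNIFE'] }
};

console.log(resolveExpression('$metadata.offense,$metadata.method', sampleDoc));
console.log(resolveExpression('${metadata.offense} in $title', sampleDoc));
```

Anything beyond the first array element, such as the second entry of method in the sample, would need a $SCRIPT(...) expression instead.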
published date
The publishedDate can be extracted from a value in any of the formats listed under Supported Date Formats.
docMetadata has the following parameters:
Parameter | Description |
---|---|
title | The string expression or $SCRIPT(...) specifying the document title |
description | The string expression or $SCRIPT(...) specifying the document description |
publishedDate | The string expression or $SCRIPT(...) specifying the document publishedDate; it must return one of the supported date formats listed under Supported Date Formats |
mediaType | The string expression or $SCRIPT(...) specifying the document mediaType (otherwise taken from top-level source field) |
tags | A ,-separated list of string expressions or $SCRIPT(...) calls, each of which may itself return a ,-separated list; the result of each is added to the tags |
fullText | The string expression or $SCRIPT(...) specifying the document fullText |
displayUrl | Sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform, so it is useful for linking documents to external content. In the Infinit.e GUI, a value starting with "http://" is treated as a web link; any other value is assumed to be a relative file path to the fileshare specified in the source url field. |
appendTagsToDocs | If true, source tags are appended to the document. Default value is false. |
geotag | Specifies a document-level geo-tag using the following fields: "lat": "string", "lon": "string", "city": "string", "stateProvince": "string", "country": "string", "countryCode": "string". See the geotag example under Examples. |
Supported Date Formats
Code Block |
---|
"MMM d, yyyy hh:mm a", "MMM d, yyyy HH:mm", "MMM d, yyyy hh:mm:ss a", "MMM d, yyyy HH:mm:ss", "MMM d, yyyy hh:mm:ss.SS a", "MMM d, yyyy HH:mm:ss.SS",
"EEE MMM dd HH:mm:ss zzz yyyy", "EEE MMM dd yyyy HH:mm:ss zzz", "EEE MMM dd yyyy HH:mm:ss 'GMT'Z (zzz)",
"yyyy-MM-dd'T'HH:mm:ss'Z'",
"yyyyMMdd", "yyyyMMdd hh:mm a", "yyyyMMdd HH:mm", "yyyyMMdd hh:mm:ss a", "yyyyMMdd HH:mm:ss", "yyyyMMdd hh:mm:ss.SS a", "yyyyMMdd HH:mm:ss.SS",
"yyyyDDD", "yyyyDDD hh:mm a", "yyyyDDD HH:mm", "yyyyDDD hh:mm:ss a", "yyyyDDD HH:mm:ss", "yyyyDDD hh:mm:ss.SS a", "yyyyDDD HH:mm:ss.SS",
"dd MMM yy", "dd MMM yy hh:mm a", "dd MMM yy HH:mm", "dd MMM yy hh:mm:ss a", "dd MMM yy HH:mm:ss", "dd MMM yy hh:mm:ss.SS a", "dd MMM yy HH:mm:ss.SS",
"MM dd yy", "MM dd yy hh:mm a", "MM dd yy HH:mm", "MM dd yy hh:mm:ss a", "MM dd yy HH:mm:ss", "MM dd yy hh:mm:ss.SS a", "MM dd yy HH:mm:ss.SS",
"EEE, dd MMM yyyy HH:mm:ss Z" (SMTP)
"yyyy-MM-dd" (ISO date)
"yyyy-MM-ddZZ" (ISO date time-zone)
"yyyy-MM-dd'T'HH:mm:ss" (ISO datetime)
"yyyy-MM-dd'T'HH:mm:ssZZ" (ISO datetime time-zone) |
If the date doesn't match one of these formats, add a function along the following lines in the globals script:
Code Block |
---|
// substitute YOUR.DATE.FIELD and the date format
function createPubDate(metadata) {
var date = metadata.YOUR.DATE.FIELD;
var parsedDate = new java.text.SimpleDateFormat('MM/dd/yyyy hh:mm:ss a (zzz)').parse(date);
return '' + parsedDate.toString();
} |
and then you can call it from the docMetadata.publishedDate field like:
Code Block |
---|
{
"docMetadata": {
//...
"publishedDate": "$SCRIPT( return createPubDate(_doc.metadata); )",
//...
}
} |
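As an alternative to calling java.text.SimpleDateFormat, the same conversion can be sketched in plain JavaScript by parsing the custom layout by hand and returning one of the supported formats, here "yyyy-MM-dd'T'HH:mm:ss'Z'". This is only a sketch under stated assumptions: the function name toIsoPubDate is hypothetical, the input layout is assumed to be MM/dd/yyyy hh:mm:ss a, and the input is assumed to already be in UTC.

```javascript
// Hypothetical globals function: converts "MM/dd/yyyy hh:mm:ss a" (assumed
// to be UTC already) into the supported "yyyy-MM-dd'T'HH:mm:ss'Z'" layout.
function toIsoPubDate(raw) {
  var m = /^(\d{2})\/(\d{2})\/(\d{4}) (\d{2}):(\d{2}):(\d{2}) (AM|PM)$/.exec(raw);
  if (m === null) {
    return ''; // assumption: signal an unparseable date with an empty string
  }
  // Convert the 12-hour clock to 24-hour: 12 AM becomes 00, 12 PM stays 12
  var hour = parseInt(m[4], 10) % 12;
  if (m[7] === 'PM') {
    hour += 12;
  }
  var hh = (hour < 10 ? '0' : '') + hour;
  return m[3] + '-' + m[1] + '-' + m[2] + 'T' + hh + ':' + m[5] + ':' + m[6] + 'Z';
}

console.log(toIsoPubDate('07/09/2001 06:33:32 PM'));
```

It would be called from docMetadata.publishedDate in the same way as createPubDate, passing the date string rather than the whole metadata object.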
displayUrl
"displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the way that it is used in the Infinit.e GUI is as follows:
- If it starts with "http://" then it is treated as a web link
- Otherwise, it is assumed to be a relative file path to the fileshare specified in the source url field. (eg you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
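The two rules above can be sketched as a small function. This mirrors only the behavior described for the GUI and is not taken from the Infinit.e code base: the function name, the returned object shape, and the sample values are hypothetical, and sourceKey is assumed here to be a plain string that can be concatenated directly with displayUrl.

```javascript
// Hypothetical helper: decides how a displayUrl would be interpreted,
// following the two rules described for the GUI.
function resolveDisplayUrl(doc) {
  var displayUrl = doc.displayUrl;
  if (displayUrl.indexOf('http://') === 0) {
    // Rule 1: values starting with "http://" are treated as web links
    return { type: 'web', target: displayUrl };
  }
  // Rule 2: anything else is a relative file path on the source fileshare;
  // the sourceKey is concatenated with the displayUrl to identify the file
  return { type: 'file', target: doc.sourceKey + displayUrl };
}

// Hypothetical sample documents
var webDoc = { sourceKey: 'example.source.', displayUrl: 'http://example.com/report.html' };
var fileDoc = { sourceKey: 'example.source.', displayUrl: '/reports/report.pdf' };

console.log(resolveDisplayUrl(webDoc));
console.log(resolveDisplayUrl(fileDoc));
```

Only the "http://" prefix is named in the description above, so the sketch does not treat other schemes as web links.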
If a date matches none of the supported date formats, it is passed to the JChronic NLP package; however, that has a low success rate.
Examples
Setting Metadata Values
When document metadata is extracted from a source (via the File, Database, or other technique), each extracted field is captured in the Feed.metadata object. With docMetadata, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.
Web Feed Example (Twitter)
In the example, the docMetadata block references metadata objects using the $ operator. $SCRIPT is used where the value needs to be transformed before it is returned.
The block defines the following parameters for the document metadata:
- title
- description
- fullText
- publishedDate
- geotag
Code Block |
---|
}, {
"docMetadata": {
"title": "$metadata.json.body",
"description": "$metadata.json.body",
"fullText": "$metadata.json.body",
"publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
"geotag": {
"lat": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[0];} catch (err) {return '';})",
"lon": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[1];} catch (err) {return '';})"
}
}
}, |
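To make the two $SCRIPT expressions above easier to follow, the sketch below runs the same logic outside the platform. The sample objects standing in for _doc are hypothetical, as are the wrapper function names; only the expression bodies come from the example.

```javascript
// Hypothetical stand-ins for _doc: one tweet with geo data, one without
var docWithGeo = {
  metadata: { json: [{
    body: 'sample text',
    postedTime: '2012-09-06T18:42:01.000Z',
    geo: { coordinates: [38.9, -77.0] }
  }] }
};
var docWithoutGeo = {
  metadata: { json: [{ body: 'sample text', postedTime: '2012-09-06T18:42:01.000Z' }] }
};

// Same body as the publishedDate $SCRIPT: strips the millisecond part so the
// value matches the supported "yyyy-MM-dd'T'HH:mm:ss'Z'" format
function publishedDate(_doc) {
  return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/, 'Z');
}

// Same body as the geotag "lat" $SCRIPT: a missing geo object throws a
// TypeError, which is caught and turned into an empty string
function lat(_doc) {
  try { return _doc.metadata.json[0].geo.coordinates[0]; } catch (err) { return ''; }
}

console.log(publishedDate(docWithGeo));
console.log(lat(docWithGeo));
console.log(lat(docWithoutGeo));
```

The try/catch matters because not every record carries a geo object; without it the script would fail on such records instead of returning an empty value.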
"Office" Documents Example
In this example, docMetadata extracts the subject line of an email from the file metadata and sets it as the title of the resulting document.
Code Block |
---|
}, {
"docMetadata": {
"title": "$SCRIPT( return _doc.metadata._FILE_METADATA_[0].metadata.subject[0];)"
}
}, |
In the sample output we can see the "title" that was set using the docMetadata script.
Code Block |
---|
{
"_id": "5048efb0e4b01fd6455420ee",
"title": "RE: Testing Preschedule workspace",
"url": "smb://modus:139/enron/testing/semperger-c/deleted_items/37QTKE~3",
"created": "Sep 6, 2012 06:42:01 PM UTC",
"modified": "Jul 24, 2012 01:13:02 AM UTC",
"publishedDate": "Jul 9, 2001 06:33:32 PM UTC",
"source": [
"Enron Emails (TextRank)"
],
"sourceKey": [
"modus.139.enron.testing.."
],
"mediaType": [
"Email"
],
"description": "I am trying to pull it up now, it's taking a long time\r\n\r\n \r\nFrom: \tSmith, Will \r\nSent:\tMonday, July 09, 2001 11:28 AM\r\nTo:\tSemperger, Cara\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nYes, but Vish made the changes in Table Edit. : - )\r\n\r\nWill\r\n\r\n \r\nFrom: \tSemperger, Cara \r\nSent:\tMonday, July 09, 2001 1:20 PM\r\nTo:\tSmith, Will\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nSo, this table edit that Brett is asking me to test is really from ", |
In this example, $SCRIPT is used to set the values for the geotag elements city, country, and stateProvince. The source also imports functions and variables through globals.
Code Block |
---|
}, {
"docMetadata": {
"title": "$metadata.subject",
"description": "$metadata.summary",
"publishedDate": "$metadata.incidentdate",
"geotag": {
"city": "$SCRIPT( return _doc.metadata.location[0].citystateprovince.city; )",
"country": "$SCRIPT( return _doc.metadata.location[0].country; )",
"stateProvince": "$SCRIPT( return _doc.metadata.location[0].citystateprovince.stateprovince; )"
}
}
}, |
Globals:
Code Block |
---|
{
"globals": {
"scripts": [
"function getLocationEntity() { var s = (_iterator.citystateprovince.city != null) ? _iterator.citystateprovince.city : ''; s+= (s.length > 0) ? ',' : ''; s+= (_iterator.citystateprovince.stateprovince != null) ? _iterator.citystateprovince.stateprovince : ''; s+= (s.length > 0) ? ',' : ''; s+= (_iterator.country != null) ? _iterator.country : ''; return s; } function getVictim() { var indicator = (_iterator.indicator != 'Unknown') ? _iterator.indicator : ''; var victimType = (_iterator.victimtype != 'Unknown') ? _iterator.victimtype : ''; var child = (_iterator.child == 'Yes') ? 'Child' : 'Adult'; var combatant = (_iterator.combatant == 'Yes') ? 'Combatant' : ''; var targeted = (_iterator.targetedcharacteristic != 'None' && _iterator.targetedcharacteristic != 'Unknown') ? _iterator.targetedcharacteristic : ''; var defining = (_iterator.definingcharacteristic != 'None' &&_iterator.definingcharacteristic != 'Unknown') ? _iterator.definingcharacteristic : ''; var s = indicator; if (victimType.length > 0) { if (s.length > 0) { s += ', '; } s += victimType; } if (s.length > 0) { s += ', '; } s += child; if (combatant.length > 0) { if (s.length > 0) { s += ', '; } s += combatant; } if (targeted.length > 0) { if (s.length > 0) { s += ', '; } s += targeted; } if (defining.length > 0) { if (s.length > 0) { s += ', '; } s += defining; } if (s.length > 0) { s += ' from '; } s += _iterator.nationality; return s; } function getVictimCount() { var count = parseInt(_iterator.deadcount, 10) + parseInt(_iterator.woundedcount, 10); return count; } function getEventType() { var s = _value; if (_doc.metadata.assassination[0] == 'Yes') s += ', Assassination'; if (_doc.metadata.suicide[0] == 'Yes') s += ', Suicide'; if (_doc.metadata.ied[0] == 'Yes') s += ', IED'; return s; } function getEventTypeFull() { var s = _doc.metadata.eventtype[0]; if (_doc.metadata.assassination[0] == 'Yes') s += ', Assassination'; if (_doc.metadata.suicide[0] == 'Yes') s += ', Suicide'; if 
(_doc.metadata.ied[0] == 'Yes') s += ', IED'; return s;} function isOrganizationSpecified() { if (_doc.metadata.organization != null && _doc.metadata.organization[0].toString().toLowerCase() == 'no group') { return false; } else { return true; } }function getOrganizationName() { if (_doc.metadata.organization != null && _doc.metadata.organization[0].toString().toLowerCase() != 'no group') { return _doc.metadata.organization[0]; } }"
]
}
} |
Output:
The output of the example source returns the location information pertaining to the source data.
Code Block |
---|
"location": [{ "citystateprovince": {
"city": "Manugay",
"stateprovince": "Kunar"
},
"country": "Afghanistan",
"region": "South Asia" |
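As a rough check of how the pieces fit together, the sketch below applies the geotag expressions and the getLocationEntity function from the globals script to the location record shown in the output. Two things differ from the platform and are assumptions of this sketch: the record is passed in as a parameter rather than read from the _iterator and _doc globals, and the function name buildGeotag is hypothetical.

```javascript
// The location record from the sample output
var sampleLocation = {
  citystateprovince: { city: 'Manugay', stateprovince: 'Kunar' },
  country: 'Afghanistan',
  region: 'South Asia'
};

// Same logic as getLocationEntity in the globals script, with the iterator
// passed as a parameter instead of being read from a global
function getLocationEntity(_iterator) {
  var s = (_iterator.citystateprovince.city != null) ? _iterator.citystateprovince.city : '';
  s += (s.length > 0) ? ',' : '';
  s += (_iterator.citystateprovince.stateprovince != null) ? _iterator.citystateprovince.stateprovince : '';
  s += (s.length > 0) ? ',' : '';
  s += (_iterator.country != null) ? _iterator.country : '';
  return s;
}

// Hypothetical wrapper: same field lookups as the three geotag $SCRIPT
// expressions in the docMetadata block
function buildGeotag(location) {
  return {
    city: location.citystateprovince.city,
    country: location.country,
    stateProvince: location.citystateprovince.stateprovince
  };
}

console.log(getLocationEntity(sampleLocation));
console.log(buildGeotag(sampleLocation));
```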