Code Block
{
	"display": string,
	"docMetadata": {} // see DocumentSpecPojo below
}
//////////////////////////////////
 
	public static class DocumentSpecPojo {
		public String title; // The string expression or $SCRIPT(...) specifying the document title
		public String description; // The string expression or $SCRIPT(...) specifying the document description
		public String publishedDate; // The string expression or $SCRIPT(...) specifying the document publishedDate
		public String fullText; // The string expression or $SCRIPT(...) specifying the document fullText
		public String displayUrl; // The string expression or $SCRIPT(...) specifying the document displayUrl
		public Boolean appendTagsToDocs; // if true (*NOT* default) source tags are appended to the document 
		public StructuredAnalysisConfigPojo.GeoSpecPojo geotag; // Specify a document level geo-tag
	}

Legacy documentation:

TODO

 

Description

You can use docMetadata to set specific values for a document's metadata. docMetadata has the following parameters:

Parameter | Description | Note
title | The string expression or $SCRIPT(...) |
description | Ibid. |
publishedDate | Ibid. |
fullText | Ibid. |
displayUrl | Ibid. | See the displayUrl notes after this table
appendTagsToDocs | If true (not the default), source tags are appended to the document |

displayUrl: "displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the Infinit.e GUI uses it as follows:

  • If it starts with "http://", it is treated as a web link.
  • Otherwise, it is assumed to be a file path relative to the fileshare specified in the source url field (e.g. you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
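The displayUrl convention described above can be sketched in a few lines of JavaScript. This is only an illustration of the GUI rule, not platform code: the function name classifyDisplayUrl, the returned object shape, and the sample values are all hypothetical, and the exact way sourceKey and displayUrl are joined for the "Document - File - Get" call is assumed to be plain concatenation.

```javascript
// Illustrative only: mirrors the GUI convention described above.
// classifyDisplayUrl and the { kind, target } shape are hypothetical names.
function classifyDisplayUrl(displayUrl, sourceKey) {
    if (displayUrl.indexOf("http://") === 0) {
        // Starts with "http://": treated as a web link
        return { kind: "web", target: displayUrl };
    }
    // Otherwise: a path relative to the fileshare in the source url field.
    // Assumption: the file identifier is sourceKey concatenated with displayUrl.
    return { kind: "file", target: sourceKey + displayUrl };
}

console.log(classifyDisplayUrl("http://example.com/report.html", "my.source.key"));
console.log(classifyDisplayUrl("/reports/2012/q3.pdf", "my.source.key"));
```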

 

Setting Metadata Values

When structured data is extracted from a source (via the File, Database, or other harvester), each extracted field is captured in the Feed.metadata object. Within the Structured Analysis Harvester, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.

Examples

title, description

The following example demonstrates how to extract the title, description, and other parameters from ingested data.

...

Web Feed Example (Twitter)

In this example, docMetadata is a complex type whose parameters are set either by a field reference such as "$metadata.json.body" or by a $SCRIPT(...) expression.

The field references and scripts define the following document metadata parameters:

  • title
  • description
  • fullText
  • publishedDate
  • geotag

 

Code Block
},        {
            "docMetadata": {
                "title": "$metadata.json.body",
                "description": "$metadata.json.body",
                "fullText": "$metadata.json.body",
                "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)",
                "geotag": {
                    "lat": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[0];} catch (err) {return '';})",
                    "lon": "$SCRIPT( try {return _doc.metadata.json[0].geo.coordinates[1];} catch (err) {return '';})"
                }
            }
        },

 

We can see in this example that the scripts parse the data from the Twitter feed in order to set the metadata values.
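To make the two script patterns in this example concrete, the following sketch runs the same expressions against a mock document. The _doc object here is hypothetical: its field values, including the postedTime layout with a millisecond part, are assumptions made for illustration, and the function names publishedDate and coordinate do not exist in the platform.

```javascript
// Hypothetical document shaped like the Twitter example; the values are made up.
var _doc = {
    metadata: {
        json: [{
            body: "example tweet text",
            postedTime: "2012-09-06T18:42:01.000Z",
            geo: { coordinates: [40.71, -74.0] }
        }]
    }
};

// Same expression as the publishedDate $SCRIPT: drop the millisecond part.
function publishedDate(doc) {
    return doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/, 'Z');
}

// Same pattern as the geotag $SCRIPTs: return '' when the field is missing.
function coordinate(doc, index) {
    try {
        return doc.metadata.json[0].geo.coordinates[index];
    } catch (err) {
        return '';
    }
}

console.log(publishedDate(_doc));                          // 2012-09-06T18:42:01Z
console.log(coordinate(_doc, 0));                          // 40.71
console.log(coordinate({ metadata: { json: [{}] } }, 0));  // prints an empty string
```

The try/catch matters because not every tweet carries a geo object; without it, the missing field would raise an error instead of yielding an empty value.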

Code Block
         ],      

...

"fullText": "$metadata.json.body",

...

 

In the document above you can extract the Offense field using the following syntax:

Code Block
$metadata.offense or ${metadata.offense}

Other fields at the document top level ("$title", "$description", etc.) can also be referenced this way.

Info

Note: When data is extracted and added to the Metadata object, all field names are converted to lowercase.

Info
Note: If the metadata field is an array, the above syntax grabs the first element only. To go deeper into arrays, JavaScript must be used.

      "script": "function getAddressVal( addressStr, number) { try { var addressArray = addressStr.split(/ *, */); if (addressArray != null && addressArray.length > 0) { if (addressArray[number].toLowerCase()=='ny') { return 'new york'; } else if (addressArray[number].toLowerCase()=='long island' || addressArray[number].toLowerCase()=='li') { return 'medford'; } else { return addressArray[number]; } } else { return ''; } } catch (err) { return ''; } } function getRegion( code ) { if (code.toLowerCase()=='ny') {return 'New York';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else if (code.toLowerCase()=='ct') {return 'Connecticut';} else if (code.toLowerCase()=='md') {return 'Maryland';} else if (code.toLowerCase()=='va') {return 'Virginia';} else if (code.toLowerCase()=='pa') {return 'Pennsylvania';} else if (code.toLowerCase()=='nj') {return 'New Jersey';} else {return 'New York';} }",
      "scriptEngine": "javascript",
      "title": "$metadata.json.body",
      "url": "$metadata.json.link",
      "publishedDate": "$SCRIPT(return _doc.metadata.json[0].postedTime.replace(/.[0-9]{3}Z/,'Z');)"
    },
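The field-reference rules described above ("$field" or "${field}" syntax, lowercase metadata names, first element only for arrays) can be sketched as a small resolver. This is an illustration of the documented behaviour, not the platform's implementation: resolveField is a hypothetical helper, and the sample document is invented.

```javascript
// Sketch only: resolveField is a hypothetical helper, not a platform function.
function resolveField(doc, expression) {
    // Accept both "$metadata.offense" and "${metadata.offense}"
    var path = expression.replace(/^\$\{?/, "").replace(/\}$/, "");
    var parts = path.split(".");
    var value = doc;
    for (var i = 0; i < parts.length; i++) {
        if (value === null || value === undefined) return undefined;
        value = value[parts[i]];
        // Arrays yield their first element only
        if (Array.isArray(value)) value = value[0];
    }
    return value;
}

// Metadata field names are lowercase, as they would be after extraction.
var sampleDoc = {
    title: "Incident report",
    metadata: { offense: ["Burglary", "Trespass"] }
};

console.log(resolveField(sampleDoc, "$metadata.offense"));   // Burglary
console.log(resolveField(sampleDoc, "${metadata.offense}")); // Burglary
console.log(resolveField(sampleDoc, "$title"));              // Incident report
```

Reaching the second offense ("Trespass") is not possible with this syntax, which is why a $SCRIPT(...) with explicit indexing is needed for anything beyond the first array element.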
Info

Note: When iterating over entities or metadata (for either entity or association building), the "$" sign is relative to the iterator (e.g. the metadata object being looped over), not the document. However, when iterating over metadata fields that are strings, the above document-level referencing is still valid, and "$value"/"${value}" can be used to reference the value itself.

publishedDate

The publishedDate value can be supplied in any of the following date formats:

Code Block
		if (null == _allowedDatesArray_startsWithLetter) 
		{
			_allowedDatesArray_startsWithLetter = new String[] {
					DateFormatUtils.SMTP_DATETIME_FORMAT.getPattern(),
					
					"MMM d, yyyy hh:mm a",
					"MMM d, yyyy HH:mm",
					"MMM d, yyyy hh:mm:ss a",
					"MMM d, yyyy HH:mm:ss",
					"MMM d, yyyy hh:mm:ss.SS a",
					"MMM d, yyyy HH:mm:ss.SS",
					
					"EEE MMM dd HH:mm:ss zzz yyyy",
					"EEE MMM dd yyyy HH:mm:ss zzz",
					"EEE MMM dd yyyy HH:mm:ss 'GMT'Z (zzz)",					
			};					
			_allowedDatesArray_numeric_1 = new String[] {
					"yyyy-MM-dd'T'HH:mm:ss'Z'",
					DateFormatUtils.ISO_DATE_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATE_TIME_ZONE_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATETIME_FORMAT.getPattern(),
					DateFormatUtils.ISO_DATETIME_TIME_ZONE_FORMAT.getPattern()
			};
			_allowedDatesArray_numeric_2 = new String[] {					
					"yyyyMMdd",
					"yyyyMMdd hh:mm a",
					"yyyyMMdd HH:mm",
					"yyyyMMdd hh:mm:ss a",
					"yyyyMMdd HH:mm:ss",
					"yyyyMMdd hh:mm:ss.SS a",
					"yyyyMMdd HH:mm:ss.SS",
					// Julian, these are unlikely
					"yyyyDDD",
					"yyyyDDD hh:mm a",
					"yyyyDDD HH:mm",
					"yyyyDDD hh:mm:ss a",
					"yyyyDDD HH:mm:ss",
					"yyyyDDD hh:mm:ss.SS a",
					"yyyyDDD HH:mm:ss.SS",
				};
			_allowedDatesArray_stringMonth = new String[] {
					"dd MMM yy",
					"dd MMM yy hh:mm a",
					"dd MMM yy HH:mm",
					"dd MMM yy hh:mm:ss a",
					"dd MMM yy HH:mm:ss",
					"dd MMM yy hh:mm:ss.SS a",
					"dd MMM yy HH:mm:ss.SS",
				};
			_allowedDatesArray_numericMonth = new String[] {
					"MM dd yy",
					"MM dd yy hh:mm a",
					"MM dd yy HH:mm",
					"MM dd yy hh:mm:ss a",
					"MM dd yy HH:mm:ss",
					"MM dd yy hh:mm:ss.SS a",
					"MM dd yy HH:mm:ss.SS",
			};
		}

 

If the date doesn't match one of these formats, add a function along the following lines in the globals script:

Code Block
// substitute YOUR.DATE.FIELD, and the date format
function createPubDate(metadata) {
    var date = metadata.YOUR.DATE.FIELD;
    var parsedDate = new java.text.SimpleDateFormat('MM/dd/yyyy hh:mm:ss a (zzz)').parse(date);
    return '' + parsedDate.toString();
}

and then you can call it from the docMetadata.publishedDate field like:

Code Block
{
	"docMetadata": {
		//...	
		"publishedDate": "$SCRIPT( return createPubDate(_doc.metadata); )"
		//...
	}
}
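The same kind of conversion can also be sketched in plain JavaScript, without the Java date classes. This is a minimal sketch under stated assumptions, not the platform's method: parseUsDate is a hypothetical name, the input is assumed to look like "MM/dd/yyyy hh:mm:ss AM|PM" with no time zone suffix, and the value is treated as UTC. It returns the yyyy-MM-dd'T'HH:mm:ss'Z' layout, which appears in the format list above.

```javascript
// Hypothetical pure-JavaScript alternative to the SimpleDateFormat approach.
// Assumes input like "07/09/2001 06:33:32 PM" and treats it as UTC.
function parseUsDate(text) {
    var m = /^(\d{2})\/(\d{2})\/(\d{4}) (\d{2}):(\d{2}):(\d{2}) (AM|PM)$/.exec(text);
    if (m === null) {
        return '';                             // unrecognized layout
    }
    var hour = parseInt(m[4], 10) % 12;        // 12 AM and 12 PM both become 0 here
    if (m[7] === 'PM') {
        hour += 12;                            // afternoon hours: add 12
    }
    var date = new Date(Date.UTC(
        parseInt(m[3], 10),                    // year
        parseInt(m[1], 10) - 1,                // month (zero-based)
        parseInt(m[2], 10),                    // day
        hour,
        parseInt(m[5], 10),                    // minutes
        parseInt(m[6], 10)));                  // seconds
    // Strip the millisecond part so the result matches yyyy-MM-dd'T'HH:mm:ss'Z'
    return date.toISOString().replace(/\.[0-9]{3}Z$/, 'Z');
}

console.log(parseUsDate("07/09/2001 06:33:32 PM")); // 2001-07-09T18:33:32Z
```

Returning an empty string for unrecognized input follows the fallback style of the geotag scripts earlier on this page.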

 

displayUrl

"displayUrl" sets the corresponding document JSON field. It is guaranteed not to be used by the Infinit.e platform. It is therefore useful for linking documents to external content. For reference, the way that it is used in the Infinit.e GUI is as follows:

...

 

 

"Office" documents Example

In this example, docMetadata extracts the subject line of an email and sets it as the title of the resulting document.

Code Block
 },        {
            "docMetadata": {
                "title": "$SCRIPT( return _doc.metadata._FILE_METADATA_[0].metadata.subject[0];)"
            }
        },

 

In the sample output, title has been set by the docMetadata script. The remaining document fields follow, ending with the entities array of JSON objects.

Code Block
{
    "_id": "5048efb0e4b01fd6455420ee",
    "title": "RE: Testing Preschedule workspace",
    "url": "smb://modus:139/enron/testing/semperger-c/deleted_items/37QTKE~3",
    "created": "Sep 6, 2012 06:42:01 PM UTC",
    "modified": "Jul 24, 2012 01:13:02 AM UTC",
    "publishedDate": "Jul 9, 2001 06:33:32 PM UTC",
    "source": [
        "Enron Emails (TextRank)"
    ],
    "sourceKey": [
        "modus.139.enron.testing.."
    ],
    "mediaType": [
        "Email"
    ],
    "description": "I am trying to pull it up now, it's taking a long time\r\n\r\n \r\nFrom: \tSmith, Will \r\nSent:\tMonday, July 09, 2001 11:28 AM\r\nTo:\tSemperger, Cara\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nYes, but Vish made the changes in Table Edit. : - )\r\n\r\nWill\r\n\r\n \r\nFrom: \tSemperger, Cara \r\nSent:\tMonday, July 09, 2001 1:20 PM\r\nTo:\tSmith, Will\r\nSubject:\tRE: Testing Preschedule workspace\r\n\r\nSo, this table edit that Brett is asking me to test is really from ",
    "entities": [
        {
            "disambiguated_name": "on- june 18-paloverde-day",
            "index": "on- june 18-paloverde-day/keyword",
            "actual_name": "on- june 18-paloverde-day",
            "type": "Keyword",
            "relevance": 0.10585404743253149,
            "frequency": 1,
            "totalfrequency": 12,
            "doccount": 12,
            "dimension": "What"
        },
        {
            "disambiguated_name": "mulitple times additional data",
            "index": "mulitple times additional data/keyword",
            "actual_name": "mulitple times additional data",
            "type": "Keyword",
            "relevance": 0.18088061045762382,
            "frequency": 1,
            "totalfrequency": 12,
            "doccount": 12,
            "dimension": "What"
        },

 

displayUrl

 

 

IN PROGRESS

Legacy documentation:

...