UnstructuredAnalysis object

JSON format

Note that there is a separate overview of using the Unstructured Analysis Harvester. This page is reference information.

The UnstructuredAnalysis object of the Source document:

Source.unstructuredAnalysis object
{
   "script" : "string", // OPTIONAL: String, can contain one or more JavaScript functions executed globally and then referenced from meta or simpleTextCleanser,
                // e.g. "function func() { var foo = 'test'; return foo; }"

   "simpleTextCleanser": [ { ... } ], //complex object (see below) transforming the text before metadata/entity extraction
 
   "meta" : [ { ... } ], //definition for objects to parse and place into the source metadata (see below)
 
   "caches": { "string": "string", ... }, // A map of caches in the format <CACHE_NAME>:<ID> where <ID> is the "_id" of a JSON share, see overview

   "headerRegEx" : "string", //regular expression string that represents the header of the document, see overview
   "footerRegEx" : "string" //regular expression string that represents the footer of the document, see overview
}
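
As a rough illustration of how the pieces above fit together, the sketch below builds one such object in JavaScript and serializes it to strict JSON. Only the key names come from this page; every value (the regexes, the field name "email", the cache name "myCache", the "<ID>" placeholder) is hypothetical.

```javascript
// Sketch only: key names come from the reference above, every value is hypothetical.
const unstructuredAnalysis = {
  // Global function, copied from the comment in the reference block
  script: "function func() { var foo = 'test'; return foo; }",
  // Replace anything that looks like an HTML tag in the full text with a space
  simpleTextCleanser: [
    { field: "fullText", script: "<[^>]+>", replacement: " ", flags: "i" }
  ],
  // Pull e-mail-like strings out of the body into the "email" metadata field
  meta: [
    { fieldName: "email", context: "Body", scriptlang: "regex",
      script: "[\\w.]+@[\\w.]+", groupNum: 0 }
  ],
  // "<ID>" is a placeholder, not a real share "_id"
  caches: { myCache: "<ID>" },
  headerRegEx: "^From:.*$",
  footerRegEx: "^Sent from .*$"
};

// Serialize to strict JSON (no comments, no trailing commas)
const sourceFragment = JSON.stringify({ unstructuredAnalysis }, null, 3);
console.log(sourceFragment);
```

Building the object in code and serializing it avoids the two things the commented block above is not: valid JSON and free of trailing commas.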

simpleTextCleanser object
simpleTextCleanser specification
{
   "field": "string", // The document field to transform, one of: "title", "description", "fullText", "metadata.<field>"
   "script": "string", // The regex to match the section of text to be modified (other scripts may be supported in the future)
   "scriptlang": "string", // Optional and currently unused, defaults to regex
   "replacement": "string", // The replacement string, can include replacement groups in standard Java syntax
   "flags": "string" // The flags (standard Posix/Java, eg "i" for case-insensitive), plus one addition: 'H' HTML-decodes the resulting string
} 
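
To make the specification above concrete, here is a minimal JavaScript emulation of a single cleanser entry. It is not the harvester's implementation: `applyCleanser` and `htmlDecode` are hypothetical helpers, only top-level fields such as "title" are handled (not "metadata.<field>"), only the "i" and "m" flags are passed through to the regex, and it assumes every match is replaced.

```javascript
// Decode a handful of named entities; a real HTML decoder handles many more.
function htmlDecode(s) {
  const entities = { amp: "&", lt: "<", gt: ">", quot: '"' };
  return s.replace(/&(amp|lt|gt|quot);/g, (_, name) => entities[name]);
}

// Apply one simpleTextCleanser entry to a document and return a modified copy.
function applyCleanser(doc, cleanser) {
  const flags = cleanser.flags || "";
  // Keep only flags JavaScript shares with Java here ("i", "m"); "g" replaces every match
  const jsFlags = "g" + [...new Set(flags)].filter((f) => "im".includes(f)).join("");
  const pattern = new RegExp(cleanser.script, jsFlags);
  // "$1"-style groups in the replacement work the same way in Java and JavaScript
  let result = String(doc[cleanser.field]).replace(pattern, cleanser.replacement);
  // 'H' is the addition described above: HTML-decode the resulting string
  if (flags.includes("H")) result = htmlDecode(result);
  return { ...doc, [cleanser.field]: result };
}

const cleaned = applyCleanser(
  { title: "Tom &amp; Jerry  (DRAFT)" },
  { field: "title", script: "\\s*\\(draft\\)", replacement: "", flags: "iH" }
);
console.log(cleaned.title);
```

Note that the 'H' decoding runs after the replacement, so the regex sees the still-encoded text.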

Note that the order of "text cleansing" vs all other operations is as follows:

  • For RSS objects, get the raw content via HTTP
  • Identification of header, footer, body on raw content
  • Processing of "meta" objects (see below) with "First" contexts
  • All text cleansing operations
  • Processing of "meta" objects (see below) with all other contexts ("Body", "Header", "Footer", "All")
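
The ordering above can be sketched as a small planning function. `planStages` is a hypothetical helper written for this page; it only lists the stages in order and does not perform any of them.

```javascript
// Return the processing stages, in order, for a list of "meta" entries.
function planStages(meta) {
  const first = meta.filter((m) => m.context === "First");
  const rest = meta.filter((m) => m.context !== "First");
  const label = (m) => `meta:${m.fieldName} (${m.context})`;
  return [
    "fetch raw content (RSS only)",
    "identify header/footer/body",
    ...first.map(label),   // "First" contexts see the uncleansed text
    "text cleansing",
    ...rest.map(label)     // all other contexts see the cleansed text
  ];
}

const stages = planStages([
  { fieldName: "author", context: "Body" },
  { fieldName: "rawDate", context: "First" }
]);
console.log(stages.join("\n"));
```

The practical consequence is that a "First" entry can capture something a later cleanser would strip out.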

Meta object
Source.unstructuredAnalysis.meta object
{
   "fieldName" : "string", //name used to define the metadata (see metadata link below)
   "context" : "string", // one of "First", "All", "Body", "Header", "Footer": specifies where in the document the metadata should be looked for
 
   "script" : "string", // javascript, regular expression, or xpath expression (based on "scriptlang") used to generate the metadata
   "scriptlang": "string", // "javascript", "regex", or "xpath"
 
   "replace" : "string", // when set, regular expression matches will be replaced with this string (can include groups, eg $1 for group 1, note $0 is the whole string)
   "groupNum" : int, // A quick alternative to using replace, eg "groupNum":2 is equivalent to "replace": "$2". "groupNum" has a special meaning for xpath, see below.
   "flags" : "string" // for regexes, the flags to apply (standard Posix/Java flags: "midun"). "flags" has additional meanings for "javascript" and for "xpath"/"regex", see below
}
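
The equivalence between "groupNum" and "replace" for regex fields can be illustrated with a short JavaScript emulation. `extractRegexMeta` is a hypothetical helper, not the harvester's code: it ignores "flags" and "context", handles only single-digit group references, and does not deduplicate.

```javascript
// Collect one value per regex match, using "replace" if set, otherwise "groupNum".
function extractRegexMeta(text, meta) {
  // "g" so matchAll can walk every match; other flags are ignored in this sketch
  const pattern = new RegExp(meta.script, "g");
  const values = [];
  for (const m of text.matchAll(pattern)) {
    if (typeof meta.replace === "string") {
      // Expand $0..$9 against the current match ($0 is the whole match)
      values.push(meta.replace.replace(/\$(\d)/g, (_, n) => m[Number(n)] || ""));
    } else {
      values.push(m[meta.groupNum || 0]);
    }
  }
  return values;
}

const sample = "from: a@x.org to: b@y.org";
const script = "(\\w+)@(\\w+\\.\\w+)";
const byGroup = extractRegexMeta(sample, { script, groupNum: 2 });
const byReplace = extractRegexMeta(sample, { script, replace: "$2" });
console.log(byGroup, byReplace);
```

"replace" is the more general of the two, since it can combine several groups with literal text (for example "$1 at $2").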

The metadata object is described here. In the javascript case, if an array of objects is returned, that array is embedded into the "metadata" map; if a single object is returned, it is embedded inside a single-element array (ie consistent with how all metadata objects are treated).
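
The array-wrapping rule just described amounts to the following. `embedScriptResult` is a hypothetical helper, and it assumes the result overwrites any existing value for the field, which this page does not state.

```javascript
// Store a script's return value under metadata[fieldName], always as an array.
function embedScriptResult(metadata, fieldName, result) {
  const values = Array.isArray(result) ? result : [result];
  // Assumption: the new value replaces whatever was stored before
  return { ...metadata, [fieldName]: values };
}

const single = embedScriptResult({}, "person", { name: "Alice" });
const many = embedScriptResult({}, "person", [{ name: "Alice" }, { name: "Bob" }]);
console.log(JSON.stringify(single), JSON.stringify(many));
```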

As described above, the regex/javascript is applied before text cleansing if the context is "First", after otherwise.

Examples are provided in the overview of using the Unstructured Analysis Harvester.

By default XPath and Regex fields are deduplicated, ie if the string "apple" is found twice for the same field name, then it is not added to the field array. In cases where multiple fields are being correlated based on index, this is obviously not desirable, and it can be turned off by setting the flag 'U' ("u" for "Unique", capitalization in flags typically denotes negation).
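
The sketch below shows why deduplication breaks index-based correlation between fields. `addToField` is a hypothetical helper whose `dedupe` parameter stands in for the default behaviour and the flag that disables it.

```javascript
// Append found values to a field array, optionally skipping values already present.
function addToField(existing, found, dedupe = true) {
  const out = existing.slice();
  for (const v of found) {
    if (dedupe && out.includes(v)) continue;
    out.push(v);
  }
  return out;
}

// Two fields meant to line up by index: names[i] goes with prices[i]
const names = ["apple", "pear", "apple"];
const prices = ["1.00", "2.00", "1.50"];

const dedupedNames = addToField([], names);       // second "apple" is dropped
const dedupedPrices = addToField([], prices);     // nothing to drop
const rawNames = addToField([], names, false);    // deduplication turned off
console.log(dedupedNames.length, dedupedPrices.length, rawNames.length);
```

With deduplication on, the names array is one element shorter than the prices array, so index 2 no longer refers to the same record in both.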

Javascript "meta" fields

If no flags are specified for a javascript "meta" field, then the following objects are available in the javascript:

  • text: The full text of the document
  • _iterator: An array consisting of the current value of that metadata field (null if the field does not currently exist). This object allows successive scripts with the same "fieldName" to perform a processing chain. It also allows the Unstructured Analysis Harvester to modify document-level metadata (eg from the Feed Harvester, File Harvester, or Database Harvester).

If the flags string is specified, it is a character sequence, with characters interpreted as follows:

  • 't': provides the full text of the document to the script, in the field "text"
  • 'd': provides the entire document object to the script, in the field "_doc"
  • 'm': provides the entire metadata object ("_doc.metadata") to the script, in the field "_metadata".
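
The mapping from flags to script variables described above can be sketched as follows. `scriptBindings` is a hypothetical helper, it assumes the document text lives in `doc.fullText`, and it leaves "_iterator" out when flags are given because this page does not say whether it is still provided in that case.

```javascript
// Work out which variables a javascript "meta" script would see.
function scriptBindings(doc, fieldName, flags) {
  if (!flags) {
    // Default: full text plus the current value of the field (null if absent)
    const current = doc.metadata && doc.metadata[fieldName];
    return { text: doc.fullText, _iterator: current === undefined ? null : current };
  }
  const bindings = {};
  if (flags.includes("t")) bindings.text = doc.fullText;
  if (flags.includes("d")) bindings._doc = doc;
  if (flags.includes("m")) bindings._metadata = doc.metadata;
  return bindings;
}

const exampleDoc = { fullText: "hello", metadata: { tags: ["a"] } };
console.log(Object.keys(scriptBindings(exampleDoc, "tags")));
console.log(Object.keys(scriptBindings(exampleDoc, "tags", "dm")));
```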

XPath "meta" fields

The following flags are supported for XPath (and regex, except for 'o'):

  • 'H': will HTML-decode resulting fields. (Eg "&amp;" -> "&")
  • 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields)
  • 'D': described above 
  • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata) 

"groupNum"/"replace" are used for 2 purposes:

  • If the "regex" extension is used (see below for explanatory link), then groupNum/replace are used to select the capturing group just like for a normal regex "meta" field.
  • If "groupNum" is set to -1, and the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array.
    • (Note this is deprecated, it is now recommended to use flags: 'o')
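
Since "groupNum":-1 is deprecated in favour of the 'o' flag, an existing XPath "meta" entry can be migrated mechanically. `preferObjectFlag` is a hypothetical helper written for this page, and the XPath expression in the example is made up.

```javascript
// Rewrite an XPath "meta" entry that uses the deprecated groupNum:-1 to use flag 'o'.
function preferObjectFlag(meta) {
  if (meta.scriptlang !== "xpath" || meta.groupNum !== -1) return meta;
  // Drop groupNum and keep every other key as it was
  const { groupNum, ...rest } = meta;
  const flags = rest.flags || "";
  return { ...rest, flags: flags.includes("o") ? flags : flags + "o" };
}

const migrated = preferObjectFlag({
  fieldName: "table",
  context: "All",
  scriptlang: "xpath",
  script: "//table[1]",
  groupNum: -1,
  flags: "H"
});
console.log(JSON.stringify(migrated));
```

Entries with any other "groupNum", or with a different "scriptlang", are returned unchanged, because there "groupNum" selects a capturing group.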

XPath support is discussed further here.