JSON format

Note that there is a separate overview of using the Unstructured Analysis Harvester. This page is reference information.

The UnstructuredAnalysis object of the Source document:

{
   "script" : "string", // OPTIONAL: one or more JavaScript functions, executed globally and then referenced from "meta" or "simpleTextCleanser",
                // e.g. "function func() { var foo = 'test'; return foo; }"

   "simpleTextCleanser": [ { ... } ], // complex object (see below) transforming the text before metadata/entity extraction

   "meta" : [ { ... } ], // definition for objects to parse and place into the source metadata (see below)

   "caches": { "string": "string", ... }, // A list of caches in the format <CACHE_NAME>:<ID> where <ID> is the "_id" of a JSON share, see overview

   "headerRegEx" : "string", // regular expression string that represents the header of the document, see overview
   "footerRegEx" : "string" // regular expression string that represents the footer of the document, see overview
}
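
As a rough illustration of how these fields fit together, the following JavaScript sketch builds a complete object and serializes it to JSON. Only the key names come from the reference above; every value (the helper function, the regexes, the cache name and the share id) is a hypothetical placeholder.

```javascript
// Sketch only: all values are hypothetical placeholders, not taken from a real source.
const unstructuredAnalysis = {
  // One global JavaScript function that "meta" entries could reference.
  script: "function firstLine(text) { return text.split('\\n')[0]; }",
  simpleTextCleanser: [
    // Collapse runs of whitespace in the full text.
    { field: "fullText", script: "\\s+", replacement: " " }
  ],
  meta: [
    // Intended to capture the digits after "Report ID:" under the name "reportId".
    { fieldName: "reportId", context: "All", scriptlang: "regex",
      script: "Report ID: (\\d+)", groupNum: 1 }
  ],
  // <CACHE_NAME>: <ID>; the id below is a made-up share "_id".
  caches: { lookupTable: "000000000000000000000000" },
  headerRegEx: "^CONFIDENTIAL$",
  footerRegEx: "^Page \\d+$"
};

// Strict JSON has no comments, so the annotations above disappear on serialization.
const unstructuredAnalysisJson = JSON.stringify(unstructuredAnalysis, null, 3);
console.log(unstructuredAnalysisJson);
```

Building the object in code and serializing it avoids hand-escaping backslashes twice, which is an easy mistake when regexes are typed directly into JSON.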
simpleTextCleanser object
{
   "field": "string", // The document field to transform, one of: "title", "description", "fullText", "metadata.<field>"
   "script": "string", // The regex matching the section of text to be modified (other script types may be supported in the future)
   "scriptlang": "string", // Optional and currently unused, defaults to regex
   "replacement": "string", // The replacement string, can include replacement groups in standard Java syntax
   "flags": "string" // The flags (standard Posix/Java, e.g. "i" for case-insensitive), with additions: 'H' HTML-decodes the resulting string
} 
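
To make the effect of a single cleanser entry concrete, here is a minimal sketch that emulates it with JavaScript's RegExp. The helper applyCleanser and the sample document are hypothetical. The harvester itself uses Java regexes, so this only approximates its behaviour: it assumes every match is replaced, it does not handle "metadata.<field>" targets, and it does not emulate the 'H' flag.

```javascript
// Sketch only: emulates a single simpleTextCleanser entry with JavaScript's RegExp.
// applyCleanser is a hypothetical helper; the harvester itself uses Java regexes.
function applyCleanser(doc, cleanser) {
  // Keep only flags that JavaScript shares with Java here ("i", "m", "s"),
  // dropping repeats; the 'H' (HTML-decode) addition is not emulated.
  const shared = (cleanser.flags || "").split("").filter(
    (f, i, all) => "ims".includes(f) && all.indexOf(f) === i);
  // Assumption: every match in the field is replaced, hence the "g" flag.
  const re = new RegExp(cleanser.script, "g" + shared.join(""));
  const cleaned = doc[cleanser.field].replace(re, cleanser.replacement);
  // Return a copy so the original document object is left untouched.
  return Object.assign({}, doc, { [cleanser.field]: cleaned });
}

const sampleDoc = { title: "Weekly report", fullText: "PAGE 1 of 3\nTotal: 42 units" };
const cleansedDoc = applyCleanser(sampleDoc, {
  field: "fullText",
  script: "page (\\d+) of \\d+\\s*",
  replacement: "[p$1] ",
  flags: "i"
});
console.log(cleansedDoc.fullText); // prints "[p1] Total: 42 units"
```

The "$1" group reference works the same way in this sketch as in Java replacement syntax, which is why the example can reuse the page number in the replacement.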

Note that the order of "text cleansing" vs all other operations is as follows:

Meta object
{
   "fieldName" : "string", // name used to define the metadata (see metadata link below)
   "context" : "string", // one of "First", "All", "Body", "Header", "Footer": specifies where in the document the metadata should be found

   "script" : "string", // JavaScript, regular expression, or XPath (according to "scriptlang") used to generate the metadata
   "scriptlang": "string", // "javascript", "regex", or "xpath"

   "replace" : "string", // when set, regular expression matches are replaced with this string (can include groups, e.g. $1 for group 1; note $0 is the whole match)
   "groupNum" : int, // A quick alternative to using "replace", e.g. "groupNum": 2 is equivalent to "replace": "$2". "groupNum" has a special meaning for xpath, see below.
   "flags" : "string" // for regexes, the flags to apply (standard Posix/Java flags: "midun"). "flags" also has meanings for "javascript" and "xpath" (some shared between "xpath" and "regex"), see below
}
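
The relationship between "replace" and "groupNum" for regex fields can be sketched as follows. extractMeta is a hypothetical helper written for this illustration, not part of the harvester. It also assumes that the whole match is used when neither option is set, which the reference above does not state.

```javascript
// Sketch only: extractMeta is a hypothetical helper that mimics how "replace"
// and "groupNum" select the output of each regex match.
function extractMeta(text, meta) {
  // Assumption: with neither option set, the whole match ($0) is used.
  const template = meta.replace !== undefined ? meta.replace
                                              : "$" + (meta.groupNum || 0);
  const values = [];
  for (const match of text.matchAll(new RegExp(meta.script, "g"))) {
    // Substitute $0..$9 with the corresponding group of this match.
    values.push(template.replace(/\$(\d)/g, (_, n) => match[Number(n)] || ""));
  }
  return values;
}

const sampleText = "name=alice; name=bob";
const pattern = "(\\w+)=(\\w+)";
const byGroupNum = extractMeta(sampleText, { script: pattern, groupNum: 2 });
const byReplace = extractMeta(sampleText, { script: pattern, replace: "$2" });
const combined = extractMeta(sampleText, { script: pattern, replace: "$1:$2" });
console.log(byGroupNum); // [ 'alice', 'bob' ]
console.log(byReplace);  // [ 'alice', 'bob' ]
console.log(combined);   // [ 'name:alice', 'name:bob' ]
```

"groupNum" covers the common case of keeping one capture group, while "replace" is needed when several groups are combined into a single value.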

The metadata object is described here. In the javascript case, if an array of objects is returned, that array is embedded into the "metadata" map; if a single object is returned, it is embedded inside a single-element array (i.e. consistent with how all metadata objects are treated).
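
The array-wrapping rule for javascript results can be illustrated with a small sketch. toMetadataValues and the sample values are hypothetical; the sketch only shows the shape of the result, not the harvester's actual code.

```javascript
// Sketch only: toMetadataValues is a hypothetical helper showing the wrapping rule.
function toMetadataValues(scriptResult) {
  // Arrays are embedded as they are; a single object becomes a one-element array.
  return Array.isArray(scriptResult) ? scriptResult : [scriptResult];
}

const metadata = {};
metadata.people = toMetadataValues([{ name: "alice" }, { name: "bob" }]);
metadata.summary = toMetadataValues({ words: 120 });
console.log(JSON.stringify(metadata));
```

Either way, each entry in the "metadata" map ends up holding an array, so downstream code can iterate over any field without checking its type first.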

As described above, the regex/javascript is applied before text cleansing if the context is "First", and after it otherwise.

Examples are provided in the overview of using the Unstructured Analysis Harvester.

By default XPath and regex fields are deduplicated, i.e. if the string "apple" is found twice for the same field name, the second occurrence is not added to the field array. Where multiple fields are being correlated by index this is not desirable, and deduplication can be turned off by setting the flag 'U' ("u" for "Unique"; capitalization in flags typically denotes negation).
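
The following sketch shows why deduplication breaks index-based correlation and how the 'U' flag avoids it. addValues and the sample data are hypothetical and only model the behaviour described above.

```javascript
// Sketch only: addValues is a hypothetical model of the deduplication rule.
function addValues(fieldArray, values, flags) {
  // 'U' switches deduplication off, so every extracted value is kept.
  const keepDuplicates = (flags || "").includes("U");
  for (const value of values) {
    if (keepDuplicates || !fieldArray.includes(value)) {
      fieldArray.push(value);
    }
  }
  return fieldArray;
}

const extractedNames = ["apple", "pear", "apple"];
const extractedPrices = ["1.00", "2.50", "1.20"];

const dedupedNames = addValues([], extractedNames, "");   // second "apple" dropped
const alignedNames = addValues([], extractedNames, "U");  // all three kept
const prices = addValues([], extractedPrices, "U");

console.log(dedupedNames.length, alignedNames.length, prices.length); // 2 3 3
```

With the default behaviour the names array has two entries against three prices, so index 2 of the prices has no matching name; with 'U' on both fields the arrays stay the same length.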

Javascript "meta" fields

If no flags are specified for a javascript "meta" field, then the following objects are available in the javascript:

If the flags string is specified, it is a character sequence, with characters interpreted as follows:

XPath "meta" fields

The following flags are supported for XPath (and regex, except for "O"):

"groupNum"/"replace" are used for 2 purposes:

XPath support is discussed further here.