JSON format

Note that there is a separate overview of using the Unstructured Analysis Harvester. This page is reference information.

The UnstructuredAnalysis object of the Source document:

Source.unstructuredAnalysis object
{
   "script" : "string", // OPTIONAL: String, can contain one or more JavaScript functions executed globally and then referenced from meta or simpleTextCleanser,
                // e.g. "function func() { var foo = 'test'; return foo; }"

   "simpleTextCleanser": [ { ... } ], // complex object (see below) transforming the text before metadata/entity extraction

   "meta" : [ { ... } ], // definition for objects to parse and place into the source metadata (see below)

   "caches": { "string": "string", ... }, // A map of caches in the format <CACHE_NAME>:<ID> where <ID> is the "_id" of a JSON share, see overview

   "headerRegEx" : "string", // regular expression string that represents the header of the document, see overview
   "footerRegEx" : "string" // regular expression string that represents the footer of the document, see overview
}
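
Because "script" is an ordinary JSON string, any quotes or line breaks in the JavaScript have to be escaped. The sketch below shows one way to build the object in JavaScript and let JSON.stringify do the escaping. The function func is taken from the comment above. The empty "simpleTextCleanser" and "meta" arrays are placeholders only, and the new Function call merely stands in for the harvester's own script engine, which may behave differently.

```javascript
// Example function to embed in the "script" field (same as in the comment above).
function func() { var foo = 'test'; return foo; }

// Assumption: empty arrays are used here purely as placeholders; the element
// formats for "simpleTextCleanser" and "meta" are described separately.
var unstructuredAnalysis = {
  script: func.toString(),   // function source as a plain string
  simpleTextCleanser: [],
  meta: []
};

// JSON.stringify escapes quotes and newlines inside the script string.
var json = JSON.stringify(unstructuredAnalysis, null, 3);

// Round trip: parse the JSON and evaluate the script, as a rough stand-in
// for "executed globally and then referenced".
var roundTrip = JSON.parse(json);
var result = new Function(roundTrip.script + "; return func();")();
```

The resulting json string can be pasted into the source as the value of "unstructuredAnalysis".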

...

  • 't': provides the full text of the document to the script, in the field "text"
  • 'd': provides the entire document object to the script, in the field "_doc"
  • 'm': provides the entire metadata object ("_doc.metadata") to the script, in the field "_metadata"
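
As a rough illustration of the three flags above, the sketch below defines stand-ins for "text", "_doc" and "_metadata" and two functions that read them. The document contents, the "title" and "author" fields, and the function names countWords and firstAuthor are all invented for the example. In a real source the harvester supplies the variables and the functions would live in "script".

```javascript
// Stand-ins for the variables the harvester would supply (values are invented).
var text = "Alice met Bob in Paris.";                               // flag 't'
var _doc = { title: "Example", metadata: { author: ["Alice"] } };   // flag 'd'
var _metadata = _doc.metadata;                                      // flag 'm'

// Hypothetical function using the full text: counts whitespace-separated words.
function countWords() {
  return text.split(/\s+/).filter(function (w) { return w.length > 0; }).length;
}

// Hypothetical function using the metadata object: returns the first value
// of the (assumed) "author" field, or null if the field is absent.
function firstAuthor() {
  return _metadata.author ? _metadata.author[0] : null;
}
```

Only the variables matching the flags that are set would be available, so a function like firstAuthor would need the 'm' flag.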

XPath "meta" fields

The following flags are supported for XPath (and for regex, except for 'o'):

  • 'H': will HTML-decode resulting fields (e.g. "&amp;" -> "&")
  • 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array (can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (e.g. no stripping of fields)
  • 'D': described above
  • 'c': if set, then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
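
To make the effect of the 'H' flag concrete, the sketch below decodes a handful of common HTML entities. It is an approximation for illustration only, not the harvester's implementation. The function name htmlDecode is hypothetical, and the set of entities the harvester actually handles is not specified here.

```javascript
// Illustrative only: decodes a few common entities, roughly what 'H' implies.
function htmlDecode(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&'); // "&amp;" is decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

var decoded = htmlDecode("Tom &amp; Jerry");
```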

...