JSON format
Note that there is a separate overview of using the Unstructured Analysis Harvester. This page is reference information.
The UnstructuredAnalysis object of the Source document:
    {
        "script" : "string", // OPTIONAL: String, can contain one or more JavaScript functions executed globally and then referenced from meta or simpleTextCleanser,
                             // e.g. "function func() { var foo = 'test'; return foo; }"
        "simpleTextCleanser": [ { ... } ], // complex object (see below) transforming the text before metadata/entity extraction
        "meta" : [ { ... } ], // definition for objects to parse and place into the source metadata (see below)
        "caches": { "string": "string", ... }, // A list of caches in the format <CACHE_NAME>:<ID> where <ID> is the "_id" of a JSON share, see overview
        "headerRegEx" : "string", // regular expression string that represents the header of the document, see overview
        "footerRegEx" : "string" // regular expression string that represents the footer of the document, see overview
    }
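As a rough illustration of how these top-level fields fit together, the sketch below builds such an object in JavaScript and serializes it. Only the field names come from the reference above; the cache name, the share ID placeholder, and both regular expressions are hypothetical values chosen for the example, and the two array fields are left empty because their element formats are described elsewhere.

```javascript
// Sketch only: every value below is a made-up placeholder.
const unstructuredAnalysis = {
  // One or more JavaScript functions, supplied as a single string.
  script: "function func() { var foo = 'test'; return foo; }",
  simpleTextCleanser: [], // element format not shown here
  meta: [],               // element format not shown here
  // <CACHE_NAME>:<ID>, where the ID would be the "_id" of a JSON share (hypothetical values).
  caches: { myLookupCache: "SHARE_ID_PLACEHOLDER" },
  headerRegEx: "^Page header:.*$", // hypothetical pattern
  footerRegEx: "^Page footer:.*$"  // hypothetical pattern
};

// Serialize to the JSON text that would be pasted into the source document.
const serialized = JSON.stringify(unstructuredAnalysis, null, 4);

// Sanity check that the "script" string is valid JavaScript by evaluating it locally.
const scriptResult = new Function(unstructuredAnalysis.script + " return func();")();

console.log(serialized);
console.log(scriptResult);
```

Building the object in code and serializing it avoids hand-escaping quotes inside the "script" string, which is easy to get wrong when editing the JSON directly.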
...
- 't': provides the full text of the document to the script, in the field "text"
- 'd': provides the entire document object to the script, in the field "_doc"
- 'm': provides the entire metadata object ("_doc.metadata") to the script, in the field "_metadata".
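The sketch below shows how a script might read the three fields named above. The variables `text`, `_doc` and `_metadata` are declared here as stand-ins so that the example runs on its own; in the harvester they would be supplied according to the flags. The sample values, the `title` field, the `author` metadata field, and both function names are assumptions made for illustration, and how the harvester consumes a script's result is not covered here.

```javascript
// Stand-ins for the fields the flags provide; all values are made up.
var text = "Contact: alice@example.com or bob@example.com";       // 't' flag
var _doc = { title: "Sample", metadata: { author: ["carol"] } };  // 'd' flag
var _metadata = _doc.metadata;                                    // 'm' flag ("_doc.metadata")

// Hypothetical helper: pull e-mail-like strings out of the full text.
function extractEmails() {
  var matches = text.match(/[\w.+-]+@[\w-]+\.[\w.]+/g);
  return matches || [];
}

// Hypothetical helper: combine a document field with a metadata field.
function describeDoc() {
  return _doc.title + " by " + _metadata.author.join(", ");
}

console.log(extractEmails());
console.log(describeDoc());
```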
XPath "meta" fields
The following flags are supported for XPath (and for regex, except for 'o'):
- 'H': will HTML-decode resulting fields. (e.g. "&amp;" -> "&")
- 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
- 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (e.g. no stripping of fields)
- 'D': described above
- 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
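To make the effect of the 'H' flag concrete, here is a minimal sketch of HTML entity decoding. It is not the harvester's decoder: the function name `htmlDecode` is hypothetical, and only a handful of named entities plus numeric references are handled.

```javascript
// Illustrative HTML-decoder; covers a small set of named entities and numeric references.
function htmlDecode(s) {
  var named = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
  return s.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, function (whole, body) {
    if (body.charAt(0) === "#") {
      // Numeric reference: hexadecimal (&#x41;) or decimal (&#65;).
      var isHex = body.charAt(1).toLowerCase() === "x";
      var code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points untouched.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : whole;
    }
    // Named entity: decode if known, otherwise leave the text untouched.
    return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : whole;
  });
}

console.log(htmlDecode("Tom &amp; Jerry &lt;b&gt;"));
```

The replacement runs in a single pass, so an input such as "&amp;lt;" decodes once to "&lt;" rather than all the way to "<".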
...