...

Code Block
{
	"display": string,
	"contentMetadata": [{
		"fieldName":string,// Any string, the key for generated array in "doc.metadata"
		"scriptlang":string,// One of "javascript", "regex", "xpath", "stream"
		"script":string,// The script that will generate the array in "doc.metadata" (under fieldName)
		"flags":string,// Standard Java regex flags (regex/xpath only), plus "H" to HTML-decode, "D" to allow duplicate strings (by default they are de-duplicated), plus the following custom flags:
								// For javascript (defaults to "t" if none specified): "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata)
								// For xpath/stream: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
		"replace":string,// Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
		"store":Boolean,// Whether this field should be stored in the DB or discarded after the harvest processing
		"index":Boolean,// Whether this field should be full-text indexed or just stored in the DB
	},
	...
	]
}
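
As an illustration of the format above, here is one possible filled-in fragment, written as a JavaScript object and serialized to JSON. All values (the display text, field name and regular expression) are made up for the example, and the choice of "H" as the only flag is just an assumption to keep it simple.

```javascript
// Illustrative values only: the display text, field name and regex are invented.
const sourceFragment = {
  display: "Example source",
  contentMetadata: [{
    fieldName: "emailDomains",     // results would land in doc.metadata.emailDomains
    scriptlang: "regex",
    script: "[\\w.]+@([\\w.]+)",   // capture group 1 is the domain part
    flags: "H",                    // HTML-decode the results
    replace: "$1",                 // keep only the captured domain
    store: true,                   // keep the field in the DB
    index: false                   // do not full-text index it
  }]
};

// Serialize to the JSON form shown in the format description.
const asJson = JSON.stringify(sourceFragment, null, 2);
console.log(asJson);
```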

...

Parameter | Description
fieldName
Any string, the key for generated array in "doc.metadata"
scriptlang

One of javascript, regex, xpath, or stream.

"stream" provides an efficient parser for splitting up potentially large XML/JSON objects into lots of smaller objects. The chunks are returned either as text or, if flags includes "o", as metadata objects.

script

The script that will generate the array, eg the regular expression, XPath string, or JavaScript.

Warning

Note, if using xpath: the document is converted to a valid HTML document, ie "html" and "body" nodes are the outer nodes. Therefore, even when processing raw XML, the script needs to start with either "/html/body" (to reach the root XML node) or "//" (to get any nodes matching the subsequent expression).

If using stream: the script can be "" or null, in which case the top object is parsed. Otherwise, provide a comma-separated list of (top-level) field names; each object with one of those names, or inside an array with one of those names, is selected (just like with the File extractor).
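
The selection rule for stream scripts can be pictured with a small JavaScript sketch. This is not the platform's parser (which streams its input rather than loading it whole): `selectObjects` is a hypothetical helper, the sample data is made up, and only top-level fields are considered.

```javascript
// Hypothetical helper mimicking the selection rule described for "stream" scripts.
// It works on an already-parsed object, unlike a real streaming parser.
function selectObjects(top, script) {
  // An empty or null script selects the top object itself.
  if (script === null || script === undefined || script === "") {
    return [top];
  }
  const names = script.split(",").map(function (s) { return s.trim(); });
  const selected = [];
  names.forEach(function (name) {
    const value = top[name];
    if (Array.isArray(value)) {
      // Each object inside an array with that name is selected.
      value.forEach(function (item) {
        if (item !== null && typeof item === "object") selected.push(item);
      });
    } else if (value !== null && typeof value === "object") {
      // A single object with that name is selected.
      selected.push(value);
    }
  });
  return selected;
}

const sample = {
  header: { source: "example" },
  records: [{ id: 1 }, { id: 2 }],
  count: 2
};
const chunks = selectObjects(sample, "header,records");
console.log(chunks.length);
```

In this sketch the "header" object and both entries of the "records" array are selected, while the scalar "count" field is ignored.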

flags

For javascript (defaults to "t" if none is specified): "t": the script receives the doc fullText ("text"); "d": the script receives the entire doc ("_doc"); "m": the script receives the doc.metadata ("_metadata").

There are also a few flags that provide additional variables in the javascript:

  • "m" to get "_doc.metadata", written into the variable "_metadata"
    • (for example this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
  • "d" to get "_doc", written into the variable "_doc"
  • "t" to get the full text of the document, written into the variable "text"
    • If the "flags" field is not specified, "text" is populated by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
  • 'H': HTML-decode the resulting field (same as for regex/xpath, except that it is only applied to single-value returns). To decode HTML in more complex expressions/arrays, you can always call org.apache.commons.lang.StringEscapeUtils.unescapeHtml('' + js_string)
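
The way the javascript flags decide which variables a script can see can be sketched as follows. `variablesForFlags` is a hypothetical helper, not part of the platform, and treating an empty flags string as "not specified" is an assumption; the sample document is made up.

```javascript
// Hypothetical helper: which variables a javascript script would see for a given flags string.
function variablesForFlags(flags, doc) {
  // Assumption: a missing or empty flags value falls back to the default "t".
  const effective = (flags === undefined || flags === null || flags === "") ? "t" : flags;
  const vars = {};
  if (effective.indexOf("t") >= 0) vars.text = doc.fullText;      // full text of the document
  if (effective.indexOf("d") >= 0) vars._doc = doc;               // the entire document
  if (effective.indexOf("m") >= 0) vars._metadata = doc.metadata; // the document's metadata
  return vars;
}

const doc = { fullText: "body text", metadata: { raw: ["a", "b", "c"] } };

const withDefault = variablesForFlags(undefined, doc); // only "text" is populated
const withM = variablesForFlags("m", doc);             // "_metadata" only, no "text"

// With "m", a script could copy a subset of one field's values to a new field.
const subset = withM._metadata.raw.slice(0, 2);
console.log(Object.keys(withDefault), Object.keys(withM), subset);
```

Note that with flags set to "m" alone, "text" is absent; flags would need to be "mt" for the script to see both.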

For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array.

For reference, here is the complete set of flags for xpath (and regex, except for "o"):

  • 'H': will HTML-decode resulting fields. (Eg "&amp;" -> "&")
  • 'o': if the stream/XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields)
  • 'D': will allow duplicate strings (by default they are de-duplicated)
  • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
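
The effect of the 'H' and 'D' flags on a list of extracted strings can be approximated in JavaScript. This is only a sketch: `decodeHtml` and `postProcess` are hypothetical helpers, only a handful of entities are handled, and applying the decode before de-duplication is an assumption about ordering.

```javascript
// Hypothetical helper: decode a small subset of HTML entities in a single pass.
function decodeHtml(s) {
  const entities = { "&lt;": "<", "&gt;": ">", "&quot;": "\"", "&#39;": "'", "&amp;": "&" };
  return s.replace(/&(?:lt|gt|quot|#39|amp);/g, function (e) { return entities[e]; });
}

// Hypothetical helper: apply 'H' (HTML-decode) and 'D' (allow duplicates) to extracted values.
function postProcess(values, flags) {
  let out = values;
  if (flags.indexOf("H") >= 0) {
    out = out.map(decodeHtml);
  }
  if (flags.indexOf("D") < 0) {
    // Without 'D', keep only the first occurrence of each string.
    out = out.filter(function (v, i, all) { return all.indexOf(v) === i; });
  }
  return out;
}

const extracted = ["a &amp; b", "a &amp; b", "c"];
const decodedUnique = postProcess(extracted, "H");   // decoded, duplicates dropped
const decodedAll = postProcess(extracted, "HD");     // decoded, duplicates kept
console.log(decodedUnique, decodedAll);
```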
replace

Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc.
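
To show how a replacement string with capturing groups behaves, here is a sketch in JavaScript. `extractWithReplace` is a hypothetical helper, and the assumption is that each match yields one value built from the replacement string; the platform itself uses Java regular expressions, whose `$1` group syntax is similar for simple cases like this one.

```javascript
// Hypothetical helper: for every match of `pattern` in `text`, build one value from `replace`,
// substituting $1, $2, ... with the corresponding capturing groups.
function extractWithReplace(text, pattern, replace) {
  const regex = new RegExp(pattern, "g");
  const results = [];
  let match;
  while ((match = regex.exec(text)) !== null) {
    if (match[0] === "") {
      // Guard against zero-length matches looping forever.
      regex.lastIndex++;
      continue;
    }
    const current = match;
    results.push(replace.replace(/\$(\d)/g, function (whole, n) {
      const group = current[Number(n)];
      return group === undefined ? "" : group;
    }));
  }
  return results;
}

// Made-up sample text and pattern.
const text = "Contact: alice@example.com, bob@example.org";
const values = extractWithReplace(text, "(\\w+)@(\\w+)\\.\\w+", "$1 at $2");
console.log(values);
```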

store

Whether this field should be stored in the DB or discarded after the harvest processing.

index

Whether this field should be full-text indexed or just stored in the DB.

...