...
Code Block |
---|
{ "display": string, "contentMetadata": [{ "fieldName":string,// Any string, the key for generated array in "doc.metadata" "scriptlang":string,// One of "javascript", "regex", "xpath", "stream" "script":string,// The script that will generate the array in "doc.metadata" (under fieldName) "flags":flags,// Standard Java regex field (regex/xpath only), plus "H" to decode HTML, "D": will allow duplicate strings (by default they are de-duplicated), plus the following custom flags: // For javascript (defaults to "t" if none specified), "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata) // For xpath/stream: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1) "replace":string,// Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc "store":Boolean,// Whether this field should be stored in the DB or discarded after the harvest processing "index":Boolean,// Whether this field should be full-text indexed or just stored in the DB }, ... ] } |
Description
The following table describes the parameters of the content metadata configuration.
Parameter | Description |
---|---|
fieldName | Any string; the key for the generated array in "doc.metadata". |
scriptlang | One of javascript, regex, xpath, or stream. "stream" provides an efficient parser for splitting potentially large XML/JSON objects into many smaller objects, returning the chunks either as text or as metadata (if flags is set to "o"). |
script | The script that generates the array in "doc.metadata" (under fieldName), e.g. the regular expression, XPath string, or JS script. |
flags | For javascript (defaults to "t" if none specified): "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata). There are also a few flags that provide additional variables in the javascript. For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. The flags for xpath also apply to regex, except for "O". For regex and xpath, the standard Java regex flags are accepted, plus "H" to decode HTML and "D" to allow duplicate strings (by default they are de-duplicated). |
replace | Replacement string for regex/xpath+regex matches; can include capturing groups as $1 etc. |
store | Whether this field should be stored in the DB or discarded after the harvest processing. |
index | Whether this field should be full-text indexed or just stored in the DB. |
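To make the regex-related parameters concrete, the following is a minimal sketch of how a "regex" entry with "replace" and the "D" flag might behave. It is an approximation written with JavaScript's own RegExp (which differs from the Java regex engine in some details), not the platform's implementation. The names `regexEntry` and `simulateRegexField`, the sample text, and the pattern are hypothetical; only the field names come from the table above.

```javascript
// Hypothetical contentMetadata entry; field names follow the table above,
// the values are illustrative assumptions.
const regexEntry = {
  fieldName: "organization",
  scriptlang: "regex",
  script: "Organization:\\s*(\\w+(?: \\w+)*)",
  flags: "",        // add "D" to keep duplicate strings
  replace: "$1",    // keep only the first capturing group
  store: true,
  index: false
};

// Approximation only: emulates a "regex" entry using JavaScript's RegExp.
// Standard Java regex flags are NOT translated in this sketch.
function simulateRegexField(text, entry) {
  const allowDuplicates = (entry.flags || "").includes("D");
  const pattern = new RegExp(entry.script, "g");
  const values = [];
  for (const match of text.matchAll(pattern)) {
    // With "replace", capturing groups are substituted as $1, $2, ...;
    // without it, this sketch keeps the whole match (an assumption).
    const value = entry.replace
      ? match[0].replace(new RegExp(entry.script), entry.replace)
      : match[0];
    // Values are de-duplicated unless the "D" flag is present.
    if (allowDuplicates || !values.includes(value)) {
      values.push(value);
    }
  }
  // The result mimics the array stored under fieldName in doc.metadata.
  return { [entry.fieldName]: values };
}

const sampleText =
  "Organization: Acme Corp\nOrganization: Globex\nOrganization: Acme Corp";

console.log(simulateRegexField(sampleText, regexEntry));
// → { organization: [ 'Acme Corp', 'Globex' ] }
```

Setting `flags: "D"` on the same entry would keep the second "Acme Corp", giving three values instead of two.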
...
Code Block |
---|
}, { "entities": [ { "creationCriteriaScript": "$FUNC( isOrganizationSpecified(); )", "dimension": "Who", "disambiguated_name": "$metadata.organization", "type": "Organization", "useDocGeo": false }, |
...
...
XPath
Neither regex nor javascript is well suited to extracting fields from HTML and XML, particularly since the current JavaScript engine (the Java version of Rhino) does not support the DOM.
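As an illustration, two "xpath" entries might look like the sketch below, written as a JavaScript object literal so that comments are allowed. The field names and flag meanings follow the parameter table earlier on this page; the XPath expressions, the `fieldName` values, and the helper `hasKnownScriptlang` are hypothetical.

```javascript
// Hypothetical xpath entries; the expressions and field names are
// illustrative assumptions, not taken from a real source configuration.
const xpathEntries = [
  {
    fieldName: "authorName",
    scriptlang: "xpath",
    script: "//span[@class='author']",
    flags: "H",      // decode HTML in the extracted text
    store: true,
    index: true
  },
  {
    fieldName: "infoTable",
    scriptlang: "xpath",
    script: "//table[@id='info']",
    flags: "o",      // convert the matched HTML/XML object to JSON
    store: true,
    index: false
  }
];

// Small sanity check for this sketch: scriptlang must be one of the
// four documented values.
const KNOWN_SCRIPTLANGS = ["javascript", "regex", "xpath", "stream"];

function hasKnownScriptlang(entry) {
  return KNOWN_SCRIPTLANGS.includes(entry.scriptlang);
}

console.log(xpathEntries.every(hasKnownScriptlang));
```

With the "o" flag the second entry would store an object rather than a string in the `infoTable` array, which is usually the more convenient form for tabular HTML.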
...