Overview
This element takes either the raw content or the content produced by a preceding manual or automated text pipeline element, applies a JavaScript, regex, or XPath transformation to it, and writes the output to the document's full text, description, title, or one of the textual metadata fields.
TODO
Format
Code Block |
---|
TODO |
Legacy documentation:
...
TODO convert to JSON
Code Block |
---|
{
"display": string,
"text": [
{} // see ManualTextExtractionSpecPojo below
]
}
//////////////////////////////////
public static class ManualTextExtractionSpecPojo {
public String fieldName; // One of "fullText", "description", "title"
public String script; // The regex/xpath/javascript expression (see scriptlang below)
public String flags; // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
public String replacement; // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
public String scriptlang; // One of "javascript", "regex", "xpath"
}
|
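As a rough illustration of how the "script", "flags", and "replacement" fields might interact for a "regex" element, here is a minimal Java sketch. It is not the platform's implementation: the ManualTextSketch class, its Spec holder, and the applyRegex and toPatternFlags helpers are hypothetical names, and the mapping of flag letters ("i", "m", "d") to java.util.regex.Pattern constants is an assumption. The "H" (decode HTML) flag and the "xpath" and "javascript" modes are deliberately left out.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: applies a "regex" spec to a piece of text.
class ManualTextSketch {

    // Mirrors the fields of ManualTextExtractionSpecPojo shown above.
    static class Spec {
        String fieldName, script, flags, replacement, scriptlang;

        Spec(String fieldName, String script, String flags,
             String replacement, String scriptlang) {
            this.fieldName = fieldName;
            this.script = script;
            this.flags = flags;
            this.replacement = replacement;
            this.scriptlang = scriptlang;
        }
    }

    // ASSUMPTION: flag letters map to Pattern constants like this.
    // Unknown letters (including "H") are ignored in this sketch.
    static int toPatternFlags(String flags) {
        int bits = 0;
        if (flags == null) {
            return bits;
        }
        for (char c : flags.toCharArray()) {
            switch (c) {
                case 'i': bits |= Pattern.CASE_INSENSITIVE; break;
                case 'm': bits |= Pattern.MULTILINE; break;
                case 'd': bits |= Pattern.DOTALL; break;
                default: break;
            }
        }
        return bits;
    }

    // Replaces every match of spec.script in the input with spec.replacement.
    // Capturing groups can be referenced in the replacement as $1, $2, etc.
    static String applyRegex(String input, Spec spec) {
        if (!"regex".equals(spec.scriptlang)) {
            throw new IllegalArgumentException("this sketch only handles scriptlang=regex");
        }
        Pattern pattern = Pattern.compile(spec.script, toPatternFlags(spec.flags));
        String replacement = (spec.replacement == null) ? "" : spec.replacement;
        return pattern.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Strip bold tags from the full text, case-insensitively, keeping the inner text.
        Spec spec = new Spec("fullText", "<b>(.*?)</b>", "i", "$1", "regex");
        System.out.println(applyRegex("A <B>bold</B> claim", spec));
    }
}
```

In a real pipeline the result would be written to the field named by "fieldName" rather than printed; that step is omitted here because it depends on the document model.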
Legacy documentation:
- See under "simpleTextCleanser object"
- (Note: headers and footers are no longer supported; remove them manually with your own transformation instead.)
TODO
Description
Legacy documentation:
TODO
Examples
TODO