...
Starting with either the raw content or the content transformed by a preceding manual or automated text pipeline element, this element applies the javascript, regex, or xpath transformation and writes the output to the document's full text, description, title, or one of the textual metadata fields.
TODO
Format
TODO convert to JSON
```
{
    "display": string,
    "text": [
        {} // see ManualTextExtractionSpecPojo below
    ]
}

//////////////////////////////////

public static class ManualTextExtractionSpecPojo {
    public String fieldName; // One of "fullText", "description", "title"
    public String script; // The script/xpath/javascript expression (see scriptlang below)
    public String flags; // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
    public String replacement; // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
    public String scriptlang; // One of "javascript", "regex", "xpath"
}
```
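
As a rough illustration of the layout above, the sketch below builds one pipeline element as a plain javascript object and checks the two enumerated fields. Every value (the display string, the regex, the empty flags and replacement) is a hypothetical example, and `isValidSpec` is an illustrative helper, not part of the platform.

```javascript
// Hypothetical pipeline element following the "Format" layout.
// All values are illustrative assumptions, not taken from a real source.
var textElement = {
  display: "Strip bracketed footnote markers",
  text: [
    {
      fieldName: "fullText",   // one of "fullText", "description", "title"
      scriptlang: "regex",     // one of "javascript", "regex", "xpath"
      script: "\\[\\d+\\]",    // matches markers such as [12]
      flags: "",               // assumption: no flags needed for this example
      replacement: ""          // assumption: an empty replacement string
    }
  ]
};

// Illustrative check; it only knows the names listed in the POJO comments.
var FIELD_NAMES = ["fullText", "description", "title"];
var SCRIPT_LANGS = ["javascript", "regex", "xpath"];

function isValidSpec(spec) {
  return FIELD_NAMES.indexOf(spec.fieldName) !== -1 &&
         SCRIPT_LANGS.indexOf(spec.scriptlang) !== -1 &&
         typeof spec.script === "string";
}

console.log(textElement.text.every(isValidSpec));
console.log(JSON.stringify(textElement, null, 2));
```

Serializing the object with `JSON.stringify` gives the JSON form that would go into a source configuration.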
Legacy documentation:
- See under "simpleTextCleanser object"
- (note headers and footers are no longer supported - you can just do this manually now)
...
Description
Manual text transformation lets you specify the data source that your script works on. The script enriches the data from the data sources so that it can be output as metadata for creating advanced entities and associations.
...
Parameter | Description | Note | Data Type |
---|---|---|---|
fieldName | Specifies the data source that the script will execute against: "fullText", "description", or "title". | | String |
script | The script, xpath, or javascript expression to run (see scriptlang). | | String |
flags | Standard Java regex field. Can have different values depending on the script language. javascript: a few flags provide additional variables in the javascript. xpath (and regex, except for "O"): the standard Java regex flags, plus "H" to decode HTML. | | String |
replacement | Replacement string for regex (or xpath plus regex) matches; can include capturing groups as $1 etc. For example, you could find the instance C/M or C/F in a document and extract that the race is Caucasian. The same can be done to extract M or F as the sex, meaning male or female. | | String |
scriptlang | Specifies the language of the script that will be provided: "javascript", "regex", or "xpath". | | String |
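
The replacement parameter and its capturing groups can be illustrated with the C/M and C/F case mentioned in the table. This is only a javascript stand-in: the platform applies Java regex, but `$1` refers to the first capturing group in both. The pattern, the replacement string, and the `applyRegexSpec` helper are assumptions made for the example.

```javascript
// Hypothetical "script" and "replacement" values for a regex spec.
var script = "\\bC/([MF])\\b";                 // captures M or F after "C/"
var replacement = "Race: Caucasian, Sex: $1";  // $1 is the captured letter

// Illustrative helper: replaces every match of the pattern in the text.
function applyRegexSpec(text, pattern, replacementString) {
  return text.replace(new RegExp(pattern, "g"), replacementString);
}

var out = applyRegexSpec("Suspect: C/M, witness: C/F", script, replacement);
console.log(out);
// Suspect: Race: Caucasian, Sex: M, witness: Race: Caucasian, Sex: F
```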
Supported Script Languages
You can program manual text extraction using the following supported languages:
- javascript
- regex
- xpath
...
javascript
For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.
...
```
],
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
"mediaType": ["Log"],
"metadata": {"info": [{
    "alert": "DUMMY_ALERT_TYPE_1 ",
    "country": "United States",
    "date": "2012-01-01T13:43:00",
    "device": "SCANNER_1 ",
    "dstIP": "66.66.66.66",
    "srcIP": " 10.0.0.1"
}]},
```
...
The javascript can also return more complex objects, arrays of objects, or arrays of primitives.
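
A minimal sketch of the kind of script that could turn the comma-separated log line from the sample document into an object is shown here. The function name `parseAlertLine` and the way the text reaches the script are assumptions for illustration; the sketch also trims whitespace, which the sample output does not do consistently. It sticks to older javascript syntax, since the engine is Rhino.

```javascript
// Hypothetical helper: splits one log line into named fields.
function parseAlertLine(text) {
  var parts = text.split(",");
  var names = ["device", "date", "srcIP", "dstIP", "alert", "country"];
  var result = {};
  for (var i = 0; i < names.length; i++) {
    // Trim surrounding whitespace; missing fields become empty strings.
    result[names[i]] = (parts[i] || "").replace(/^\s+|\s+$/g, "");
  }
  return result;
}

var line = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";
var info = parseAlertLine(line);   // a single object
var ips = [info.srcIP, info.dstIP]; // an array of primitives

console.log(JSON.stringify(info));
console.log(ips.join(" -> "));
```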
...
Regex
xml
The following example shows how a regex script can be used to manually parse the text of the ingested data.
...
```
}],
"multipledays": ["No"],
"organization": ["No group"],
"perpetrator": [{
    "characteristic": "Islamic Extremist (Sunni)",
    "nationality": "Unknown"
}],
```
...
Xpath
Neither regex nor javascript is well suited to extracting fields from HTML and XML, particularly since the current javascript engine (the Java version of Rhino) does not support DOM.
...