Takes either the raw content or the content produced by a preceding manual or automated text pipeline element, applies the javascript, regex, or xpath transformation, and writes the output to the document's full text, description, title, or one of the textual metadata fields.

TODO

Format

TODO convert to JSON

Code Block

{
	"display": string,
	"text": [
	{} // see ManualTextExtractionSpecPojo below
	]
}
//////////////////////////////////
	public static class ManualTextExtractionSpecPojo {
		public String fieldName; // One of "fullText", "description", "title"
		public String script; // The script/xpath/javascript expression (see scriptlang below)
		public String flags; // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
		public String replacement; // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
		public String scriptlang; // One of "javascript", "regex", "xpath"
	}
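
As an illustration of the format above, a single element might look like the following javascript object literal. This is only a sketch: the field names and the allowed values of "fieldName" and "scriptlang" come from the definition above, while the "display" label, the regex, and the replacement are hypothetical values chosen for the example.

```javascript
// Hypothetical manual text transformation element; all values are illustrative only.
const manualTextElement = {
  display: "Collapse runs of whitespace in the full text", // free-text label (assumed)
  text: [
    {
      fieldName: "fullText", // one of "fullText", "description", "title"
      scriptlang: "regex",   // one of "javascript", "regex", "xpath"
      script: "\\s+",        // the regex to match (hypothetical example)
      replacement: " "       // replacement string; may reference capturing groups as $1 etc
    }
  ]
};

// Serialize to JSON to see the shape described above.
const asJson = JSON.stringify(manualTextElement, null, 2);
console.log(asJson);
```

The "flags" field is omitted here because it is optional in this sketch; whether the platform requires it is not stated above.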

Legacy documentation:

See under "simpleTextCleanser object"
- (note headers and footers are no longer supported - you can just do this manually now)

...

Description

Manual text transformation lets you choose the data that your script works on. The script enriches the data from the data sources so that it can be output as metadata and used to create advanced entities and associations.

...

Parameter	Description	Note	Data Type
`fieldName`	Specifies the data source that the script executes against	One of "fullText", "description", or "title"
`script`	The script to run
`flags`	Standard Java regex field	Can have different values, based on `scriptlang`; see below.
	javascript: the following flags make additional variables available to the script:
	"m": writes "_doc.metadata" into the variable "_metadata" (for example, this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
	"d": writes "_doc" into the variable "_doc"
	"t": writes the full text of the document into the variable "text". This is the default if the "flags" field is not specified. If the "flags" field is specified, "t" must be included or the "text" variable is not populated.
	xpath (and regex, except for "O"):
	'H': HTML-decodes the resulting fields (eg "&amp;" -> "&")
	'o': if the XPath expression points to an HTML (/XML) object, this object is converted to JSON and stored as an object in the corresponding metadata field array (can also be done via the deprecated "groupNum":-1)
	'x': if the XPath expression points to an HTML (/XML) object, the XML of the object is displayed with no decoding (eg stripping of fields)
	'D': described above
	'c': if set, fields with the same name are chained together (otherwise they all append their results to the field within metadata)
`replacement`	If `scriptlang` is regex or xpath, `replacement` can be used to replace the value matched by the regex/xpath. For example, you could find an instance of C/M or C/F in a document and record that the Race is Caucasian. In the same way, M or F can be extracted as a Sex of Male or Female.
`scriptlang`	Specifies the language of the script that will be provided	One of "javascript", "regex", or "xpath"
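
The way `script` and `replacement` combine for a regex can be sketched in plain javascript. This is an approximation, not the platform's implementation: the platform applies Java regexes, whereas the sketch uses javascript's String.replace, which handles "$1"-style group references the same way for a simple case like this. The helper name applyReplacement, the sample record, and the pattern are all hypothetical.

```javascript
// Sketch only: emulates a regex "script" plus "replacement" with String.replace.
// applyReplacement is a hypothetical helper, not a platform function.
function applyReplacement(input, pattern, flags, replacement) {
  // "g" is added so every match is replaced, not just the first
  // (an assumption about the intended behavior).
  const regex = new RegExp(pattern, flags.includes("g") ? flags : flags + "g");
  return input.replace(regex, replacement);
}

// Expand the C/M and C/F shorthand from the `replacement` description above.
const record = "Suspect: C/M, age 34";
const expanded = applyReplacement(
  record,
  "\\bC/(M|F)\\b",           // group 1 captures the sex code
  "",                        // no flags
  "Race: Caucasian, Sex: $1" // $1 is replaced by the captured M or F
);
console.log(expanded);
```

A second regex element could then map the captured M or F to Male or Female in the same way.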

Supported Script Languages

You can program manual text extraction using the following supported languages:

javascript
regex
xpath

...

javascript

For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.

...

Code Block

 ],    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
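
Metadata of the shape shown above could be derived from the "fullText" line by a script along the following lines. This is a minimal sketch under stated assumptions: the helper name parseAlertLine is hypothetical, the field order is inferred from the sample line, and how a script hands its result back to the platform is not shown because it is not described here. The code sticks to `var` and function expressions since the scripts run under Rhino.

```javascript
// Field names taken from the "metadata.info" example above, in the order
// the values appear in the comma-separated line (inferred from the sample).
var FIELD_NAMES = ["device", "date", "srcIP", "dstIP", "alert", "country"];

// Hypothetical helper: turns one comma-separated log line into a metadata object.
function parseAlertLine(line) {
  var values = line.split(",").map(function (value) {
    return value.trim(); // drop the padding around each value
  });
  var info = {};
  FIELD_NAMES.forEach(function (name, index) {
    info[name] = values[index];
  });
  return info;
}

// In a real script, "text" would be populated by the platform (see the "t" flag);
// it is set by hand here so the sketch runs on its own.
var text = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";
var info = parseAlertLine(text);
console.log(JSON.stringify(info));
```

Unlike the sample output above, this sketch trims every value, so fields such as "device" and "alert" carry no trailing space.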

...

The javascript can also return more complex objects, arrays of objects, or arrays of primitives.
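
To make the array cases concrete, the following sketch builds an array of primitives and an array of objects from a line of text. The helper name extractIpAddresses and the "ip" property are hypothetical, and the pattern only matches tokens that look like IPv4 addresses without validating them. How the value is passed back to the platform is left out, since that depends on details not covered here.

```javascript
// Hypothetical helper: returns every IPv4-looking token in the text
// as an array of strings (an array of primitives).
function extractIpAddresses(text) {
  var matches = text.match(/\b\d{1,3}(?:\.\d{1,3}){3}\b/g);
  return matches === null ? [] : matches; // match() yields null when nothing is found
}

var sample = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";

// Array of primitives.
var addresses = extractIpAddresses(sample);

// Array of objects built from the same values.
var addressObjects = addresses.map(function (ip) {
  return { ip: ip };
});

console.log(JSON.stringify(addresses));
console.log(JSON.stringify(addressObjects));
```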

...

Regex

xml

The following example shows how a regex script can be used to manually parse the text of the ingested data.

...

Code Block
}], "multipledays": ["No"], "organization": ["No group"], "perpetrator": [{ "characteristic": "Islamic Extremist (Sunni)", "nationality": "Unknown" }],

...

Xpath

Neither regex nor javascript is well suited to extracting fields from HTML and XML (particularly since the current javascript engine, the Java version of Rhino, does not support DOM).

...
