Overview
Starting with either the raw content or the content transformed by a preceding manual or automated text pipeline element, this element applies the javascript, regex, or xpath transformation and writes the output to the document's full text (or description, or title, or one of the textual metadata fields).
Format
{
    "display": string,
    "text": [
        {
            "fieldName": string, // One of "fullText", "description", "title"
            "script": string, // The script/xpath/javascript expression (see scriptlang below)
            "flags": string, // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
            "replacement": string, // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
            "scriptlang": string // One of "javascript", "regex", "xpath"
        }
        //..
    ]
}
Description
With manual text transformation, you specify which part of the document your script works on. The script enriches the data from the data source so that it can be output as metadata for the creation of advanced entities and associations.
The following table describes the parameters of the manual text transformation configuration.
Parameter | Description |
---|---|
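fieldName | Specifies the field that the script will execute against: one of "fullText", "description", or "title". |
script | The script, xpath, or javascript expression to apply, interpreted according to scriptlang. |
flags | Standard Java regex field (regex/xpath only), plus "H" to decode HTML. Can have different values, based on the scriptlang. See below. |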
javascript: There are a few flags that provide additional variables in the javascript:
| |
xpath (and regex, except for "O"):
| |
replacement | Replacement string for regex (or xpath plus regex) matches; can include capturing groups as $1 etc. For example, you could find the instance C/M or C/F in a document and extract from it that the Race is Caucasian. The same can be done to extract M or F as a Sex, meaning Male or Female. |
scriptlang | Specifies the language of the script that will be provided: one of "javascript", "regex", or "xpath". |
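As a rough illustration of how fieldName, script, and replacement interact for a regex element, the sketch below emulates the behaviour in plain javascript. The function name applyRegexElement, the sample document, and the pattern are all hypothetical; javascript's own regex engine stands in for the Java one, and the flags parameter is not modelled. It is an analogy under those assumptions, not the platform's implementation.

```javascript
// Hypothetical helper: emulates a single "regex" text element.
// Assumption: every match of `script` in the chosen field is replaced.
function applyRegexElement(doc, element) {
    var pattern = new RegExp(element.script, "g");
    var result = Object.assign({}, doc); // shallow copy, input left untouched
    result[element.fieldName] =
        String(doc[element.fieldName]).replace(pattern, element.replacement);
    return result;
}

// Made-up sample document containing the C/M and C/F codes
var sampleDoc = { fullText: "Suspect 1: C/M. Suspect 2: C/F." };

// Made-up element: $1 and $2 refer to the two capturing groups
var element = {
    fieldName: "fullText",
    script: "\\b([A-Z])/([MF])\\b",
    replacement: "Race=$1 Sex=$2",
    scriptlang: "regex"
};

var transformed = applyRegexElement(sampleDoc, element);
console.log(transformed.fullText);
```

Only the field named by fieldName changes; the other fields of the document are carried over as they were.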
Supported Script Languages
You can program manual text extraction using the following supported languages. A detailed example of each is given below.
- Javascript
- Regex
- Xpath
Examples
Javascript
For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.
Log File From File Share
In the following example, manual text transformation is used to parse a log file over the web, with a script of type javascript.
The globals element defines a function called "decode", which is then used to capture the metadata for the sample input data in a variable called "info".
The info variable holds the captured metadata in the following fields:
- info.device
- info.date
- info.srcIP
- info.dstIP
- info.alert
- info.country
{
    "globals": {
        "scripts": [
            "function decode(x)\n{\n var info = {}; \n var rec = x.split(','); \n info.device = rec[0];\n info.date = rec[1];\n info.srcIP = rec[2];\n info.dstIP = rec[3];\n info.alert = rec[4];\n info.country = rec[5];\n return info;\n}"
        ]
    }
},
{
    "harvest": {
        "searchCycle_secs": 3600
    }
},
{
    "docMetadata": {
        "title": "$metadata.info.alert @ $metadata.info.date [$metadata.info.device]: $metadata.info.dstIP -> $metadata.info.srcIP",
        "publishedDate": "$SCRIPT( return _doc.metadata.info[0].date; )"
    }
},
{
    "contentMetadata": [
        {
            "fieldName": "info",
            "script": "var info = decode(text); info;",
            "scriptlang": "javascript"
        }
    ]
}
Metadata:
The metadata captured from the sample input data then appears in the "metadata" field of the output document:
],
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
"mediaType": ["Log"],
"metadata": {"info": [{
    "alert": "DUMMY_ALERT_TYPE_1 ",
    "country": "United States",
    "date": "2012-01-01T13:43:00",
    "device": "SCANNER_1 ",
    "dstIP": "66.66.66.66",
    "srcIP": " 10.0.0.1"
}]},
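For readability, the decode function from the globals element is shown below unescaped and run against the sample fullText. Because the text is split on the comma alone, padding spaces next to a comma can stay in the values (as with "SCANNER_1 " in the output above). The decodeTrimmed variant is a hypothetical addition, not part of the original source configuration, that strips that padding.

```javascript
// The decode function from "globals", unescaped
function decode(x) {
    var info = {};
    var rec = x.split(',');
    info.device = rec[0];
    info.date = rec[1];
    info.srcIP = rec[2];
    info.dstIP = rec[3];
    info.alert = rec[4];
    info.country = rec[5];
    return info;
}

// Hypothetical variant (an assumption, not in the original source):
// trims every field so that no padding survives the split.
// It expects all six fields to be present.
function decodeTrimmed(x) {
    var info = decode(x);
    Object.keys(info).forEach(function (key) {
        info[key] = info[key].trim();
    });
    return info;
}

var text = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";
var raw = decode(text);
var trimmed = decodeTrimmed(text);
console.log(JSON.stringify(raw.device), JSON.stringify(trimmed.device));
```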
Javascript can also return more complex objects, arrays of objects, or arrays of primitives.
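As a sketch of the array case, the hypothetical decodeAll function below returns one object per log line, and a second expression derives an array of primitives from it. The function name and the two-line sample input are assumptions made for illustration; in a contentMetadata element the script's final expression would be the value returned, as with "info;" in the configuration above.

```javascript
// Hypothetical helper: one object per non-empty line of comma-separated text
function decodeAll(text) {
    return text.split('\n')
        .filter(function (line) { return line.trim().length > 0; })
        .map(function (line) {
            var rec = line.split(',').map(function (field) { return field.trim(); });
            return {
                device: rec[0],
                date: rec[1],
                srcIP: rec[2],
                dstIP: rec[3],
                alert: rec[4],
                country: rec[5]
            };
        });
}

var sample =
    "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States\n" +
    "SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands";

var alerts = decodeAll(sample);                                  // array of objects
var countries = alerts.map(function (a) { return a.country; });  // array of primitives
console.log(countries);
```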
Regex
Log File
Source:
Consider the following alarm logs, which record device alerts along with their network and physical locations.
Device,Date,SrcIP,dstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States
SCANNER_2,2012-02-01T14:21:00,10.0.0.2,66.66.66.66,DUMMY_ALERT_TYPE_2,United Kingdom
SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands
Source Configuration:
In the source configuration, a regex script replaces each comma with " , " to produce the "fullText" and "description" of the resulting document.
},
{
    "text": [
        {
            "fieldName": "fullText",
            "script": ",",
            "scriptlang": "regex",
            "flags": "md",
            "replacement": " , "
        },
        {
            "fieldName": "description",
            "script": ",",
            "scriptlang": "regex",
            "flags": "md",
            "replacement": " , "
        }
    ]
},
Output:
The example output includes the "fullText" that results from the regex script.
}
],
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
"mediaType": ["Log"],
"metadata": {"info": [{
    "alert": "DUMMY_ALERT_TYPE_1 ",
    "country": "United States",
    "date": "2012-01-01T13:43:00",
    "device": "SCANNER_1 ",
    "dstIP": "66.66.66.66",
    "srcIP": " 10.0.0.1"
}]},
"modified": "Jun 4, 2013 12:54:34 AM UTC",
"publishedDate": "January 1, 2012 13:43:00 PM UTC",
"source": ["Cyber Logs Test"],
"sourceKey": ["INFINITE_ENDPOINT.api.share.get.51ad28a440b4a4f0f757824c.25.26"],
"tags": [
    "cyber",
    "structured"
],
"title": "DUMMY_ALERT_TYPE_1 @ 2012-01-01T13:43:00 [SCANNER_1 ]: 66.66.66.66 -> 10.0.0.1",
"url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c#1"
}
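In plain javascript terms, each element of the "text" configuration above behaves roughly like a global replace of "," with " , ". The sketch below is only an analogy: padCommas is a hypothetical name, javascript's regex engine is used in place of Java's, and the "md" flags are not modelled.

```javascript
// Rough javascript analogue of the "fullText" regex element (flags not modelled)
function padCommas(text) {
    return text.replace(/,/g, " , ");
}

var line = "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States";
var fullText = padCommas(line);
console.log(fullText);
```

Applied to the first log record, this yields the same "fullText" string as the example output.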
Xpath
Neither regex nor javascript is well suited for extracting fields from HTML and XML.
As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).
In this example, an Xpath script is used as part of manual text extraction, in order to convert a sample XML document into JSON.
XML
Source Input:
Consider the following XML file, which includes a price list for several food items.
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>light Belgian waffles covered with strawberries and whipped cream</description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>thick slices made from our homemade sourdough bread</description>
        <calories>600</calories>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>$6.95</price>
        <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
        <calories>950</calories>
    </food>
</breakfast_menu>
Source Configuration:
In the source configuration example below, an xpath script is specified to perform the JSON conversion.
{
    "links": {
        "extraMeta": [
            {
                "context": "First",
                "fieldName": "convert_to_json",
                "flags": "o",
                "script": "//breakfast_menu/food[*]",
                "scriptlang": "xpath"
            }
        ],
        "script": "function convert_to_docs(jsonarray, url)\n{\n var docs = [];\n for (var docIt in jsonarray) {\n var predoc = jsonarray[docIt];\n delete predoc.content;\n var doc = {};\n doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n doc.fullText = predoc;\n doc.title = \"TBD\";\n doc.description = \"TBD\";\n docs.push(doc);\n }\n return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
        "scriptflags": "d"
    }
Output:
The output is a series of JSON-formatted documents; the first two are shown here:
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"650\" , \"description\" : \"two of our famous Belgian Waffles with plenty of real maple syrup\" , \"price\" : \"$5.95\" , \"name\" : \"Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "650",
        "description": "two of our famous Belgian Waffles with plenty of real maple syrup",
        "name": "Belgian Waffles",
        "price": "$5.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#0"
}
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"900\" , \"description\" : \"light Belgian waffles covered with strawberries and whipped cream\" , \"price\" : \"$7.95\" , \"name\" : \"Strawberry Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "900",
        "description": "light Belgian waffles covered with strawberries and whipped cream",
        "name": "Strawberry Belgian Waffles",
        "price": "$7.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#1"
}
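The script embedded in the "links" element of the source configuration is easier to follow when unescaped. The sketch below reproduces it and runs it against a stubbed _doc object. The stub, its shortened metadata, and the "?cache=1" query string are assumptions added for illustration; on the platform, _doc is supplied by the harvester.

```javascript
// Stub for the platform-provided document object (an assumption for this sketch)
var _doc = {
    url: "http://www.w3schools.com/xml/simple.xml?cache=1",
    metadata: {
        convert_to_json: [
            { name: "Belgian Waffles", price: "$5.95", calories: "650" },
            { name: "Strawberry Belgian Waffles", price: "$7.95", calories: "900" }
        ]
    }
};

// The script from the source configuration, unescaped
function convert_to_docs(jsonarray, url) {
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;
        var doc = {};
        // Strip any query string, then append the item index as a fragment
        doc.url = _doc.url.replace(/[?].*/, "") + '#' + docIt;
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
console.log(docs.map(function (d) { return d.url; }));
```

Each entry in the array becomes its own document, distinguished by the "#0", "#1" suffix seen in the "url" fields of the output.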
Footnotes:
Legacy documentation:
- See under "simpleTextCleanser object"
- (note headers and footers are no longer supported - you can just do this manually now)