Manual text transformation
Format
{
  "display": string,
  "text": [
    {
      "fieldName": string,   // One of "fullText", "description", "title"
      "script": string,      // The script/xpath/javascript expression (see scriptlang below)
      "flags": string,       // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
      "replacement": string, // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
      "scriptlang": string,  // One of "javascript", "regex", "xpath"
    }
    //..
  ]
}
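As a rough illustration of the constraints in the format above, the following sketch checks a single "text" entry against the allowed values. The helper name validateTextEntry is hypothetical and not part of the platform; it only encodes the rules stated in the format comments.

```javascript
// Hypothetical helper (not part of the platform): checks one entry of the
// "text" array against the values listed in the format description.
var FIELD_NAMES = ["fullText", "description", "title"];
var SCRIPT_LANGS = ["javascript", "regex", "xpath"];

function validateTextEntry(entry) {
  var errors = [];
  if (FIELD_NAMES.indexOf(entry.fieldName) < 0) {
    errors.push("fieldName must be one of: " + FIELD_NAMES.join(", "));
  }
  if (SCRIPT_LANGS.indexOf(entry.scriptlang) < 0) {
    errors.push("scriptlang must be one of: " + SCRIPT_LANGS.join(", "));
  }
  if (typeof entry.script !== "string" || entry.script.length === 0) {
    errors.push("script must be a non-empty string");
  }
  // Per the format comments, "replacement" applies to regex and xpath+regex only.
  if (entry.replacement !== undefined && entry.scriptlang === "javascript") {
    errors.push("replacement is only used with regex or xpath scripts");
  }
  return errors; // an empty array means no problems were found
}
```

An entry such as { "fieldName": "fullText", "script": ",", "scriptlang": "regex", "replacement": " , " } yields an empty error list under these checks.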
Examples
Javascript
For power users, metadata can be generated from the content using JavaScript. This gives considerable flexibility to apply site- or source-specific knowledge when pulling out metadata that can then be turned into entities or associations.
Log File From File Share
In the following example, manual text transformation is used to parse a log file retrieved over the web, with a script of type javascript.
The "globals" block defines a function called "decode", which the "contentMetadata" script then calls to capture the metadata for the sample input data in a field called "info".
The fields of each record are then available as follows:
- info.device
- info.date
- info.srcIP
- info.dstIP
- info.alert
- info.country
{
  "globals": {
    "scripts": [
      "function decode(x)\n{\n var info = {}; \n var rec = x.split(','); \n info.device = rec[0];\n info.date = rec[1];\n info.srcIP = rec[2];\n info.dstIP = rec[3];\n info.alert = rec[4];\n info.country = rec[5];\n return info;\n}"
    ]
  }
},
{
  "harvest": {
    "searchCycle_secs": 3600
  }
},
{
  "docMetadata": {
    "title": "$metadata.info.alert @ $metadata.info.date [$metadata.info.device]: $metadata.info.dstIP -> $metadata.info.srcIP",
    "publishedDate": "$SCRIPT( return _doc.metadata.info[0].date; )"
  }
},
{
  "contentMetadata": [
    {
      "fieldName": "info",
      "script": "var info = decode(text); info;",
      "scriptlang": "javascript"
    }
  ]
}
Metadata:
The metadata captured from the sample input data then appears in the "metadata" field of the output document, as in the following fragment.
],
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
"mediaType": ["Log"],
"metadata": {
  "info": [
    {
      "alert": "DUMMY_ALERT_TYPE_1 ",
      "country": "United States",
      "date": "2012-01-01T13:43:00",
      "device": "SCANNER_1 ",
      "dstIP": "66.66.66.66",
      "srcIP": " 10.0.0.1"
    }
  ]
},
JavaScript can also return more complex objects, arrays of objects, or arrays of primitives.
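As an illustration of returning an array of objects, the sketch below applies the same field layout to every line of a multi-line log, and trims the padding spaces that are visible in the sample metadata above (for example "SCANNER_1 "). The names decodeLine and decodeAll are hypothetical, and the sketch assumes the script engine supports String.prototype.trim.

```javascript
// Hypothetical variant of decode(): builds one object per log line.
function decodeLine(line) {
  var rec = line.split(',');
  for (var i = 0; i < rec.length; i++) {
    rec[i] = rec[i].trim(); // drop the padding around each field
  }
  return {
    device: rec[0],
    date: rec[1],
    srcIP: rec[2],
    dstIP: rec[3],
    alert: rec[4],
    country: rec[5]
  };
}

// Returns an array of objects, skipping blank lines.
function decodeAll(text) {
  var out = [];
  var lines = text.split('\n');
  for (var i = 0; i < lines.length; i++) {
    if (lines[i].trim().length > 0) {
      out.push(decodeLine(lines[i]));
    }
  }
  return out;
}

var sample =
  "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States\n" +
  "SCANNER_3 , 2012-03-01T15:17:00 , 10.0.0.1 , 99.66.99.66 , DUMMY_ALERT_TYPE_3 , Netherlands";
var records = decodeAll(sample);
```

Following the pattern of the "contentMetadata" script above, a script using this variant would end with the array as its final expression, for example "decodeAll(text);".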
Regex
Log File
Source:
Consider the following alarm logs, which record device alerts together with their network and physical locations.
Device,Date,SrcIP,DstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States
SCANNER_2,2012-02-01T14:21:00,10.0.0.2,66.66.66.66,DUMMY_ALERT_TYPE_2,United Kingdom
SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands
Source Configuration:
In the source configuration, two regex entries rewrite the "fullText" and "description" fields of the resulting document, replacing each comma with a space-padded comma (" , ").
},
{
  "text": [
    {
      "fieldName": "fullText",
      "script": ",",
      "scriptlang": "regex",
      "flags": "md",
      "replacement": " , "
    },
    {
      "fieldName": "description",
      "script": ",",
      "scriptlang": "regex",
      "flags": "md",
      "replacement": " , "
    }
  ]
},
Output:
The example output includes the "fullText" that results from the regex script.
  }
],
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
"mediaType": ["Log"],
"metadata": {
  "info": [
    {
      "alert": "DUMMY_ALERT_TYPE_1 ",
      "country": "United States",
      "date": "2012-01-01T13:43:00",
      "device": "SCANNER_1 ",
      "dstIP": "66.66.66.66",
      "srcIP": " 10.0.0.1"
    }
  ]
},
"modified": "Jun 4, 2013 12:54:34 AM UTC",
"publishedDate": "January 1, 2012 13:43:00 PM UTC",
"source": ["Cyber Logs Test"],
"sourceKey": ["INFINITE_ENDPOINT.api.share.get.51ad28a440b4a4f0f757824c.25.26"],
"tags": [
  "cyber",
  "structured"
],
"title": "DUMMY_ALERT_TYPE_1 @ 2012-01-01T13:43:00 [SCANNER_1 ]: 66.66.66.66 -> 10.0.0.1",
"url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c#1"
}
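The effect of the regex entries on "fullText" can be reproduced outside the platform. The sketch below is a JavaScript stand-in for the configured replacement; the platform itself applies Java regex, so this is an approximation rather than the actual implementation, and the function name padCommas is hypothetical.

```javascript
// Hypothetical stand-in for: "script": ",", "replacement": " , "
// Replaces every comma with a space-padded comma.
function padCommas(text) {
  return text.replace(/,/g, " , ");
}

var raw = "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States";
var padded = padCommas(raw);
// padded is now:
// "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States"
```

The result matches the "fullText" value shown in the output above.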
XPath
Neither regex nor JavaScript is well suited to extracting fields from HTML and XML.
For this reason, Infinit.e supports XPath 1.0 (with one minor extension that allows XPath to be combined with regex).
In this example, an XPath script is used as part of manual text extraction to convert a sample XML document into JSON.
XML
Source Input:
Consider the following XML file, which contains a price list for several food items.
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>light Belgian waffles covered with strawberries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>thick slices made from our homemade sourdough bread</description>
    <calories>600</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>
</breakfast_menu>
Source Configuration:
In the source configuration example below, an XPath script selects the elements to convert, and a JavaScript function then turns each converted record into a document.
{
  "links": {
    "extraMeta": [
      {
        "context": "First",
        "fieldName": "convert_to_json",
        "flags": "o",
        "script": "//breakfast_menu/food[*]",
        "scriptlang": "xpath"
      }
    ],
    "script": "function convert_to_docs(jsonarray, url)\n{\n var docs = [];\n for (var docIt in jsonarray) {\n var predoc = jsonarray[docIt];\n delete predoc.content;\n var doc = {};\n doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n doc.fullText = predoc;\n doc.title = \"TBD\";\n doc.description = \"TBD\";\n docs.push(doc);\n }\n return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
    "scriptflags": "d"
  }
Output:
The output is a series of JSON-formatted documents, one per food element (the first two are shown):
{
  "communityId": ["4d38b72c054548f038a0414a"],
  "created": "Jun 5, 2013 09:12:15 PM UTC",
  "description": "TBD",
  "fullText": "{ \"calories\" : \"650\" , \"description\" : \"two of our famous Belgian Waffles with plenty of real maple syrup\" , \"price\" : \"$5.95\" , \"name\" : \"Belgian Waffles\"}",
  "mediaType": ["News"],
  "metadata": {
    "json": [
      {
        "calories": "650",
        "description": "two of our famous Belgian Waffles with plenty of real maple syrup",
        "name": "Belgian Waffles",
        "price": "$5.95"
      }
    ]
  },
  "modified": "Jun 5, 2013 09:12:15 PM UTC",
  "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
  "source": ["aaa xml test"],
  "sourceKey": ["www.w3schools.com.xml.simple.xml"],
  "tags": ["tag1"],
  "title": "TBD",
  "url": "http://www.w3schools.com/xml/simple.xml#0"
}
{
  "communityId": ["4d38b72c054548f038a0414a"],
  "created": "Jun 5, 2013 09:12:15 PM UTC",
  "description": "TBD",
  "fullText": "{ \"calories\" : \"900\" , \"description\" : \"light Belgian waffles covered with strawberries and whipped cream\" , \"price\" : \"$7.95\" , \"name\" : \"Strawberry Belgian Waffles\"}",
  "mediaType": ["News"],
  "metadata": {
    "json": [
      {
        "calories": "900",
        "description": "light Belgian waffles covered with strawberries and whipped cream",
        "name": "Strawberry Belgian Waffles",
        "price": "$7.95"
      }
    ]
  },
  "modified": "Jun 5, 2013 09:12:15 PM UTC",
  "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
  "source": ["aaa xml test"],
  "sourceKey": ["www.w3schools.com.xml.simple.xml"],
  "tags": ["tag1"],
  "title": "TBD",
  "url": "http://www.w3schools.com/xml/simple.xml#1"
}
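For readability, here is the "script" string from the source configuration in unescaped form, run against a stubbed _doc. The stub is an assumption made purely for illustration (its query string, its two records, and the "content" property are invented); in the platform, _doc is supplied by the harvester.

```javascript
// Stub of the platform-provided _doc object (illustrative values only).
var _doc = {
  url: "http://www.w3schools.com/xml/simple.xml?refresh=1",
  metadata: {
    convert_to_json: [
      { name: "Belgian Waffles", price: "$5.95", calories: "650", content: "raw" },
      { name: "French Toast", price: "$4.50", calories: "600", content: "raw" }
    ]
  }
};

// Same logic as the "script" string in the configuration.
function convert_to_docs(jsonarray, url) {
  var docs = [];
  for (var docIt in jsonarray) {
    var predoc = jsonarray[docIt];
    delete predoc.content; // remove the "content" property if present
    var doc = {};
    // Strip any query string, then append the record index as a fragment.
    doc.url = _doc.url.replace(/[?].*/, "") + '#' + docIt;
    doc.fullText = predoc;
    doc.title = "TBD";
    doc.description = "TBD";
    docs.push(doc);
  }
  return docs;
}

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
```

Note that the function reads _doc.url directly rather than its url parameter; the result is the same here because the caller passes _doc.url. The "#" plus index suffix is what produces the "#0" and "#1" URLs seen in the output documents.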
Footnotes:
Legacy documentation:
- See under "simpleTextCleanser object"
- (Note: headers and footers are no longer supported; this can now be done manually.)