Overview

Starting with either the raw content or the content as transformed by a preceding manual or automated text pipeline element, this element applies the javascript, regex, or xpath transformation and writes the output to the document's full text, description, title, or one of the textual metadata fields.

...


Format

TODO convert to JSON

Code Block
{
	"display": string,
	"text": [
	{
		"fieldName":string, // One of "fullText", "description", "title"
		"script":string, // The script/xpath/javascript expression (see scriptlang below)
		"flags":string, // Standard Java regex field (regex/xpath only), plus "H" to decode HTML
		"replacement":string, // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
		"scriptlang":string, // One of "javascript", "regex", "xpath"
	}
	//..
	]
}

Description

Using manual text transformation, you can specify the data source for your script to work on. The script is used to enrich the data from the data sources so that it can be output as metadata for the creation of advanced entities and associations.

The following table describes the parameters of the manual text transformation configuration.

Parameter

Description

...

fieldName

Specifies the data source that the script will execute against

"fullText," "description," or "title"

...

script

Specify your script

...

flags

Standard Java regex field

Can have different values, based on scriptlang

See below.

...

javascript:

There are a few flags that provide additional variables in the javascript:

  • "m" to get "_doc.metadata", written into the variable "_metadata"
    • (for example, this flag can be used to copy a subset of the fields from one fieldName to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
  • "d" to get "_doc", written into the variable "_doc"
  • "t" to write the full text of the document into the variable "text"
    • If the "flags" field is not specified, "text" is populated by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.

 

 

...

xpath (and regex, except for "o"):

  • 'H': will HTML-decode resulting fields. (Eg "&amp;" -> "&")
  • 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields)
  • 'D': described above 
  • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)

...

 

...

 

replacement

If scriptlang is regex or xpath, replacement can be used to replace the value matched by the regex/xpath.

For example, you could match the instance C/M or C/F in a document and replace it with "Caucasian" to record the Race. The same approach can be used to extract M or F as a Sex, meaning Male or Female.

...

scriptlang

Specifies the language of the script that will be provided

One of "javascript," "regex," or "xpath"

...

 

Supported Script Languages

You can program manual text extraction using the following supported languages:

  • Javascript

See detailed example below.

  • Regex

See detailed example below.

  • Xpath

See detailed example below.

 

javascript

...

Examples

Javascript

For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.

...

By default, only one input variable is included: "text", which corresponds to the "fullText" field of the document JSON.
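The flag rules listed in the parameter table can be summarized in code. The sketch below is illustrative only: buildScriptVariables is a hypothetical helper (not part of Infinit.e), and the sample document is invented. It simply mirrors the documented behaviour of the "m", "d" and "t" flags.

```javascript
// Hypothetical helper: mirrors the documented flag rules, NOT the platform's actual code.
function buildScriptVariables(doc, flags) {
  const vars = {};
  const flagsSpecified = flags !== undefined && flags !== null;

  // "text" is populated by default, or when "t" is explicitly requested.
  if (!flagsSpecified || flags.includes("t")) {
    vars.text = doc.fullText;
  }
  // "d" exposes the whole document as "_doc".
  if (flagsSpecified && flags.includes("d")) {
    vars._doc = doc;
  }
  // "m" exposes the document metadata as "_metadata".
  if (flagsSpecified && flags.includes("m")) {
    vars._metadata = doc.metadata;
  }
  return vars;
}

// Invented sample document, for illustration only.
const sampleDoc = { fullText: "some text", metadata: { info: [] } };

console.log(Object.keys(buildScriptVariables(sampleDoc)));        // no flags
console.log(Object.keys(buildScriptVariables(sampleDoc, "m")));   // "text" is omitted
console.log(Object.keys(buildScriptVariables(sampleDoc, "mdt"))); // all three variables
```

Note the middle case: once "flags" is specified without "t", the "text" variable is no longer available to the script.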

Info

If there are multiple "meta" objects with the same "fieldName", then they form a "pipeline", with each new object taking the old array, in the "_iterator" variable, and then overwriting the previous entry's result.

Examples

For example, consider the following javascript, which (like the regex example above) pulls the address out of the example letter format.

Code Block
Simple javascript to be embedded in "meta" object
var i = text.indexOf("address:"); 
var j = text.indexOf("\n", i); // (starts looking after address)
var returnVal = null;
if (i >= 0 && j >= 0) {
   returnVal = text.substring(i, j).trim();
}
returnVal; 
Info

Note the slightly unusual way in which the object/primitive is "returned": whatever is evaluated on the final line. The easiest way of managing this is to have a single standalone line containing a previously-declared "var" at the end.

Then this would be embedded as follows in a "meta" object:

...

Source.unstructuredAnalysis object

...
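The "value of the final line is the result" convention described above can be tried outside the platform. The following is a minimal sketch, assuming Node.js and its built-in vm module as a stand-in for the platform's script engine; the sample letter text is invented for illustration.

```javascript
// Sketch only: Node's built-in "vm" module stands in for the platform's
// script engine (an assumption made purely for local experimentation).
const vm = require("vm");

// The address script from the example above, joined with "\n" line breaks.
const script = [
  'var i = text.indexOf("address:");',
  'var j = text.indexOf("\\n", i); // (starts looking after address)',
  'var returnVal = null;',
  'if (i >= 0 && j >= 0) {',
  '   returnVal = text.substring(i, j).trim();',
  '}',
  'returnVal;'
].join("\n");

// Invented sample input; "text" is the only variable provided by default.
const sampleText = "Dear Sir,\naddress: 10 Main Street\nRegards\n";

// runInNewContext returns the value of the last statement that was evaluated,
// which is how the standalone "returnVal;" line acts as the return value.
const result = vm.runInNewContext(script, { text: sampleText });
console.log(result);
```

As written, the script returns the whole matched line, including the "address:" label, because the substring starts at the index of the label itself.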

Log File From File Share

In the following example, manual text transformation is used to parse a log file over the web, with a script of type javascript.

The "globals" block defines a function called "decode", which splits a line of the sample input data on commas and returns the fields in an object called "info".

The captured metadata is then available as:

  • info.device
  • info.date
  • info.srcIP
  • info.dstIP
  • info.alert
  • info.country
Code Block
{
            "globals": {
                "scripts": [
                    "function decode(x)\n{\n    var info = {};   \n    var rec = x.split(',');   \n    info.device = rec[0];\n    info.date = rec[1];\n    info.srcIP = rec[2];\n    info.dstIP = rec[3];\n    info.alert = rec[4];\n    info.country = rec[5];\n    return info;\n}"
                ]
            }
        },

...

The javascript can also return more complex objects, arrays of objects, or arrays of primitives.

Note that using "\n"s in the embedded script is recommended, since runtime javascript errors (reported in the "harvest.harvest_message" field of the source object) will then map to the correct line number.
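One way to keep the "\n"s without typing them by hand is to write the function normally and let a JSON serializer do the escaping. This is a sketch under the assumption that the source JSON is prepared with Node.js before being pasted into the source editor; the decode function is the one from the example above.

```javascript
// The decode function from the "globals" example, written as readable source.
function decode(x) {
    var info = {};
    var rec = x.split(',');
    info.device = rec[0];
    info.date = rec[1];
    info.srcIP = rec[2];
    info.dstIP = rec[3];
    info.alert = rec[4];
    info.country = rec[5];
    return info;
}

// Function.prototype.toString() returns the source text, line breaks included.
const source = decode.toString();

// JSON.stringify escapes every line break as "\n", so the stored script
// keeps one statement per line and error line numbers stay meaningful.
const element = { globals: { scripts: [source] } };
const json = JSON.stringify(element, null, 4);
console.log(json);
```

Parsing the JSON back yields the original multi-line source, so nothing is lost in the round trip.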

Regex

The regular expression used to find the data labeled by fieldName is placed in the script string. This regular expression makes use of groups, specified by groupNum. A group is a pair of parentheses used to group subpatterns.

Examples

For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example:

...

{
   input:   abbc
   pattern: a(b*)c
}

causes the substring bb to be captured by the group (b*). If the use of groups is not desired, groupNum should be set to 0 (zero), i.e. to get the entirety of the matching pattern.
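The group behaviour described above can be checked with a few lines of code. JavaScript's RegExp is used here purely for illustration (the platform itself evaluates Java regular expressions); for these simple patterns the two behave the same way.

```javascript
// Group 0 is the entire match; group 1 is the first pair of parentheses.
const match = /a(b*)c/.exec("abbc");
const wholeMatch = match[0];  // equivalent to groupNum 0
const firstGroup = match[1];  // equivalent to groupNum 1
console.log(wholeMatch, firstGroup);

// Alternation inside a group: h(a|i)t matches "hat" and "hit" only.
const words = ["hat", "hit", "hot"];
const matched = words.filter(function (w) { return /^h(a|i)t$/.test(w); });
console.log(matched);
```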

If the purpose of the regular expression is to perform a replacement, the replacement string can be specified in replace. For example,

Code Block
Source.unstructuredAnalysis.meta object
{
   "fieldName" : "Race",
   "context" : "All",
   "regEx" : "C/[FM]",
   "groupNum" : 0,
   "replace" : "Caucasian"
}

would match the instance C/M or C/F in a document and record the Race as Caucasian. The same approach can be used to extract M or F as a Sex, meaning Male or Female.
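The match-then-replace behaviour can be sketched as follows. The extractField helper and the sample text are hypothetical, and JavaScript's RegExp again stands in for the Java regex engine; this is an illustration of the idea, not the platform's implementation.

```javascript
// Hypothetical helper: find the first match and substitute the replacement
// string for it, returning null when the pattern does not occur.
function extractField(text, pattern, replacement) {
  const m = pattern.exec(text);
  if (!m) {
    return null;
  }
  // The replacement may refer to capturing groups as $1, $2, ...
  return m[0].replace(pattern, replacement);
}

// Invented sample text for illustration.
const sample = "Subject described as C/F, approx. 30 years old";

const race = extractField(sample, /C\/[FM]/, "Caucasian");
const sexCode = extractField(sample, /C\/([FM])/, "$1");
const missing = extractField("no code here", /C\/[FM]/, "Caucasian");
console.log(race, sexCode, missing);
```

The second call shows a capturing group being carried into the replacement, which is how a single matched code can feed more than one field.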

Other than the standard set of POSIX flags ("midun"), there are some additional, Infinit.e-specific regex flags, which are described under XPath (see below).

xpath

Neither regex nor javascript are well suited for extracting fields from HTML and XML (particularly since the current Javascript engine, the Java version of Rhino, does not support DOM).

As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).

Examples

Consider the following examples:

Code Block
<html>
	<body>
		 <b>Check out this really great site for News &amp; more!</b>
		 <a href="http://www.bbc.com">BBC</a>
		<i>List of my favorite topics</i>
		<ul id="favTopics">
			<li>Sport</li>
			<li>TV</li>
		</ul>
		<i>List of my not-so favorite topics</i>
		<ul class="ugly">
			<li>The Topic of Radio</li>
			<li>The Topic of News</li>
		</ul>
	</body>
</html>
Code Block
"meta": [{
	"context": "First",
	"fieldName": "boldText",
	"scriptlang": "xpath",
	"script": "//b[1]" //can also be specified as /html[1]/body[1]/b[1]
	},
	{
	"context": "First",
	"fieldName": "boldTextDecoded",
	"scriptlang": "xpath",
	"script": "//b[1]",
	"flags": "H" //will HTML-decode resulting fields
	},
	{
	"context": "First",
	"fieldName": "favoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@id='favTopics']/li[*]" //The asterisk wildcard character can be used to specify all items
	},
	{
	"context": "First",
	"fieldName": "notFavoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@class='ugly']/li[*]regex[The Topic of (.*)]", //Regex can be specified as a content filter
	"groupNum": 1 //group number of regex
	}
] 

would generate the following different outputs (note the use of "groupNum" to select which capturing group to display):

Code Block
"metadata": {
	"boldText": [ "Check out this really great site for News &amp; more!" ],
	"boldTextDecoded": [ "Check out this really great site for News & more!" ],
	"favoriteTopics": [ "Sport", "TV" ],
	"notFavoriteTopics": [ "Radio", "News" ],
 }
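The combined XPath and regex form used for "notFavoriteTopics" can be pictured as two steps: the XPath part selects nodes, and the regex part filters their text and picks a capturing group. The sketch below only illustrates that idea. How Infinit.e actually parses the expression is not documented here, so splitScript and applyContentFilter are hypothetical helpers, and the node texts are supplied by hand rather than produced by a real XPath engine.

```javascript
// Hypothetical: separate "<xpath>regex[<pattern>]" into its two parts.
function splitScript(script) {
  const m = /^(.*?)regex\[(.*)\]$/.exec(script);
  return m ? { xpath: m[1], regex: m[2] } : { xpath: script, regex: null };
}

// Hypothetical: keep only node texts that match, returning the chosen group.
function applyContentFilter(nodeTexts, pattern, groupNum) {
  const regex = new RegExp(pattern);
  const results = [];
  for (const nodeText of nodeTexts) {
    const m = regex.exec(nodeText);
    if (m && m[groupNum] !== undefined) {
      results.push(m[groupNum]);
    }
  }
  return results;
}

const parts = splitScript("//ul[@class='ugly']/li[*]regex[The Topic of (.*)]");

// Stand-in for the <li> texts that the XPath part selects in the HTML above.
const liTexts = ["The Topic of Radio", "The Topic of News"];

const notFavoriteTopics = applyContentFilter(liTexts, parts.regex, 1);
console.log(parts.xpath, notFavoriteTopics);
```

With groupNum set to 0 instead of 1, the same filter would return the full node texts.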

This final example shows how "groupNum": -1 can be used to grab the entire object instead of just the text. Note that this is now deprecated; use "flags": "o" for the same effect (see below).

Consider the HTML block:

Code Block
<html>
	<body>
		<a href="http://www.bbc.com">BBC</a>
	</body>
</html>

Then the following 3 XPath expressions:

Code Block
"meta": [{
	"context": "First",
	"fieldName": "test1",
	"scriptlang": "xpath",
	"script": "//a[1]"
},
{
	"context": "First",
	"fieldName": "test2",
	// as above but with:
	"flags": "o" // formerly "groupNum": -1
},
{
	"context": "First",
	"fieldName": "test3",
	// as above but with:
	"flags": "x"
}
] 

would generate the following different outputs:

Code Block
"metadata": {
	"test1": [ "BBC" ],
	"test2": [{
		"href": "http://www.bbc.com",
		"content": "BBC"
	}],
	"test3":"<a href=\"http://www.bbc.com\">BBC</a>"
}

 

IN PROGRESS

Legacy documentation:

...

 {
            "harvest": {
                "searchCycle_secs": 3600
            }
        },
        {
            "docMetadata": {
                "title": "$metadata.info.alert @ $metadata.info.date [$metadata.info.device]: $metadata.info.dstIP -> $metadata.info.srcIP",
                "publishedDate": "$SCRIPT( return _doc.metadata.info[0].date; )"
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "info",
                    "script": "var info = decode(text); info;",
                    "scriptlang": "javascript"
                }
            ]
        }

 

Metadata:

The metadata captured from the sample input data then appears in the output document, as follows:

Code Block
 ],
    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
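Several values in the metadata above carry stray spaces (for example "SCANNER_1 " and " 10.0.0.1"), because the text is split on commas after the separators have been padded to " , ". A possible variant of the decode function that trims each field is sketched below; decodeTrimmed is a hypothetical name and the trimming is an addition, not part of the original source configuration.

```javascript
// Hypothetical variant of decode(): same field order, but each field is trimmed.
function decodeTrimmed(x) {
    var rec = x.split(',').map(function (field) { return field.trim(); });
    var info = {};
    info.device = rec[0];
    info.date = rec[1];
    info.srcIP = rec[2];
    info.dstIP = rec[3];
    info.alert = rec[4];
    info.country = rec[5];
    return info;
}

// The "fullText" value from the example document above.
var sampleFullText = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";

var trimmedInfo = decodeTrimmed(sampleFullText);
console.log(trimmedInfo);
```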

 


 


Regex

Log File

Source:

Consider the following alarm logs which include a record of device alerts, including their network and physical locations.

Code Block
Device,Date,SrcIP,dstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States
SCANNER_2,2012-02-01T14:21:00,10.0.0.2,66.66.66.66,DUMMY_ALERT_TYPE_2,United Kingdom
SCANNER_3,2012-03-01T15:17:00,10.0.0.1,99.66.99.66,DUMMY_ALERT_TYPE_3,Netherlands

 

Source Configuration:

In the source configuration, a regex script is used to extract data to make up the "fullText" and "description" of the resulting document.

Code Block
   },
        {
            "text": [
                {
                    "fieldName": "fullText",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                },
                {
                    "fieldName": "description",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                }
            ]
        },


Output:

The example output includes the "fullText" which results from the regex script.

Code Block
  }
    ],
    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
    "modified": "Jun 4, 2013 12:54:34 AM UTC",
    "publishedDate": "January 1, 2012 13:43:00 PM UTC",
    "source": ["Cyber Logs Test"],
    "sourceKey": ["INFINITE_ENDPOINT.api.share.get.51ad28a440b4a4f0f757824c.25.26"],
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "DUMMY_ALERT_TYPE_1  @ 2012-01-01T13:43:00 [SCANNER_1 ]: 66.66.66.66 -> 10.0.0.1",
    "url": "http://INFINITE_ENDPOINT/api/share/get/51ad28a440b4a4f0f757824c#1"
}
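The effect of the regex element in the source configuration above (script "," with replacement " , ") can be reproduced in a couple of lines. JavaScript's String.replace is used here as a stand-in for the platform's Java regex handling, so the "g" flag is needed to replace every comma; the "md" flags from the configuration are not modelled.

```javascript
// First data line of the sample log shown in the Source section.
var rawLine = "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States";

// Replace every comma with a comma padded by spaces.
var spacedLine = rawLine.replace(/,/g, " , ");
console.log(spacedLine);
```

The result matches the "fullText" value in the example output.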

 


Xpath


In this example, an XPath script is used as part of manual text extraction, in order to convert a sample XML document into JSON.

XML

Source Input:

Consider the following XML file, which includes a price list for several food items.

Code Block
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
   <food>
      <name>Belgian Waffles</name>
      <price>$5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
   </food>
   <food>
      <name>Strawberry Belgian Waffles</name>
      <price>$7.95</price>
      <description>light Belgian waffles covered with strawberries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>Berry-Berry Belgian Waffles</name>
      <price>$8.95</price>
      <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
      <calories>900</calories>
   </food>
   <food>
      <name>French Toast</name>
      <price>$4.50</price>
      <description>thick slices made from our homemade sourdough bread</description>
      <calories>600</calories>
   </food>
   <food>
      <name>Homestyle Breakfast</name>
      <price>$6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
   </food>
</breakfast_menu>

Source Configuration:

In the source configuration example below, an XPath script is specified to perform the JSON conversion.

Code Block
  {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }

 

Output:

The output is a series of JSON-formatted documents, one for each food element (the first two are shown):

 

Code Block
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{
 \"calories\" : \"650\" , \"description\" : \"two of our famous Belgian 
Waffles with plenty of real maple syrup\" , \"price\" : \"$5.95\" , 
\"name\" : \"Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "650",
        "description": "two of our famous Belgian Waffles with plenty of real maple syrup",
        "name": "Belgian Waffles",
        "price": "$5.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#0"
}
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{
 \"calories\" : \"900\" , \"description\" : \"light Belgian waffles 
covered with strawberries and whipped cream\" , \"price\" : \"$7.95\" , 
\"name\" : \"Strawberry Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "900",
        "description": "light Belgian waffles covered with strawberries and whipped cream",
        "name": "Strawberry Belgian Waffles",
        "price": "$7.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#1"
}
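For readability, the embedded "script" string from the source configuration above is expanded below and run against a stub. The _doc object here is a hand-written stand-in for the variable that the platform provides when "scriptflags" is "d"; its "content" values are invented, and only two of the five food items are included.

```javascript
// Stub for the platform-provided _doc; the "content" values are hypothetical.
var _doc = {
    url: "http://www.w3schools.com/xml/simple.xml",
    metadata: {
        convert_to_json: [
            { name: "Belgian Waffles", price: "$5.95", calories: "650", content: "(element text)" },
            { name: "Strawberry Belgian Waffles", price: "$7.95", calories: "900", content: "(element text)" }
        ]
    }
};

// The script from the "links" object, unescaped; the logic is unchanged.
function convert_to_docs(jsonarray, url) {
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;            // drop the raw element text
        var doc = {};
        doc.url = _doc.url.replace(/[?].*/, "") + '#' + docIt; // strip any query string, add index
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
console.log(docs.map(function (d) { return d.url; }));
```

Each array element becomes its own document, and the "#0", "#1" suffixes seen in the output URLs come from the array index.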

 

 


 
