
...

For power users, metadata can be generated from the content using JavaScript. This makes it possible to apply site- or source-specific knowledge to pull out metadata that can then be turned into entities or associations.

By default, only one input variable is included: "text", which corresponds to the "fullText" field of the document JSON.

Info
If multiple "meta" objects share the same "fieldName", they form a "pipeline": each new object receives the previous object's output array in the "_iterator" variable, and its result then overwrites the previous entry's result.

Examples

For example, consider the following javascript, which (like the regex example above) pulls the address out of the example letter format.

Code Block

language	javascript
title	Simple javascript to be embedded in "meta" object

var i = text.indexOf("address:"); 
var j = text.indexOf("\n", i); // (starts looking after address)
var returnVal = null;
if (i >= 0 && j >= 0) {
   returnVal = text.substring(i, j).trim();
}
returnVal;

Info
Note the slightly unusual way in which the object/primitive is "returned": whatever is evaluated on the final line. The easiest way of managing this is to have a single standalone line containing a previously-declared "var" at the end.
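The "value of the final line" convention can be tried outside the harvester. The sketch below uses Node.js and eval, whose result is likewise the value of the last statement evaluated, as a stand-in for the real script engine. The runMetaScript helper and the sample letter are hypothetical and are not part of Infinit.e.

```javascript
// Hypothetical stand-in for the harvester: runs a "meta" script with the
// document text bound to the "text" variable and returns the value of the
// last statement evaluated (this is how a direct eval behaves).
function runMetaScript(script, text) {
    var runner = new Function("text", "return eval(" + JSON.stringify(script) + ");");
    return runner(text);
}

// The address script from above, one statement per line.
var addressScript = [
    'var i = text.indexOf("address:");',
    'var j = text.indexOf("\\n", i); // (starts looking after address)',
    'var returnVal = null;',
    'if (i >= 0 && j >= 0) {',
    '   returnVal = text.substring(i, j).trim();',
    '}',
    'returnVal;'
].join("\n");

// Made-up sample letter.
var letter = "Dear Sir or Madam\naddress: 12 Example Road\nYours faithfully\n";

var found = runMetaScript(addressScript, letter);
var missing = runMetaScript(addressScript, "no label in this text\n");
console.log(found);
console.log(missing);
```

Because substring starts at i, the value found for the sample letter still includes the "address:" label; when the label is absent the script evaluates to null.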

Then this would be embedded as follows in a "meta" object:

...

title	Source.unstructuredAnalysis object

...

{
   "context": "Body",
   "fieldName": "addressMetadata",
   "scriptlang": "javascript",
   "script": "var i = text.indexOf(\"address:\");\nvar j = text.indexOf('\\n', i);\nvar returnVal = null;\nif (i >= 0 && j >= 0) {\nreturnVal = text.substring(i, j).trim();\n}\nreturnVal;"
} ]
}

log file from file share

In the following example, manual text transformation is used to parse a log file over the web, with a script of type javascript.

Code Block

        },
        {
            "contentMetadata": [
                {
                    "fieldName": "info",
                    "script": "var info = decode(text); info;",
                    "scriptlang": "javascript"
                }
            ]
        },
        {
            "text": [
                {
                    "fieldName": "fullText",
                    "script": ",",
                    "scriptlang": "regex",

...

The JavaScript can also return more complex objects, arrays of objects, or arrays of primitives.

Note that using "\n"s in the embedded script is recommended, since runtime JavaScript errors (reported in the "harvest.harvest_message" field of the source object) will then point to the correct line number.
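Escaping a multi-line script by hand is error-prone, so one option is to let JSON.stringify build the "script" string. This is a minimal sketch, assuming the source JSON is assembled in Node.js before being submitted; the variable names are illustrative only.

```javascript
// Keep the script as one statement per line so that line numbers in error
// messages stay meaningful.
var scriptLines = [
    'var i = text.indexOf("address:");',
    'var j = text.indexOf("\\n", i);',
    'var returnVal = null;',
    'if (i >= 0 && j >= 0) {',
    '   returnVal = text.substring(i, j).trim();',
    '}',
    'returnVal;'
];

var metaObject = {
    fieldName: "addressMetadata",
    scriptlang: "javascript",
    script: scriptLines.join("\n") // real newlines, written as \n by JSON.stringify
};

var metaJson = JSON.stringify(metaObject, null, 3);
console.log(metaJson);

// Round trip: the parsed script still has one statement per line.
var roundTripLines = JSON.parse(metaJson).script.split("\n");
console.log(roundTripLines.length);
```

Note that the newline inside the script's own string literal is double-escaped in the JSON, while the separators between statements are single "\n" escapes.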

Regex

The regular expression used to find the data labeled by fieldName is placed in the script string. This regular expression makes use of groups, specified by groupNum. A group is a pair of parentheses used to group subpatterns.

Examples

For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example:

...

{
   input:   abbc
   pattern: a(b*)c
}

causes the substring bb to be captured by the group (b*). If groups are not needed, groupNum should be set to 0 (zero), i.e. to get the entirety of the matching pattern.
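The group numbering described above can be checked with a few lines of JavaScript. The extractGroup helper is a hypothetical illustration of what "groupNum" selects, not the harvester's own code.

```javascript
// Hypothetical helper: returns the text captured by group "groupNum" for the
// first match of "pattern" in "input", or null if there is no match.
// Group 0 is the entire match.
function extractGroup(pattern, input, groupNum) {
    var match = new RegExp(pattern).exec(input);
    if (match === null || groupNum >= match.length) {
        return null;
    }
    return match[groupNum];
}

var captured = extractGroup("a(b*)c", "abbc", 1);    // text matched by (b*)
var wholeMatch = extractGroup("a(b*)c", "abbc", 0);  // entire match
var vowel = extractGroup("h(a|i)t", "a big hit", 1); // which alternative matched
console.log(captured, wholeMatch, vowel);
```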

If the purpose of the regular expression is to do a replace, the replacement string can be specified in replace. For example,

Code Block

title	Source.unstructuredAnalysis.meta object

{
   "fieldName" : "Race",
   "context" : "All",
   "regEx" : "C/[FM]",
   "groupNum" : 0,
   "replace" : "Caucasian"
}

would find each instance of C/M or C/F in a document and record the Race as Caucasian. The same approach can be used to extract M or F as a Sex field meaning Male or Female.
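As a rough illustration of the behaviour just described, the sketch below emits the replace string once per match. It is an assumption about the semantics based on the description above, not Infinit.e's implementation; extractWithReplace, the two "Sex" objects, and the sample sentence are hypothetical.

```javascript
// Hypothetical helper mimicking a "meta" object with "regEx", "groupNum" and
// "replace": every match contributes the replace string if one is given,
// otherwise the text of the requested group.
function extractWithReplace(meta, text) {
    var regex = new RegExp(meta.regEx, "g");
    var values = [];
    var match;
    while ((match = regex.exec(text)) !== null) {
        values.push(meta.replace !== undefined ? meta.replace : match[meta.groupNum]);
        if (match[0].length === 0) {
            regex.lastIndex++; // avoid looping forever on empty matches
        }
    }
    return values;
}

var raceMeta = { fieldName: "Race", regEx: "C/[FM]", groupNum: 0, replace: "Caucasian" };
var maleMeta = { fieldName: "Sex", regEx: "C/M", groupNum: 0, replace: "Male" };
var femaleMeta = { fieldName: "Sex", regEx: "C/F", groupNum: 0, replace: "Female" };

var sample = "Suspect description: C/M, approximately 30 years old.";
var race = extractWithReplace(raceMeta, sample);
var sex = extractWithReplace(maleMeta, sample).concat(extractWithReplace(femaleMeta, sample));
console.log(race, sex);
```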

Other than a standard set of POSIX flags ("midun"), there are some additional, Infinit.e-specific, regex fields, which are described under XPath below.

xpath

Neither regex nor JavaScript is well suited to extracting fields from HTML and XML (particularly since the current JavaScript engine, the Java version of Rhino, does not support DOM).

As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).

Examples

Consider the following examples:

Code Block

language	html/xml

<html>
	<body>
		 <b>Check out this really great site for News &amp; more!</b>
		 <a href="http://www.bbc.com">BBC</a>
		<i>List of my favorite topics</i>
		<ul id="favTopics">
			<li>Sport</li>
			<li>TV</li>
		</ul>
		<i>List of my not-so favorite topics</i>
		<ul class="ugly">
			<li>The Topic of Radio</li>
			<li>The Topic of News</li>
		</ul>
	</body>
</html>

Code Block

language	javascript

"meta": [{
	"context": "First",
	"fieldName": "boldText",
	"scriptlang": "xpath",
	"script": "//b[1]" //can also be specified as /html[1]/body[1]/b[1]
	},
	{
	"context": "First",
	"fieldName": "boldTextDecoded",
	"scriptlang": "xpath",
	"script": "//b[1]",
	"flags": "H" //will HTML-decode resulting fields
	},
	{
	"context": "First",
	"fieldName": "favoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@id='favTopics']/li[*]" //The asterisk wildcard character can be used to specify all items
	},
	{
	"context": "First",
	"fieldName": "notFavoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@class='ugly']/li[*]regex[The Topic of (.*)]", //Regex can be specified as a content filter
	"groupNum": 1 //group number of regex
	}
]

would generate the following different outputs (note the use of "groupNum" to select which capturing group to display):

Code Block

language	javascript

"metadata": {
	"boldText": [ "Check out this really great site for News &amp; more!" ],
	"boldTextDecoded": [ "Check out this really great site for News & more!" ],
	"favoriteTopics": [ "Sport", "TV" ],
	"notFavoriteTopics": [ "Radio", "News" ]
}
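The combined XPath and regex form can be pictured as two steps: the XPath part selects nodes, then the regex part filters their text. Node.js has no built-in XPath engine, so the sketch below only covers the second step and takes the selected node texts as given. Both helpers are hypothetical, and the assumption that the filter is always a trailing "regex[...]" suffix is drawn from the example above rather than from the Infinit.e source.

```javascript
// Hypothetical helper: splits a combined script such as
//   //ul[@class='ugly']/li[*]regex[The Topic of (.*)]
// into its XPath part and its regex part.
function splitXPathRegex(script) {
    var match = /^(.*?)regex\[(.*)\]$/.exec(script);
    if (match === null) {
        return { xpath: script, regex: null };
    }
    return { xpath: match[1], regex: match[2] };
}

// Applies the regex part to node texts already selected by the XPath part
// and keeps the requested capturing group of each match.
function filterNodeTexts(nodeTexts, regexSource, groupNum) {
    var regex = new RegExp(regexSource);
    var results = [];
    for (var n = 0; n < nodeTexts.length; n++) {
        var match = regex.exec(nodeTexts[n]);
        if (match !== null) {
            results.push(match[groupNum]);
        }
    }
    return results;
}

var parts = splitXPathRegex("//ul[@class='ugly']/li[*]regex[The Topic of (.*)]");
// Text of the two <li> elements that the XPath part selects in the HTML above.
var liTexts = ["The Topic of Radio", "The Topic of News"];
var notFavoriteTopics = filterNodeTexts(liTexts, parts.regex, 1);
console.log(parts.xpath, notFavoriteTopics);
```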

This final example shows how "groupNum": -1 can be used to grab the entire object instead of just the text. Note that this is now deprecated; use "flags": "o" for the same effect (see below).

Consider the HTML block:

Code Block

language	html/xml

<html>
	<body>
		<a href="http://www.bbc.com">BBC</a>
	</body>
</html>

Then the following three XPath expressions:

Code Block

language	javascript

"meta": [{
	"context": "First",
	"fieldName": "test1",
	"scriptlang": "xpath",
	"script": "//a[1]"
},
{
	"context": "First",
	"fieldName": "test2",
	// as above but with:
	"flags": "o" // formerly "groupNum": -1
},
{
	"context": "First",
	"fieldName": "test3",
	// as above but with:
	"flags": "x"
}
]

would generate the following different outputs:

Code Block

language	javascript

"metadata": {
	"test1": [ "BBC" ],
	"test2": [{
		"href": "http://www.bbc.com",
		"content": "BBC"
	}],
	"test3":"<a href=\"http://www.bbc.com\">BBC</a>"
}

IN PROGRESS

Legacy documentation:

Unstructured Analysis - Overview

...

        "flags": "md",
                    "replacement": " , "
                },
                {
                    "fieldName": "description",
                    "script": ",",
                    "scriptlang": "regex",
                    "flags": "md",
                    "replacement": " , "
                }
            ]
        },

After "globals" has been used to define a variable called info, info can be used to capture the metadata for the sample input data. The metadata captured in this example is as follows:

info.date
info.srcIP
info.dstIP
info.alert
info.country

This captured metadata from the sample input data can then be used as output for the script.

Code Block

    ],
    "fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States",
    "mediaType": ["Log"],
    "metadata": {"info": [{
        "alert": "DUMMY_ALERT_TYPE_1 ",
        "country": "United States",
        "date": "2012-01-01T13:43:00",
        "device": "SCANNER_1 ",
        "dstIP": "66.66.66.66",
        "srcIP": " 10.0.0.1"
    }]},
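The "globals" script that defines decode is not shown in this section. The following is a hypothetical sketch of what such a function could look like for the sample line, with the field order inferred from the sample output. Unlike that output, this version trims the whitespace around each value.

```javascript
// Hypothetical decode(): splits one log line on commas and names the fields.
// The field order is an assumption based on the sample output above.
function decode(text) {
    var fields = text.split(",");
    if (fields.length < 6) {
        return null; // not a line in the expected six-column format
    }
    return {
        device: fields[0].trim(),
        date: fields[1].trim(),
        srcIP: fields[2].trim(),
        dstIP: fields[3].trim(),
        alert: fields[4].trim(),
        country: fields[5].trim()
    };
}

// Same shape as the embedded script: the last line is the "returned" value.
var text = "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States";
var info = decode(text);
info;
```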

The JavaScript can also return more complex objects, arrays of objects, or arrays of primitives.

Regex

xml

The following example shows how a regex script can be used to manually parse the text of the ingested data.

Code Block

        },
        {
            "contentMetadata": [
                {
                    "fieldName": "organization",
                    "script": "believed the (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": "believed (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": ".  ([^.]*?)(?: \\([^)]*\\))? claimed responsibility\\.$",
                    "scriptlang": "regex"
                }
            ]
        },

In this example snippet, the manual text transformation defines a field called "organization" and uses regexes to search the input XML data for matches. In this case, the XML data is an incident report.

The sample output reports that no known "organization" was implicated.

Code Block

    }],
    "multipledays": ["No"],
    "organization": ["No group"],
    "perpetrator": [{
        "characteristic": "Islamic Extremist (Sunni)",
        "nationality": "Unknown"
    }],

xpath

Neither regex nor JavaScript is well suited to extracting fields from HTML and XML (particularly since the current JavaScript engine, the Java version of Rhino, does not support DOM).

As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).

In this example, an xpath script is used as part of manual text extraction, in order to convert a sample XML document into JSON.

Code Block

        },
        {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }
        },
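The embedded "script" string is hard to read in its escaped form. The sketch below shows the same script unescaped, run against a mock _doc so that it can be tried in Node.js. The mock values, including the query string on the URL, are hypothetical; "content" stands for the element text that object output carries alongside the attributes.

```javascript
// Mock of the document the harvester would supply as "_doc" (hypothetical values).
var _doc = {
    url: "http://www.w3schools.com/xml/simple.xml?nocache=1",
    metadata: {
        convert_to_json: [
            { name: "Belgian Waffles", price: "$5.95", calories: "650", content: "..." },
            { name: "Strawberry Belgian Waffles", price: "$7.95", calories: "900", content: "..." }
        ]
    }
};

// The embedded script, unescaped.
function convert_to_docs(jsonarray, url)
{
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;                  // drop the raw element text
        var doc = {};
        doc.url = _doc.url.replace(/[?].*/,"") + '#' + docIt; // strip query, add index
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}
var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
docs;
```

Each array element produced by the XPath expression becomes one document, and the index appended after '#' keeps the generated URLs distinct.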

The sample output is then a series of JSON documents. For example:

Code Block

{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"650\" , \"description\" : \"two of our famous Belgian Waffles with plenty of real maple syrup\" , \"price\" : \"$5.95\" , \"name\" : \"Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "650",
        "description": "two of our famous Belgian Waffles with plenty of real maple syrup",
        "name": "Belgian Waffles",
        "price": "$5.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#0"
}
{
    "communityId": ["4d38b72c054548f038a0414a"],
    "created": "Jun 5, 2013 09:12:15 PM UTC",
    "description": "TBD",
    "fullText": "{ \"calories\" : \"900\" , \"description\" : \"light Belgian waffles covered with strawberries and whipped cream\" , \"price\" : \"$7.95\" , \"name\" : \"Strawberry Belgian Waffles\"}",
    "mediaType": ["News"],
    "metadata": {"json": [{
        "calories": "900",
        "description": "light Belgian waffles covered with strawberries and whipped cream",
        "name": "Strawberry Belgian Waffles",
        "price": "$7.95"
    }]},
    "modified": "Jun 5, 2013 09:12:15 PM UTC",
    "publishedDate": "Jun 5, 2013 09:12:15 PM UTC",
    "source": ["aaa xml test"],
    "sourceKey": ["www.w3schools.com.xml.simple.xml"],
    "tags": ["tag1"],
    "title": "TBD",
    "url": "http://www.w3schools.com/xml/simple.xml#1"
}
