
The Infinit.e Unstructured Analysis Harvester enriches harvested data using specified regular expressions and specified text extraction modules.

The example Source.unstructuredAnalysis objects below demonstrate the basic features of specifying how to enrich harvested unstructured data.

There is also a reference page for the Unstructured Analysis configuration object.

The Harvesting Process

Specifying a document's Header/Footer

The harvester first evaluates the headerRegEx and footerRegEx strings. Each is a regular expression that identifies the document's header or footer, if it has one. This makes it possible to exclude the formatting that commonly appears in headers and footers and that would otherwise produce incorrectly extracted entities or events. Note that the header/footer patterns are matched in DOTALL mode (other regexes are not, unless the flag "d" is specified).

For Example:

Address: 123 Sample Dr. Woodbridge, VA 22191

To: John Doe

********************

The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Sincerely,

Sample Person

_________________________

Sample Person

Address

Company

{
   "headerRegEx" : "^.*\\*+",
   "footerRegEx" : "_+.*$",
   "meta" : [ {
       "context": "Header",
       "fieldName": "addressMetadata",
       "scriptlang": "regex",
       "script": "address:\\s*(.*)",
       "groupNum": 1,
       "flags": "i" // (i for case-insensitive)
   } ]
}

The above shows valid regular expressions for parsing the sample document's header and footer, and will create the following metadata object:

{
   "metadata": {
      "addressMetadata": [     
         "123 Sample Dr. Woodbridge, VA 22191"
      ]
   }
}
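
The header/footer split can be approximated outside the harvester. The following is a minimal sketch, not the harvester's actual implementation: splitSections is a hypothetical helper, and JavaScript's "s" flag stands in for DOTALL mode. In the sample letter the "Address:" line falls inside the header match, so the sketch applies the metadata regex to the header section.

```javascript
// Hypothetical helper: splits raw text into header, body and footer.
function splitSections(text, headerRegEx, footerRegEx) {
  // "s" is JavaScript's DOTALL flag, so "." also matches newlines.
  var headerMatch = new RegExp(headerRegEx, "s").exec(text);
  var header = headerMatch ? headerMatch[0] : "";
  var rest = text.substring(header.length);
  var footerMatch = new RegExp(footerRegEx, "s").exec(rest);
  var footer = footerMatch ? footerMatch[0] : "";
  var body = rest.substring(0, rest.length - footer.length);
  return { header: header, body: body, footer: footer };
}

var letter = [
  "Address: 123 Sample Dr. Woodbridge, VA 22191",
  "To: John Doe",
  "********************",
  "The quick brown fox jumps over the lazy dog.",
  "Sincerely,",
  "Sample Person",
  "_________________________",
  "Sample Person",
  "Address",
  "Company"
].join("\n");

var sections = splitSections(letter, "^.*\\*+", "_+.*$");

// Same pattern, group number and "i" flag as the "meta" entry above.
var match = /address:\s*(.*)/i.exec(sections.header);
var addressMetadata = match ? [match[1]] : [];
```

Because the greedy header pattern runs up to the last asterisk, everything before the body (including the address line) ends up in the header section.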

Specifying data as metadata

After checking whether a header/footer is specified, the harvester processes every object in the meta list. Each meta object's fieldName is the key (label) under which the extracted values are stored in the metadata. context is an enum that specifies which part of the document the script (a regular expression when scriptlang is "regex") is applied to. context has 5 possible values:

First - If this is specified, the script/regex is applied before any text cleansing (ie on the entirety of the raw content)
Header - If this is specified, headerRegEx must also be specified, and then the script/regex is only applied to the "header" section
Footer - If this is specified, footerRegEx must also be specified, and then the script/regex is only applied to the "footer" section
Body - Checks the text not specified as the header or footer. If there is no body to the feed, the description of the feed will be checked.
All - This checks for the regular expression in all areas

The regular expression used to find the data labeled by fieldName is placed in the script string. This regular expression makes use of groups, specified by groupNum. A group is a pair of parentheses used to group subpatterns. For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example:

{
   input:   abbc
   pattern: a(b*)c
}

causes the substring bb to be captured by the group (b*). If the use of groups is not desired, groupNum should be set to the number 0 (zero), ie to get the entirety of the matching pattern.
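
As a quick illustration of how the group number selects what is extracted, here is a minimal sketch. extractGroup is a hypothetical helper written for this page, not part of Infinit.e; it uses JavaScript's RegExp, whose group numbering works the same way.

```javascript
// Hypothetical helper mirroring "script" + "groupNum" for scriptlang "regex".
function extractGroup(input, pattern, groupNum) {
  var match = new RegExp(pattern).exec(input);
  if (match === null || groupNum >= match.length) {
    return null; // no match, or no such group
  }
  return match[groupNum]; // index 0 is the entire matching pattern
}

var captured = extractGroup("abbc", "a(b*)c", 1);   // the text captured by (b*)
var wholeMatch = extractGroup("abbc", "a(b*)c", 0); // the entire match
var vowel = extractGroup("hit", "h(a|i)t", 1);      // the alternative that matched
```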

If the purpose of the regular expression is to do a replace, the replacement string can be specified in replace.
For Example:

{
   "fieldName" : "Race",
   "context" : "All",
   "regEx" : "C/[FM]",
   "groupNum" : 0,
   "replace" : "Caucasian"
}

would find instances of C/M or C/F in a document and record that the Race is Caucasian. The same approach can be used to extract M or F as a Sex field meaning Male or Female.
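
The effect of replace can be sketched as follows. This is an illustration under an assumption: that each match simply contributes the replace string to the field. The harvester's exact substitution semantics are not described here, and extractWithReplace is a hypothetical helper.

```javascript
// Hypothetical helper: every match contributes the "replace" string,
// or the selected group when no replace string is given.
function extractWithReplace(text, pattern, groupNum, replace) {
  var regex = new RegExp(pattern, "g");
  var values = [];
  var match;
  while ((match = regex.exec(text)) !== null) {
    values.push(replace !== undefined ? replace : match[groupNum]);
    if (match[0] === "") {
      regex.lastIndex++; // avoid looping forever on empty matches
    }
  }
  return values;
}

var sample = "Subject: C/M, age 34";
var race = extractWithReplace(sample, "C/[FM]", 0, "Caucasian");
var sex = extractWithReplace(sample, "C/([FM])", 1);
```

Here race holds the replacement label, while sex holds the raw captured letter, which a second entry with its own replace string could turn into Male or Female.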

Other than a standard set of POSIX flags ("midun"), there are some additional, Infinit.e-specific regex flags, which are described under XPath (see below).

Using Javascript to generate metadata

For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.

By default, only one input variable is included: "text", which corresponds to the "fullText" field of the document JSON.

If multiple "meta" objects share the same "fieldName", they form a "pipeline": each new object receives the old array in the "_iterator" variable, and its output overwrites the previous entry's result.

There are also a few flags that provide additional variables in the javascript:

"m" to get "_doc.metadata", written into the variable "_metadata"
- (for example this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
"d" to get "_doc", written into the variable "_doc",
"t" to return the full text of the document into "text".
- If the "flags" field is not specified, this is returned by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
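
The rules above can be summarized in a short sketch. buildScriptScope is a hypothetical helper that only models which variables a script would see for a given "flags" string; it is not how the harvester itself is implemented.

```javascript
// Hypothetical helper: builds the variables visible to a script,
// following the flag rules listed above.
function buildScriptScope(doc, flags) {
  var scope = {};
  var noFlags = (flags === undefined || flags === null);
  if (noFlags || flags.indexOf("t") >= 0) {
    scope.text = doc.fullText; // default when "flags" is not specified
  }
  if (!noFlags && flags.indexOf("m") >= 0) {
    scope._metadata = doc.metadata;
  }
  if (!noFlags && flags.indexOf("d") >= 0) {
    scope._doc = doc;
  }
  return scope;
}

var doc = { fullText: "Address: 123 Sample Dr.", metadata: { countryName: ["US"] } };
var defaultScope = buildScriptScope(doc);      // only "text"
var metadataOnly = buildScriptScope(doc, "m"); // only "_metadata"
var both = buildScriptScope(doc, "mt");        // "_metadata" and "text"
```

The middle case is the one that tends to surprise: once "flags" is given without "t", the script can no longer rely on "text".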

For example, consider the following javascript, which (like the regex example above) pulls the address out of the example letter format.

var label = "address:";
var i = text.toLowerCase().indexOf(label); // (case-insensitive, since the letter uses "Address:")
var j = text.indexOf("\n", i); // (starts looking after address)
var returnVal = null;
if (i >= 0 && j >= 0) {
   returnVal = text.substring(i + label.length, j).trim(); // (skips the label itself)
}
returnVal;

Note the slightly unusual way in which the object/primitive is "returned": it is whatever is evaluated on the final line. The easiest way of managing this is to have a single standalone line containing a previously-declared "var" at the end.

Then this would be embedded as follows in a "meta" object:

{
   "headerRegEx" : "^.*\\*+",
   "footerRegEx" : "_+.*$",
   "meta" : [ {
       "context": "Body",
       "fieldName": "addressMetadata",
       "scriptlang": "javascript",
       "script": "var label = \"address:\";\nvar i = text.toLowerCase().indexOf(label);\nvar j = text.indexOf('\\n', i);\nvar returnVal = null;\nif (i >= 0 && j >= 0) {\n   returnVal = text.substring(i + label.length, j).trim();\n}\nreturnVal;"
   } ]
}

The javascript can also return more complex objects, arrays of objects, or arrays of primitives.

Note that using "\n"s in the embedded script is recommended, since runtime javascript errors (reported in the "harvest.harvest_message" field of the source object) will then report a meaningful line number.
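
To try a script locally before embedding it, something like the following sketch can help. It makes two assumptions: runScript is a hypothetical stand-in for the harvester's script runner (the real engine is Rhino), and the "wordStats" field name is invented for the example. Direct eval returns the value of the last statement, which mimics the "final line" convention, and JSON.stringify takes care of escaping the quotes and "\n"s.

```javascript
// Hypothetical stand-in for the harvester's script runner.
function runScript(script, text) {
  // Direct eval can see the local variable "text" and returns the
  // completion value of the last statement in the script.
  return eval(script);
}

// One statement per line, joined with "\n" so error line numbers stay meaningful.
var script = [
  "var words = text.split(/\\s+/);",
  "var result = { first: words[0], count: words.length };",
  "result;"
].join("\n");

var value = runScript(script, "quick brown fox");

// Let the JSON serializer produce the embedded, escaped form of the script.
var metaEntry = { fieldName: "wordStats", scriptlang: "javascript", script: script };
var embedded = JSON.stringify(metaEntry, null, 3);
```

The script here returns an object rather than a primitive, and the serialized "script" string can be pasted into the source as-is.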

The same security restrictions as for the Structured Analysis Harvester apply.

Using XPath to generate metadata

Neither regex nor javascript is well suited for extracting fields from HTML and XML (particularly since the current Javascript engine, the Java version of Rhino, does not support DOM).

As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).

Consider the following examples:

<html>
	<body>
		 <b>Check out this really great site for News &amp; more!</b>
		 <a href="http://www.bbc.com">BBC</a>
		<i>List of my favorite topics</i>
		<ul id="favTopics">
			<li>Sport</li>
			<li>TV</li>
		</ul>
		<i>List of my not-so favorite topics</i>
		<ul class="ugly">
			<li>The Topic of Radio</li>
			<li>The Topic of News</li>
		</ul>
	</body>
</html>

"meta": [{
	"context": "First",
	"fieldName": "boldText",
	"scriptlang": "xpath",
	"script": "//b[1]" //can also be specified as /html[1]/body[1]/b[1]
	},
	{
	"context": "First",
	"fieldName": "boldTextDecoded",
	"scriptlang": "xpath",
	"script": "//b[1]",
	"flags": "H" //will HTML-decode resulting fields
	},
	{
	"context": "First",
	"fieldName": "favoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@id='favTopics']/li[*]" //The asterisk wildcard character can be used to specify all items
	},
	{
	"context": "First",
	"fieldName": "notFavoriteTopics",
	"scriptlang": "xpath",
	"script": "//ul[@class='ugly']/li[*]regex(The Topic of (.*))", //Regex can be specified as a content filter
	"groupNum": 1 //group number of regex
	}
]

would generate the following different outputs (note the use of "groupNum" to select which capturing group to display):

"metadata": {
	"boldText": [ "Check out this really great site for News &amp; more!" ],
	"boldTextDecoded": [ "Check out this really great site for News & more!" ],
	"favoriteTopics": [ "Sport", "TV" ],
	"notFavoriteTopics": [ "Radio", "News" ]
 }
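
The combined XPath/regex form can be pictured as two steps: the XPath part selects nodes, then the regex part filters and trims their text. The sketch below only models the second step and the splitting of the script string. Both helpers are hypothetical, the node texts are supplied by hand rather than by a real XPath engine, and the harvester's actual parsing may differ.

```javascript
// Hypothetical helper: separates "<xpath>regex(<pattern>)" into its two parts.
function splitXPathRegex(script) {
  var parts = /^(.*?)regex\((.*)\)$/.exec(script);
  if (parts === null) {
    return { xpath: script, regex: null }; // plain XPath, no regex filter
  }
  return { xpath: parts[1], regex: parts[2] };
}

// Applies the regex part to the text of nodes the XPath part is assumed to have selected.
function filterNodeTexts(nodeTexts, pattern, groupNum) {
  var regex = new RegExp(pattern);
  var values = [];
  nodeTexts.forEach(function (nodeText) {
    var match = regex.exec(nodeText);
    if (match !== null) {
      values.push(match[groupNum]);
    }
  });
  return values;
}

var spec = splitXPathRegex("//ul[@class='ugly']/li[*]regex(The Topic of (.*))");
var selectedTexts = ["The Topic of Radio", "The Topic of News"];
var notFavoriteTopics = filterNodeTexts(selectedTexts, spec.regex, 1);
```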

This final example shows how "groupNum": -1 can be used to grab the entire object instead of just the text. Note that this is now deprecated; use "flags": "o" for the same effect (see below).

Consider the HTML block:

<html>
	<body>
		<a href="http://www.bbc.com">BBC</a>
	</body>
</html>

Then the following 3 XPath expressions:

"meta": [{
	"context": "First",
	"fieldName": "test1",
	"scriptlang": "xpath",
	"script": "//a[1]"
},
{
	"context": "First",
	"fieldName": "test2",
	// as above but with:
	"flags": "o" // formerly "groupNum": -1
},
{
	"context": "First",
	"fieldName": "test3",
	// as above but with:
	"flags": "x"
}
]

would generate the following different outputs:

"metadata": {
	"test1": [ "BBC" ],
	"test2": [{
		"href": "http://www.bbc.com",
		"content": "BBC"
	}],
	"test3":"<a href=\"http://www.bbc.com\">BBC</a>"
}

For reference, here is the complete set of flags for xpath (and regex, except for "o"):

'H': will HTML-decode resulting fields. (Eg "&amp;" -> "&")
'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields)
'D': described above
'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
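
As an illustration of what the 'H' flag does to a value, here is a minimal sketch. decodeHtmlEntities is a hypothetical helper that only covers a handful of common entities; the harvester's own decoder presumably handles many more.

```javascript
// Hypothetical helper: decodes a few common HTML entities in one pass.
function decodeHtmlEntities(value) {
  var entities = { "amp": "&", "lt": "<", "gt": ">", "quot": "\"", "#39": "'" };
  return value.replace(/&(amp|lt|gt|quot|#39);/g, function (whole, name) {
    return entities[name];
  });
}

var boldText = "Check out this really great site for News &amp; more!";
var boldTextDecoded = decodeHtmlEntities(boldText);
```

Decoding in a single pass matters: a doubly-encoded value such as "&amp;lt;" is unwrapped by one level only, rather than collapsing all the way to "<".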

Lookup tables in the Unstructured Analysis Handler

It is possible to add lookup tables from JSON shares that can be used in all the javascript scripts in the unstructured analysis handler (and also the structured analysis handler).

These lookup tables provide a limited form of aliasing at harvest time (also check out the full query-time aliasing capability), and are useful in many other cases where a potentially large and dynamic lookup table is needed.

Using the lookup technology is easy:

At the top level of the "unstructuredAnalysis" object, create a "caches" object that consists of the following:
- For every lookup table, a local name you specify and then the "_id" field of a JSON share (see the share API documentation; shares can also be uploaded via the File Uploader). For example:

"unstructuredAnalysis": {
	"caches": {
		"myLookupTable": "4e0c7e99eb5af0fbdcfbf697"
	}
}

Then within any script in the "unstructuredAnalysis", you can access the JSON object by indexing "_cache" with the local name specified as above. For example, say the following JSON object has been uploaded:

{
	//...
	"US": [ "United States", "USA", "United States of America" ],
	"UK": [ "United Kingdom", "Great Britain", "GB" ],
	//...
}

Then the lookup table could be used as follows:

{
	"unstructuredAnalysis": {
		// (caches object specified as above)
		//...
		"meta": [
			//...
			{
				"fieldName": "disambiguatedCountryName",
				"context": "All",
				"flags": "m",
				"scriptlang": "javascript",
				"script": "_cache.myLookupTable[ _metadata.countryName[0] ];"
			}
		],
		//...
	}
}
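
The one-line script above fails if the document has no "countryName" field, and yields nothing useful for keys missing from the table. A more defensive variant might look like the following sketch. It assumes the table maps each key to an array of names (one plausible layout for the share), and it sets up "_cache" and "_metadata" by hand purely so the logic can be tried outside the harvester; lookupCountry is a hypothetical helper.

```javascript
// Stand-ins for the variables the harvester would provide (assumed layout).
var _cache = {
  myLookupTable: {
    "US": ["United States", "USA", "United States of America"],
    "UK": ["United Kingdom", "Great Britain", "GB"]
  }
};
var _metadata = { countryName: ["US"] };

// Hypothetical helper: guarded version of the one-line lookup script.
function lookupCountry(cache, metadata) {
  var values = metadata.countryName;
  if (!values || values.length === 0) {
    return null; // nothing to look up
  }
  var entry = cache.myLookupTable[values[0]];
  // Fall back to the raw value when the table has no entry for it.
  return entry === undefined ? values[0] : entry;
}

var disambiguatedCountryName = lookupCountry(_cache, _metadata);
```

Inside a real "script" field the same logic would end with a standalone line holding the result variable, following the convention described earlier.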