...
Starting with the raw content (or the content transformed by a preceding manual or automated text pipeline element), this element applies the javascript, regex, or xpath transformation and writes the output to the document's full text (or description, title, or one of the textual metadata fields).
TODO
Format
TODO convert to JSON
...
- See under "simpleTextCleanser object"
- (note headers and footers are no longer supported - you can just do this manually now)
TODO
Description
Using manual text transformation you can specify the data source for your script to work on. The script is used to enrich the data from the data sources so it can be output as metadata for the creation of advanced entities and associations.
The following parameters are used in the configuration of manual text transformation:
Parameter | Description | Note | Data Type |
---|---|---|---|
fieldName | Specifies the data source that the script will execute against: "fullText", "description", or "title". | | |
script | Specifies your script. | | |
flags | Standard Java regex field. Can have different values depending on the script language; see below. javascript: a few flags provide additional variables in the javascript. xpath (and regex, except for "O"): additional Infinit.e-specific flags, described under xpath. | | |
replacement | The replacement string for a regex transformation. E.g. you could find each instance of C/M or C/F in a document and record that the Race is Caucasian. The same can be done to extract M or F as a Sex, meaning Male or Female. | | |
scriptlang | Specifies the language of the script that will be provided: one of "javascript", "regex", or "xpath". | | |
Supported Script Languages
You can program manual text extraction using the following supported languages:
- javascript
- regex
- xpath
javascript
For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.
By default, only one input variable is included: "text", which corresponds to the "fullText" field of the document JSON.
Info |
---|
If there are multiple "meta" objects with the same "fieldName", then they form a "pipeline", with each new object taking the old array, in the "_iterator" variable, and then overwriting the previous entry's result. |
Examples
For example, consider the following javascript, which (like the regex example above) pulls the address out of the example letter format.
Code Block | ||||
---|---|---|---|---|
| ||||
var i = text.indexOf("address:");
var j = text.indexOf("\n", i); // (starts looking after address)
var returnVal = null;
if (i >= 0 && j >= 0) {
returnVal = text.substring(i, j).trim();
}
returnVal; |
Info |
---|
Note the slightly unusual way in which the object/primitive is "returned": whatever is evaluated on the final line. The easiest way of managing this is to have a single standalone line containing a previously-declared "var" at the end. |
Then this would be embedded as follows in a "meta" object:
Code Block | ||
---|---|---|
| ||
{
"headerRegEx" : "^.*\\*+",
"footerRegEx" : "_+.*$",
"meta" : [ {
"context": "Body",
"fieldName": "addressMetadata",
"scriptlang": "javascript",
"script": "var i = text.indexOf(\"address:\");\nvar j = text.indexOf(\"\\n\", i);\nvar returnVal = null;\nif (i >= 0 && j >= 0) {\nreturnVal = text.substring(i, j).trim();\n}\nreturnVal;"
} ]
} |
The javascript can also return more complex objects, arrays of objects, or arrays of primitives.
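As a hedged illustration of returning a more complex value, the sketch below collects every "key: value" line from the text and returns an array of objects. The sample "text" value and the property names "key" and "value" are assumptions made up for this example; in a real "meta" script, "text" is supplied by the platform and only the final standalone line determines what is returned.

```javascript
// Sample input for illustration only; the platform normally provides "text".
var text = "name: Alice\naddress: 1 Main St\nnote without separator\n";

var pairs = [];
var lines = text.split("\n");
for (var n = 0; n < lines.length; n++) {
    var sep = lines[n].indexOf(":");
    if (sep > 0) {
        // One object per "key: value" line (property names are hypothetical).
        pairs.push({
            key: lines[n].substring(0, sep).trim(),
            value: lines[n].substring(sep + 1).trim()
        });
    }
}
pairs; // final standalone line: the value the script "returns"
```

Lines without a colon are skipped, so the array holds only the lines that look like key/value pairs.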
Note that using "\n"s in the embedded script is recommended, since runtime javascript errors (reported in the "harvest.harvest_message" field of the source object) will then map to the correct line number.
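One way to avoid hand-escaping quotes and newlines is to let a JSON serializer build the embedded string. The following is a minimal sketch that runs outside the platform (for example in Node.js); the variable names are illustrative, and it only shows that JSON.stringify preserves one statement per line in the "script" value.

```javascript
// The script as it would be written in an editor, one statement per line.
var scriptLines = [
    'var i = text.indexOf("address:");',
    'var j = text.indexOf("\\n", i);',
    'var returnVal = null;',
    'if (i >= 0 && j >= 0) {',
    '    returnVal = text.substring(i, j).trim();',
    '}',
    'returnVal;'
];
var script = scriptLines.join("\n");

// JSON.stringify escapes quotes, backslashes and newlines in the script.
var meta = {
    fieldName: "addressMetadata",
    scriptlang: "javascript",
    script: script
};
var json = JSON.stringify({ meta: [ meta ] }, null, 2);

// Round trip: the parsed script is identical to the original.
var parsed = JSON.parse(json).meta[0].script;
```

The resulting "json" string can be pasted into the source configuration, and line numbers in error messages should then correspond to the lines of the original script.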
Regex
The regular expression used to find the data labeled by fieldName is placed in the script string. This regular expression makes use of groups, specified by groupNum. A group is a pair of parentheses used to group subpatterns.
Examples
For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example:
Code Block | ||
---|---|---|
| ||
{
input: abbc
pattern: a(b*)c
} |
causes the substring bb to be captured by the group (b*). If the use of groups is not desired, groupNum should be set to the number 0 (zero), i.e. to get the entirety of the matching pattern.
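The group numbering described above can be checked with a few lines of javascript. The platform evaluates regex scripts with Java's regex engine, so this is only an illustration using javascript's own RegExp, which numbers groups the same way for these patterns; the variable names are made up for the example.

```javascript
var input = "abbc";
var match = /a(b*)c/.exec(input);

// Index 0 is the whole match; index 1 is the first capturing group.
var groupNum0 = match[0]; // entire matching pattern
var groupNum1 = match[1]; // text captured by (b*)

// Alternation inside a group: h(a|i)t accepts "hat" and "hit" only.
var alternation = /^h(a|i)t$/;
var results = ["hat", "hit", "hot"].map(function (word) {
    return alternation.test(word);
});
```

Here groupNum0 corresponds to "groupNum": 0 and groupNum1 to "groupNum": 1.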
If the purpose of the regular expression is to do a replace, the replace string can be specified in replace. For example,
Code Block | ||
---|---|---|
| ||
{
"fieldName" : "Race",
"context" : "All",
"regEx" : "C/[FM]",
"groupNum" : 0,
"replace" : "Caucasian"
} |
would find each instance of C/M or C/F in a document and record that the Race is Caucasian. The same can be done to extract M or F as a Sex, meaning Male or Female.
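The Race and Sex extraction can be mimicked in plain javascript. This is a sketch under assumptions: extractWithReplace is a hypothetical helper, not part of Infinit.e, and it assumes the replace string simply stands in for the matched text; the sample document string is also invented.

```javascript
// Hypothetical helper that imitates "regEx" + "groupNum" + "replace".
function extractWithReplace(text, pattern, groupNum, replace) {
    var re = new RegExp(pattern, "g");
    var values = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        // If a replace string is given, it is emitted instead of the match.
        values.push(replace !== undefined ? replace : m[groupNum]);
        if (m[0].length === 0) { re.lastIndex++; } // guard against empty matches
    }
    return values;
}

var doc = "Subject 1: C/M, age 34. Subject 2: C/F, age 29.";
var race = extractWithReplace(doc, "C/[FM]", 0, "Caucasian");
var sex = extractWithReplace(doc, "C/([FM])", 1);

// Inside a character class "|" is a literal, so [F|M] also accepts "C/|".
var pipeMatches = /C\/[F|M]/.test("C/|");
```

Using the character class [FM] avoids the stray match on "|"; the capturing group in the second pattern is what lets "groupNum": 1 return just the M or F.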
Other than a standard set of POSIX flags ("midun"), there are some additional, Infinit.e-specific, regex fields which are described under xpath, see below.
xpath
Neither regex nor javascript is well suited for extracting fields from HTML and XML (particularly since the current javascript engine, the Java version of Rhino, does not support DOM).
As a result, Infinit.e supports XPath 1.0 (with one minor extension to allow combined XPath regex).
Examples
Consider the following examples:
Code Block | ||
---|---|---|
| ||
<html>
<body>
<b>Check out this really great site for News &amp; more!</b>
<a href="http://www.bbc.com">BBC</a>
<i>List of my favorite topics</i>
<ul id="favTopics">
<li>Sport</li>
<li>TV</li>
</ul>
<i>List of my not-so favorite topics</i>
<ul class="ugly">
<li>The Topic of Radio</li>
<li>The Topic of News</li>
</ul>
</body>
</html> |
Code Block | ||
---|---|---|
| ||
"meta": [{
"context": "First",
"fieldName": "boldText",
"scriptlang": "xpath",
"script": "//b[1]" //can also be specified as /html[1]/body[1]/b[1]
},
{
"context": "First",
"fieldName": "boldTextDecoded",
"scriptlang": "xpath",
"script": "//b[1]",
"flags": "H" //will HTML-decode resulting fields
},
{
"context": "First",
"fieldName": "favoriteTopics",
"scriptlang": "xpath",
"script": "//ul[@id='favTopics']/li[*]" //The asterisk wildcard character can be used to specify all items
},
{
"context": "First",
"fieldName": "notFavoriteTopics",
"scriptlang": "xpath",
"script": "//ul[@class='ugly']/li[*]regex[The Topic of (.*)]", //Regex can be specified as a content filter
"groupNum": 1 //group number of regex
}
] |
would generate the following different outputs (note the use of "groupNum" to select which capturing group to display):
Code Block | ||
---|---|---|
| ||
"metadata": {
"boldText": [ "Check out this really great site for News &amp; more!" ],
"boldTextDecoded": [ "Check out this really great site for News & more!" ],
"favoriteTopics": [ "Sport", "TV" ],
"notFavoriteTopics": [ "Radio", "News" ]
} |
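The combined XPath-plus-regex script used for "notFavoriteTopics" can be thought of as two steps: select nodes with the XPath part, then filter each node's text with the regex part. The sketch below imitates only the second step in plain javascript. Both functions are hypothetical helpers, not Infinit.e's implementation, and the list of li texts is a hard-coded stand-in for what the XPath part would select.

```javascript
// Hypothetical helper: splits "<xpath>regex[<pattern>]" into its two parts.
function splitXPathRegex(script) {
    var m = /^(.*?)regex\[(.*)\]$/.exec(script);
    return m ? { xpath: m[1], regex: m[2] } : { xpath: script, regex: null };
}

// Applies the regex to already-extracted node texts and keeps the chosen group.
function filterByRegex(nodeTexts, pattern, groupNum) {
    var re = new RegExp(pattern);
    var out = [];
    for (var n = 0; n < nodeTexts.length; n++) {
        var m = re.exec(nodeTexts[n]);
        if (m) { out.push(m[groupNum]); }
    }
    return out;
}

var parts = splitXPathRegex("//ul[@class='ugly']/li[*]regex[The Topic of (.*)]");
// Stand-in for the nodes the XPath part would select from the sample HTML.
var liTexts = ["The Topic of Radio", "The Topic of News"];
var notFavoriteTopics = filterByRegex(liTexts, parts.regex, 1);
```

With groupNum set to 1 only the captured topic name is kept; with 0 the full "The Topic of ..." text would be returned.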
This final example shows how "groupNum": -1 can be used to grab the entire object instead of just the text. Note that this is now deprecated; use "flags": "o" for the same effect (see below).
Consider the HTML block:
Code Block | ||
---|---|---|
| ||
<html>
<body>
<a href="http://www.bbc.com">BBC</a>
</body>
</html> |
Then the following 3 XPath expressions:
Code Block | ||
---|---|---|
| ||
"meta": [{
"context": "First",
"fieldName": "test1",
"scriptlang": "xpath",
"script": "//a[1]"
},
{
"context": "First",
"fieldName": "test2",
// as above but with:
"flags": "o" // formerly "groupNum": -1
},
{
"context": "First",
"fieldName": "test3",
// as above but with:
"flags": "x"
}
] |
would generate the following different outputs:
Code Block | ||
---|---|---|
| ||
"metadata": {
"test1": [ "BBC" ],
"test2": [{
"href": "http://www.bbc.com",
"content": "BBC"
}],
"test3":"<a href=\"http://www.bbc.com\">BBC</a>"
} |
IN PROGRESS
Legacy documentation:
TODO
Examples
TODO