...
This page has been broken down into the following sections for ease of localization.
Format
TODO Convert to JSON
Code Block |
---|
{
  "display": string,
  "contentMetadata": [
    {
      "fieldName": string, // Any string, the key for the generated array in "doc.metadata"
      "scriptlang": string, // One of "javascript", "regex", "xpath", "stream"
      "script": string, // The script that will generate the array in "doc.metadata" (under fieldName)
      "flags": string, // Standard Java regex flags (regex/xpath only), plus "H" to decode HTML and "D" to allow duplicate strings (by default they are de-duplicated), plus the following custom flags:
        // For javascript (defaults to "t" if none specified): "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata)
        // For xpath/stream: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
      "replace": string, // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
      "store": Boolean, // Whether this field should be stored in the DB or discarded after the harvest processing
      "index": Boolean // Whether this field should be full-text indexed or just stored in the DB
    },
    ...
  ]
} |
Description
The following table describes the parameters of the content metadata configuration.
Parameter | Description | Data Type |
---|---|---|
fieldName | Any string; the key for the generated array in "doc.metadata". | string |
scriptlang | One of javascript, regex, xpath, or stream. "stream" provides an efficient parser for splitting potentially large XML/JSON objects into many smaller objects, returned either as text chunks or as metadata (if flags is set to "o"). | string |
script | The script that will generate the array, e.g. the regular expression, XPath string, or JS script. | string |
flags | For javascript (defaults to "t" if none is specified): "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata). A few further flags provide additional variables in the javascript. For xpath/stream: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. The remaining flags for xpath and regex are the standard Java regex flags, plus "H" to decode HTML and "D" to allow duplicate strings (by default they are de-duplicated). | string |
replace | Replacement string for regex/xpath+regex matches; can include capturing groups as $1 etc. | string |
store | Whether this field should be stored in the DB or discarded after the harvest processing. | Boolean |
index | Whether this field should be full-text indexed or just stored in the DB. | Boolean |
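As a rough illustration of the parameters in the table, the sketch below checks a single contentMetadata entry before it is added to a source. `validateMetadataSpec` is a hypothetical helper and not part of the platform, and treating "fieldName", "scriptlang" and "script" as mandatory is an assumption made for the example.

```javascript
// Hypothetical helper (not part of the platform): sanity-check one contentMetadata entry.
var SCRIPT_LANGS = ["javascript", "regex", "xpath", "stream"];

function validateMetadataSpec(spec) {
  var errors = [];

  // Assumption: these three fields are treated as mandatory non-empty strings.
  ["fieldName", "scriptlang", "script"].forEach(function (key) {
    if (typeof spec[key] !== "string" || spec[key].length === 0) {
      errors.push(key + " must be a non-empty string");
    }
  });

  // scriptlang must be one of the four documented values.
  if (typeof spec.scriptlang === "string" && spec.scriptlang.length > 0 &&
      SCRIPT_LANGS.indexOf(spec.scriptlang) === -1) {
    errors.push("scriptlang must be one of " + SCRIPT_LANGS.join(", "));
  }

  // Optional string fields.
  ["flags", "replace"].forEach(function (key) {
    if (spec[key] !== undefined && typeof spec[key] !== "string") {
      errors.push(key + " must be a string");
    }
  });

  // Optional Boolean fields.
  ["store", "index"].forEach(function (key) {
    if (spec[key] !== undefined && typeof spec[key] !== "boolean") {
      errors.push(key + " must be a Boolean");
    }
  });

  return errors;
}

var okSpec = { fieldName: "json", scriptlang: "javascript", script: "var json = eval('('+text+')'); json;" };
var badSpec = { fieldName: "x", scriptlang: "perl", script: "y", store: "yes" };
var okErrors = validateMetadataSpec(okSpec);
var badErrors = validateMetadataSpec(badSpec);
```

An empty array means the entry passed these checks; it says nothing about whether the script itself is valid.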
Examples
Javascript
For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.
Info |
---|
If multiple "meta" objects share the same "fieldName", they form a "pipeline": each new object receives the previous array in the "_iterator" variable, and its result overwrites the previous entry's result. A few flags also provide additional variables in the javascript (see the flags parameter in the Description table).
|
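The pipeline behaviour described in the note can be pictured with a small simulation. This is only a sketch under stated assumptions: `runPipeline` and the two stage functions are hypothetical, and the real harvester may pass "_iterator" differently from the plain function argument used here.

```javascript
// Simplified simulation of the "pipeline" idea; not the harvester's implementation.
// Each stage stands in for one "meta" object sharing the same fieldName.
function runPipeline(stages, text) {
  var result; // undefined before the first stage has run
  stages.forEach(function (stage) {
    // The previous stage's array is handed over as "_iterator",
    // and the new return value overwrites the previous result.
    result = stage(text, result);
  });
  return result;
}

var stages = [
  // Stage 1: build an array from the document text.
  function (text, _iterator) { return text.split(/\s+/); },
  // Stage 2: transform the array produced by stage 1.
  function (text, _iterator) {
    return _iterator.map(function (word) { return word.toUpperCase(); });
  }
];

var pipelineResult = runPipeline(stages, "alpha beta");
```

Only the last stage's array survives, which is the overwriting behaviour the note describes.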
XML
In the following example, the "contentMetadata" block uses javascript to convert the XML file data into metadata. Normally, "docMetadata", "entities", and "associations" blocks would then be used to set the per-document titles, descriptions, entities, etc.
Code Block |
---|
{
  "description": "wiy",
  "isPublic": true,
  "mediaType": "News",
  "tags": [ "tag1" ],
  "title": "aaa xml test",
  "processingPipeline": [
    {
      "feed": {
        "extraUrls": [
          { "url": "http://www.w3schools.com/xml/simple.xml" }
        ],
        "updateCycle_secs": 86400
      }
    },
    {
      "links": {
        "extraMeta": [
          {
            "context": "First",
            "fieldName": "convert_to_json",
            "flags": "o",
            "script": "//breakfast_menu/food[*]",
            "scriptlang": "xpath"
          }
        ],
        "script": "function convert_to_docs(jsonarray, url)\n{\n var docs = [];\n for (var docIt in jsonarray) {\n var predoc = jsonarray[docIt];\n delete predoc.content;\n var doc = {};\n doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n doc.fullText = predoc;\n doc.title = \"TBD\";\n doc.description = \"TBD\";\n docs.push(doc);\n }\n return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
        "scriptflags": "d"
      }
    },
    {
      "contentMetadata": [
        {
          "fieldName": "json",
          "script": "var json = eval('('+text+')'); json;",
          "scriptlang": "javascript"
        }
      ]
    }
  ]
} |
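The script embedded in the "links" block is easier to follow with its "\n" escapes expanded. The version below is the same function laid out as ordinary javascript, followed by a hypothetical `_doc` stub so that it can be run outside the harvester. The stub's URL query string and menu values are illustrative assumptions, not harvester output.

```javascript
// Hypothetical stand-in for the "_doc" object that the "d" scriptflag would supply.
var _doc = {
  url: "http://www.w3schools.com/xml/simple.xml?sample=1", // query string added for illustration
  metadata: {
    convert_to_json: [
      { name: "Belgian Waffles", price: "$5.95", content: "raw text" },
      { name: "French Toast", price: "$4.50", content: "raw text" }
    ]
  }
};

// Same logic as the embedded script: one new document per array element.
function convert_to_docs(jsonarray, url) {
  var docs = [];
  for (var docIt in jsonarray) {
    var predoc = jsonarray[docIt];
    delete predoc.content;                 // drop the raw content field
    var doc = {};
    // Strip any query string, then make the URL unique per element.
    // (As in the original, the global _doc.url is used rather than the url argument.)
    doc.url = _doc.url.replace(/[?].*/, "") + '#' + docIt;
    doc.fullText = predoc;                 // the JSON object becomes the document text
    doc.title = "TBD";
    doc.description = "TBD";
    docs.push(doc);
  }
  return docs;
}

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
```

Each element selected by the XPath expression becomes its own document, distinguished by the "#" suffix on the URL.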
Office Document
In the following example, the contentMetadata block specifies a javascript script that creates a metadata field called "email_meta". This field reports some metadata values for the "office" email type.
Code Block |
---|
},
{
  "contentMetadata": [
    {
      "fieldName": "email_meta",
      "script": "var x=_metadata._FILE_METADATA_[0].metadata;x;",
      "scriptlang": "javascript",
      "flags": "m"
    }
  ]
}, |
In the sample output, we can see the new metadata field "email_meta", which has been created by the contentMetadata block.
Code Block |
---|
],
"email_meta": [
  [
    {
      "Creation-Date": [ "2001-07-09T18:33:32Z" ],
      "Message-To": [ "will.smith@enron.com" ],
      "Content-Type": [ "message/rfc822" ],
      "subject": [ "RE: Testing Preschedule workspace" ],
      "date": [ "2001-07-09T18:33:32Z" ],
      "Author": [ "cara.semperger@enron.com" ],
      "Message-From": [ "cara.semperger@enron.com" ] |
...
Regex
XML
In the code block below, regex scripts create a metadata field called "organization". This field can then be referenced by the scripts that create entities and associations.
Code Block |
---|
},
{
  "contentMetadata": [
    {
      "fieldName": "organization",
      "script": "believed the (.*?)(?: \\([^)]*\\))? (was|were) responsible",
      "scriptlang": "regex"
    },
    {
      "fieldName": "organization",
      "script": "believed (.*?)(?: \\([^)]*\\))? (was|were) responsible",
      "scriptlang": "regex"
    },
    {
      "fieldName": "organization",
      "script": ". ([^.]*?)(?: \\([^)]*\\))? claimed responsibility\\.$",
      "scriptlang": "regex"
    }
  ]
}, |
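To see what the three "organization" expressions capture, they can be tried against sample sentences in plain javascript. This is only an illustration of the patterns themselves: the `firstGroups` helper and the sample sentences are hypothetical, taking the first capturing group as the extracted value is an assumption, and nothing here shows how the harvester combines results from entries that share a fieldName.

```javascript
// The three patterns from the configuration, with the JSON string escaping removed.
var patterns = [
  /believed the (.*?)(?: \([^)]*\))? (was|were) responsible/g,
  /believed (.*?)(?: \([^)]*\))? (was|were) responsible/g,
  /. ([^.]*?)(?: \([^)]*\))? claimed responsibility\.$/g
];

// Hypothetical helper: return the first capturing group of every match in the text.
function firstGroups(pattern, text) {
  var out = [];
  var m;
  pattern.lastIndex = 0; // global regexes keep state between calls
  while ((m = pattern.exec(text)) !== null) {
    out.push(m[1]);
    if (m[0].length === 0) pattern.lastIndex++; // guard against zero-length matches
  }
  return out;
}

// Made-up sample sentences.
var a = firstGroups(patterns[0], "Police believed the Example Brigade (EB) was responsible.");
var b = firstGroups(patterns[1], "Officials believed Acme Front were responsible.");
var c = firstGroups(patterns[2], "A bomb exploded. The Example Group (EG) claimed responsibility.");
```

In each case the optional non-capturing group swallows a parenthesised abbreviation, so only the organization name lands in the first group.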
In the code block below, an entity of type "Organization" with the dimension "Who" is created by referencing the metadata field "organization" as "$metadata.organization".
Code Block |
---|
},
{
  "entities": [
    {
      "creationCriteriaScript": "$FUNC( isOrganizationSpecified(); )",
      "dimension": "Who",
      "disambiguated_name": "$metadata.organization",
      "type": "Organization",
      "useDocGeo": false
...
}, |
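The "creationCriteriaScript" calls isOrganizationSpecified(), whose body is not shown on this page. One plausible shape for such a function is sketched below, purely as an assumption: both the function body and the `_metadata` stub are hypothetical, and the variable the real script reads from may differ.

```javascript
// Hypothetical stand-in for the document metadata produced by the regex entries.
var _metadata = { organization: ["Example Group"] };

// Hypothetical body: only create the entity when the "organization"
// metadata field exists and holds at least one value.
function isOrganizationSpecified() {
  var orgs = _metadata.organization;
  return orgs != null && orgs.length > 0;
}

var shouldCreateEntity = isOrganizationSpecified();
```

A guard of this kind would keep documents with no regex match from producing an entity with an empty name.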
...
Xpath
...
Neither regex nor javascript is well suited to extracting fields from HTML and XML, particularly since the current Javascript engine (the Java version of Rhino) does not support the DOM.
The following example shows how Xpath can be used to extract embedded HTML from an XML document for the creation of entities and associations.
XML
The example XML data contains some severe weather incident reports. For each report, we would like to extract the embedded HTML to create entities.
...