...
Parameter | Description | Note | Data Types |
---|---|---|---|
fieldName | Any string; the key for the generated array in "doc.metadata" | | |
scriptlang | | | |
script | | | |
flags | For javascript (defaults to "t" if none specified): "t": the script receives the doc fullText ("text"); "d": the script receives the entire doc ("_doc"); "m": the script receives the doc.metadata. There are also a few flags that provide additional variables in the javascript. For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. For reference, here is the complete set of flags for xpath (and regex, except for "O"): | | |
replace | | | |
store | | | |
index | | | |
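As a rough illustration of the javascript flags described in the table, the sketch below maps a "flags" string to the variables a script would see. The helper `buildScriptScope` and the sample document are hypothetical, and the variable name `_metadata` for the "m" flag is assumed from the Office Document example on this page; this is not the platform's actual implementation.

```javascript
// Hypothetical helper (not part of the platform): returns the variables
// that a javascript meta script would receive for a given "flags" string.
function buildScriptScope(flags, doc) {
  var effective = flags || "t"; // "t" is the documented default
  var scope = {};
  if (effective.indexOf("t") >= 0) scope.text = doc.fullText;      // doc fullText
  if (effective.indexOf("d") >= 0) scope._doc = doc;               // entire doc
  if (effective.indexOf("m") >= 0) scope._metadata = doc.metadata; // doc.metadata (assumed name)
  return scope;
}

// Made-up sample document, for illustration only
var sampleDoc = {
  url: "http://example.com/a",
  fullText: "hello world",
  metadata: { author: ["someone"] }
};

var defaultScope = buildScriptScope(undefined, sampleDoc); // only "text"
var metaScope = buildScriptScope("m", sampleDoc);          // only "_metadata"
var bothScope = buildScriptScope("dm", sampleDoc);         // "_doc" and "_metadata"
```

Combining flags, as in "dm", is assumed here to expose both variables.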
Using Script Languages to Generate Metadata
...
Info |
---|
If there are multiple "meta" objects with the same "fieldName", they form a "pipeline": each new object receives the previous array in the "_iterator" variable and then overwrites the previous entry's result.
|
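The chaining behavior described in the note above can be pictured with a small simulation. This is only a sketch under stated assumptions: `runMetaPipeline` is a hypothetical function, each stage is written as a plain function instead of a "script" string, and the previous array is passed in as the `_iterator` argument.

```javascript
// Hypothetical simulation of same-"fieldName" chaining; not the real engine.
function runMetaPipeline(metaObjects, doc) {
  var metadata = {};
  metaObjects.forEach(function (meta) {
    // Array produced by the previous stage (undefined for the first stage)
    var previous = metadata[meta.fieldName];
    var result = meta.run(doc.fullText, previous);
    // The new result overwrites the previous entry's result
    metadata[meta.fieldName] = Array.isArray(result) ? result : [result];
  });
  return metadata;
}

// Two made-up stages sharing the field name "words"
var stages = [
  {
    fieldName: "words",
    run: function (text, _iterator) { return text.split(" "); }
  },
  {
    fieldName: "words",
    run: function (text, _iterator) {
      return _iterator.map(function (w) { return w.toUpperCase(); });
    }
  }
];

var pipelineResult = runMetaPipeline(stages, { fullText: "alpha beta" });
```

After both stages run, `pipelineResult.words` holds only the second stage's output; the first stage's array survives only as the input to the second.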
XML file:
In the following example, the "contentMetadata" block uses javascript to convert the XML file data into metadata. Normally the "docMetadata"/"entities"/"associations" blocks would then be used to set the per-document titles, descriptions, entities etc.
...
Code Block |
---|
{
  "description": "wiy",
  "isPublic": true,
  "mediaType": "News",
  "tags": [ "tag1" ],
  "title": "aaa xml test",
  "processingPipeline": [
    {
      "feed": {
        "extraUrls": [ { "url": "http://www.w3schools.com/xml/simple.xml" } ],
        "updateCycle_secs": 86400
      }
    },
    {
      "links": {
        "extraMeta": [
          {
            "context": "First",
            "fieldName": "convert_to_json",
            "flags": "o",
            "script": "//breakfast_menu/food[*]",
            "scriptlang": "xpath"
          }
        ],
        "script": "function convert_to_docs(jsonarray, url)\n{\n var docs = [];\n for (var docIt in jsonarray) {\n var predoc = jsonarray[docIt];\n delete predoc.content;\n var doc = {};\n doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n doc.fullText = predoc;\n doc.title = \"TBD\";\n doc.description = \"TBD\";\n docs.push(doc);\n }\n return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
        "scriptflags": "d"
      }
    },
    {
      "contentMetadata": [
        {
          "fieldName": "json",
          "script": "var json = eval('('+text+')'); json;",
          "scriptlang": "javascript"
        }
      ]
    }
  ]
} |
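Because the "links" script is stored as an escaped JSON string, it is hard to read in place. The sketch below shows the same function unescaped and runs it against a stand-in `_doc`. The sample `_doc` (including the "?cache=1" query string and the item fields) is made up for illustration; on the platform, `_doc` would be supplied by the "d" script flag.

```javascript
// Stand-in for the document the platform would supply (hypothetical sample data)
var _doc = {
  url: "http://www.w3schools.com/xml/simple.xml?cache=1",
  metadata: {
    convert_to_json: [
      { name: "Sample item A", price: "1.00", content: "raw" },
      { name: "Sample item B", price: "2.00" }
    ]
  }
};

// The "links" script from the configuration above, unescaped
function convert_to_docs(jsonarray, url)
{
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;                 // drop the "content" field, if present
        var doc = {};
        // Strip any query string, then append the item index as a fragment
        doc.url = _doc.url.replace(/[?].*/,"") + '#' + docIt;
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}
var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
docs; // the script's last expression is its result
```

Note that the function reads `_doc.url` directly, so its `url` parameter is unused. Each array element produced by the xpath "extraMeta" entry becomes one document whose URL is the source URL plus "#" and the element's index.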
Office Document:
In the following example, the contentMetadata block specifies a javascript script that creates a metadata entity called "email_meta", which reports some metadata values for the "office" email type.
Code Block |
---|
}, {
"contentMetadata": [
{
"fieldName": "email_meta",
"script": "var x=_metadata._FILE_METADATA_[0].metadata;x;",
"scriptlang": "javascript",
"flags": "m"
}
]
}, |
In the sample output, we can see the new metadata entity "email_meta", which has been created by the contentMetadata block.
Code Block |
---|
], "email_meta": [
[
{
"Creation-Date": [
"2001-07-09T18:33:32Z"
],
"Message-To": [
"will.smith@enron.com"
],
"Content-Type": [
"message/rfc822"
],
"subject": [
"RE: Testing Preschedule workspace"
],
"date": [
"2001-07-09T18:33:32Z"
],
"Author": [
"cara.semperger@enron.com"
],
"Message-From": [
"cara.semperger@enron.com"
] |
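To make the script's behavior concrete, the following sketch evaluates the same expression against a mock `_metadata` object. The mock's shape is inferred from the sample output above and is an assumption, as is the idea that the platform stores the script's result as one element of the "email_meta" array.

```javascript
// Mock of the variable exposed by the "m" flag (shape inferred from the sample output)
var _metadata = {
  _FILE_METADATA_: [
    {
      metadata: [
        {
          "subject": ["RE: Testing Preschedule workspace"],
          "Author": ["cara.semperger@enron.com"],
          "Message-To": ["will.smith@enron.com"]
        }
      ]
    }
  ]
};

// Same expression as the "script" field in the contentMetadata block
var x = _metadata._FILE_METADATA_[0].metadata;

// Assumption: the script result becomes one element of the field's array
var email_meta = [x];

// Reading a value back out of the nested structure
var subject = email_meta[0][0].subject[0];
```

Under these assumptions, each value sits three levels deep: the field array, the script result, and the per-key value array.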
Regex
IN PROGRESS
Xpath
Neither regex nor javascript is well suited for extracting fields from HTML and XML (particularly since the current Javascript engine, the Java version of Rhino, does not support DOM).
...