...

Parameter   Description   Note   Data Types
fieldName
Any string; the key for the generated array in "doc.metadata"
  
scriptlang   
script   
flags

For javascript (defaults to "t" if none specified): "t", the script receives the doc fullText ("text"); "d", the script receives the entire doc (_doc); "m", the script receives the doc.metadata.
 

There are also a few flags that provide additional variables in the javascript:

  • "m" to get "_doc.metadata", written into the variable "_metadata"
    • (for example this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
  • "d" to get "_doc", written into the variable "_doc"
  • "t" to get the full text of the document, written into the variable "text"
    • If the "flags" field is not specified, "text" is populated by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
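
The flag rules above can be modelled in a few lines of JavaScript. This is only an illustrative sketch, not the platform's implementation: "buildScriptVariables" and the sample document are hypothetical names introduced here.

```javascript
// Illustrative model of the "flags" rules above; not the platform's code.
// buildScriptVariables is a hypothetical helper introduced for this sketch.
function buildScriptVariables(doc, flags) {
    // If "flags" is not specified at all, only "text" is populated.
    var effective = (flags === undefined || flags === null) ? "t" : flags;
    var vars = {};
    if (effective.indexOf("t") >= 0) { vars.text = doc.fullText; }
    if (effective.indexOf("d") >= 0) { vars._doc = doc; }
    if (effective.indexOf("m") >= 0) { vars._metadata = doc.metadata; }
    return vars;
}

// Made-up sample document.
var sampleDoc = { fullText: "hello", metadata: { author: ["a"] } };
var defaults = buildScriptVariables(sampleDoc);       // "text" only
var metaOnly = buildScriptVariables(sampleDoc, "m");  // "_metadata" only, no "text"
var both = buildScriptVariables(sampleDoc, "mt");     // "_metadata" and "text"
```

Note the "metaOnly" case: once "flags" is given without "t", a script that still reads "text" would find it unpopulated.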
  

For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array.

For reference, here is the complete set of flags for xpath (and regex, except for "o"):

  • 'H': will HTML-decode resulting fields. (e.g. "&amp;" -> "&")
  • 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (e.g. stripping of fields)
  • 'D': described above 
  • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
  
replace   
store   
index   

 

Using Script Languages to Generate Metadata

...

Info

If there are multiple "meta" objects with the same "fieldName", they form a "pipeline": each new object receives the previous array in the "_iterator" variable, and its result then overwrites the previous entry's result.
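
A minimal sketch of this pipeline behaviour, assuming (as the note says) that "_iterator" holds the whole previous array. "runMetaPipeline" and the "run" property are hypothetical stand-ins for the platform executing each object's "script"; they are not part of the platform.

```javascript
// Illustrative model of the "pipeline" behaviour; not the platform's code.
// runMetaPipeline and "run" are hypothetical names introduced for this sketch.
function runMetaPipeline(metaObjects, metadata) {
    for (var i = 0; i < metaObjects.length; i++) {
        var meta = metaObjects[i];
        // The array produced by an earlier object with the same fieldName
        // (undefined for the first one) is handed over as _iterator.
        var _iterator = metadata[meta.fieldName];
        // The new result overwrites the previous entry's result.
        metadata[meta.fieldName] = meta.run(_iterator);
    }
    return metadata;
}

var result = runMetaPipeline([
    { fieldName: "names", run: function (_iterator) { return ["alice", "bob"]; } },
    { fieldName: "names", run: function (_iterator) {
        return _iterator.map(function (n) { return n.toUpperCase(); });
    } }
], {});
```

Here the second object only sees the first object's output, and only the second object's result remains under "names".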

 

XML file:

In the following example, the "contentMetadata" block uses javascript to convert the XML file data into metadata. Normally a "docMetadata"/"entities"/"associations" block would then be used to set the per-document titles, descriptions, entities etc.

...

Code Block
{
    "description": "wiy",
    "isPublic": true,
    "mediaType": "News",
    "tags": [
        "tag1"
    ],
    "title": "aaa xml test",
    "processingPipeline": [
        {
            "feed": {
                "extraUrls": [
                    {
                        "url": "http://www.w3schools.com/xml/simple.xml"
                    }
                ],
                "updateCycle_secs": 86400
            }
        },
        {
            "links": {
                "extraMeta": [
                    {
                        "context": "First",
                        "fieldName": "convert_to_json",
                        "flags": "o",
                        "script": "//breakfast_menu/food[*]",
                        "scriptlang": "xpath"
                    }
                ],
                "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                "scriptflags": "d"
            }
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "json",
                    "script": "var json = eval('('+text+')'); json;",
                    "scriptlang": "javascript"
                }
            ]
        }
    ]
}
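
For readability, the escaped "script" string in the "links" block above can be unpacked into plain JavaScript. The sketch below is a close transcription with two changes: it uses the function's "url" parameter where the original reads "_doc.url" directly, and it supplies a mock "_doc" with made-up values so that it can run outside the platform. The shape of the objects produced by the "o" flag is an assumption.

```javascript
// Splits the JSON array produced by the xpath step into one document per element.
function convert_to_docs(jsonarray, url) {
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;
        var doc = {};
        // Strip any query string, then make the URL unique per element.
        doc.url = url.replace(/[?].*/, "") + '#' + docIt;
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}

// Mock of the "_doc" variable provided by "scriptflags": "d" (made-up values).
var _doc = {
    url: "http://www.example.com/simple.xml?cache=1",
    metadata: {
        convert_to_json: [
            { name: "item one", content: "raw" },
            { name: "item two" }
        ]
    }
};

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
```

Each generated document gets the source URL with an index fragment appended, and its "title" and "description" are placeholders for later pipeline stages to fill in.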

 

Office Document:

In the following example, the "contentMetadata" block specifies a javascript that creates a metadata field called "email_meta". This field reports some metadata values for the "office" email type.

Code Block
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "email_meta",
                    "script": "var x=_metadata._FILE_METADATA_[0].metadata;x;",
                    "scriptlang": "javascript",
                    "flags": "m"
                }
            ]
        },
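
The one-line script above can be tried outside the platform by supplying a mock "_metadata" variable. The sketch below assumes only the shape implied by the script itself (a "_FILE_METADATA_" array whose first element has a "metadata" property). The sample values are made up, and "extractFileMetadata" with its guard against a missing entry is a hypothetical addition that is not present in the original script.

```javascript
// Mock of the variable that the "m" flag provides; values are made up.
var _metadata = {
    _FILE_METADATA_: [
        { metadata: { subject: ["Example subject"], Author: ["someone@example.com"] } }
    ]
};

// Same lookup as the script above, wrapped in a function with a guard
// (the function name and the guard are additions for this sketch).
function extractFileMetadata(metadata) {
    var files = metadata._FILE_METADATA_;
    if (!files || files.length === 0) {
        return null;
    }
    return files[0].metadata;
}

var x = extractFileMetadata(_metadata);
var missing = extractFileMetadata({});
```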

 

The sample output below shows the new metadata field "email_meta" created by the "contentMetadata" block.

Code Block
        ],
        "email_meta": [
            [
                {
                    "Creation-Date": [
                        "2001-07-09T18:33:32Z"
                    ],
                    "Message-To": [
                        "will.smith@enron.com"
                    ],
                    "Content-Type": [
                        "message/rfc822"
                    ],
                    "subject": [
                        "RE: Testing Preschedule workspace"
                    ],
                    "date": [
                        "2001-07-09T18:33:32Z"
                    ],
                    "Author": [
                        "cara.semperger@enron.com"
                    ],
                    "Message-From": [
                        "cara.semperger@enron.com"
                    ]

Regex

IN PROGRESS

XPath

 

Neither regex nor javascript is well suited to extracting fields from HTML and XML (particularly since the current JavaScript engine, the Java version of Rhino, does not support the DOM).

...