...

Parameter    Description    Note    Data Types
fieldName
Any string, the key for generated array in "doc.metadata"
  
scriptlang
javascript, regex or xpath.
script
The script that will generate the array.
flags

For javascript (defaults to "t" if none specified): with "t" the script receives the doc fullText ("text"), with "d" the script receives the entire doc ("_doc"), and with "m" the script receives the doc.metadata ("_metadata").
 

There are also a few flags that provide additional variables in the javascript:

  • "m" to get "_doc.metadata", written into the variable "_metadata"
    • (for example this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
  • "d" to get "_doc", written into the variable "_doc"
  • "t" to get the full text of the document, written into the variable "text"
    • If the "flags" field is not specified, "text" is populated by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
  • 'H': HTML-decode the resulting field (same as for regex/xpath, except that it only applies to single-value returns). To decode HTML in more complex expressions/arrays, you can always call org.apache.commons.lang.StringEscapeUtils.unescapeHtml('' + js_string).
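
The variable flags above can be sketched as a small javascript model. This is only an illustration of the rules as described here: "variablesForFlags" is a hypothetical helper (not part of the harvester), the sample document is invented, and the 'H' flag is not modelled.

```javascript
// Hypothetical model of how the "flags" string selects the variables
// exposed to a javascript script (illustration only, not harvester code).
function variablesForFlags(flags, doc) {
    // A missing "flags" field behaves like "t".
    var effective = (flags === undefined || flags === null) ? "t" : flags;
    var vars = {};
    if (effective.indexOf("t") >= 0) { vars.text = doc.fullText; }
    if (effective.indexOf("d") >= 0) { vars._doc = doc; }
    if (effective.indexOf("m") >= 0) { vars._metadata = doc.metadata; }
    return vars;
}

// Invented sample document.
var sampleDoc = {
    fullText: "Some full text",
    metadata: { organization: ["Example Org"] }
};

var byDefault = variablesForFlags(undefined, sampleDoc); // only "text"
var metaOnly = variablesForFlags("m", sampleDoc);        // only "_metadata", no "text"
var textAndMeta = variablesForFlags("tm", sampleDoc);    // "text" and "_metadata"
```

Note that "metaOnly" has no "text" entry: once "flags" is given, "t" has to be listed explicitly.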
  

For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array.

For reference, here is the complete set of flags for xpath (and regex, except for "o"):

  • 'H': will HTML-decode resulting fields (e.g. "&amp;" -> "&")
  • 'o': if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
  • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (e.g. stripping of fields)
  • 'D': described above 
  • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
  
replace 

Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc.

  
store 

Whether this field should be stored in the DB or discarded after the harvest processing.

  
index 

Whether this field should be full-text indexed or just stored in the DB.

  

Supported Script Languages

...

Info

If there are multiple "meta" objects with the same "fieldName", then they form a "pipeline", with each new object taking the old array, in the "_iterator" variable, and then overwriting the previous entry's result.
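
The pipeline behaviour can be pictured with a simplified javascript model. This is a sketch under stated assumptions, not harvester code: "runPipeline" is a hypothetical helper, each "script" is written as a function rather than the string used in a real source, and the previous result is passed in explicitly under the name "_iterator".

```javascript
// Simplified, hypothetical model of a "meta" pipeline (illustration only).
// Stages sharing a "fieldName" run in order; each later stage sees the
// previous result as "_iterator" and overwrites it with its own result.
function runPipeline(metaObjects, text) {
    var metadata = {};
    metaObjects.forEach(function (meta) {
        var previous = metadata[meta.fieldName]; // undefined for the first stage
        metadata[meta.fieldName] = meta.script(text, previous);
    });
    return metadata;
}

var pipeline = [
    {   // Stage 1: split the text into raw tokens.
        fieldName: "names",
        script: function (text, _iterator) { return text.split(","); }
    },
    {   // Stage 2: tidy up the array produced by stage 1.
        fieldName: "names",
        script: function (text, _iterator) {
            return _iterator.map(function (s) { return s.trim(); });
        }
    }
];

var result = runPipeline(pipeline, "alpha , beta,gamma ");
// result.names is ["alpha", "beta", "gamma"]
```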


XML

...


In the following example, the "contentMetadata" block uses javascript to convert the XML file data into metadata. Normally the "docMetadata"/"entities"/"associations" blocks would then be used to set the per-document titles, descriptions, entities etc.

...

Code Block
        ],
        "email_meta": [
            [
                {
                    "Creation-Date": [
                        "2001-07-09T18:33:32Z"
                    ],
                    "Message-To": [
                        "will.smith@enron.com"
                    ],
                    "Content-Type": [
                        "message/rfc822"
                    ],
                    "subject": [
                        "RE: Testing Preschedule workspace"
                    ],
                    "date": [
                        "2001-07-09T18:33:32Z"
                    ],
                    "Author": [
                        "cara.semperger@enron.com"
                    ],
                    "Message-From": [
                        "cara.semperger@enron.com"
                    ]

...

Regex

IN PROGRESS-requires a new example in the source gallery

 

Xpath

...

XML

In the code block below, regex scripts are used to create a metadata field called "organization". This field can then be referenced in scripts to create entities and associations.

Code Block
        },
        {
            "contentMetadata": [
                {
                    "fieldName": "organization",
                    "script": "believed the (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": "believed (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                    "scriptlang": "regex"
                },
                {
                    "fieldName": "organization",
                    "script": ".  ([^.]*?)(?: \\([^)]*\\))? claimed responsibility\\.$",
                    "scriptlang": "regex"
                }
            ]
        },
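
To see what these expressions capture, the sketch below applies them to an invented sentence in javascript. It assumes that capture group 1 is the value of interest and that javascript's regex engine behaves like the harvester's for these patterns. "firstGroups" and the sample text are hypothetical.

```javascript
// The three "script" values from the block above, after JSON unescaping
// ("\\(" in the JSON string is "\(" in the regex).
var patterns = [
    /believed the (.*?)(?: \([^)]*\))? (was|were) responsible/,
    /believed (.*?)(?: \([^)]*\))? (was|were) responsible/,
    /.  ([^.]*?)(?: \([^)]*\))? claimed responsibility\.$/
];

// Assumption: capture group 1 holds the organization name.
function firstGroups(text) {
    var found = [];
    patterns.forEach(function (re) {
        var m = re.exec(text);
        if (m) { found.push(m[1]); }
    });
    return found;
}

// Invented sample sentence.
var sample = "Authorities believed the Shining Path (SL) was responsible.";
var organizations = firstGroups(sample);
// organizations is ["Shining Path", "the Shining Path"]
```

The optional non-capturing group "(?: \([^)]*\))?" is what drops the parenthesised abbreviation from the captured name.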

In the code block below, an entity with dimension "Who" is created by referencing the metadata field "metadata.organization".

Code Block
        },
        {
            "entities": [
                {
                    "creationCriteriaScript": "$FUNC( isOrganizationSpecified(); )",
                    "dimension": "Who",
                    "disambiguated_name": "$metadata.organization",
                    "type": "Organization",
                    "useDocGeo": false
                },
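
The helper named in "creationCriteriaScript" is not shown here. A minimal sketch of what such a function might look like follows, assuming it reads the document from a "_doc" variable and simply checks that the "organization" array is non-empty. The body and the sample "_doc" are hypothetical.

```javascript
// Invented stand-in for the document object a script would see.
var _doc = { metadata: { organization: ["Shining Path"] } };

// Hypothetical body for the helper referenced in "creationCriteriaScript":
// returns true only if the "organization" metadata array has any entries.
function isOrganizationSpecified() {
    var orgs = (_doc && _doc.metadata) ? _doc.metadata.organization : null;
    return orgs != null && orgs.length > 0;
}
```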

 

 


 

...

Xpath

Neither regex nor javascript is well suited to extracting fields from HTML and XML (particularly since the current javascript engine, the Java version of Rhino, does not support the DOM).

The following example shows how Xpath can be used to extract embedded HTML from an XML document for the creation of entities and associations.

XML

The example XML data contains some severe weather incident reports.  For each report, we would like to extract the embedded HTML to create entities.

...