This toolkit element allows you to use regex, javascript, or xpath to create metadata objects (which can then be used by subsequent pipeline elements to generate entities or associations).

This page has been broken down into the following sections for ease of localization.

Format

Code Block
{
	"display": string,
	"contentMetadata": [
	{
		"fieldName":string, // Any string, the key for generated array in "doc.metadata"
		"scriptlang":string, // One of "javascript", "regex", "xpath", "stream"
		"script":string, // The script that will generate the array in "doc.metadata" (under fieldName)
		"flags":string, // Standard Java regex field (regex/xpath only), plus "H" to decode HTML, "D": will allow duplicate strings (by default they are de-duplicated), plus the following custom flags:
								// For javascript (defaults to "t" if none specified), "t" the script receives the doc fullText ("text"), "d" the script receives the entire doc (_doc), "m" the script receives the doc.metadata (_metadata)
								// For xpath/stream: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
		"replace":string, // Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc
		"store":Boolean, // Whether this field should be stored in the DB or discarded after the harvest processing
		"index":Boolean // Whether this field should be full-text indexed or just stored in the DB
	},
	...
	]
}
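
As a rough illustration of the format above, the following sketch checks a single "contentMetadata" entry before it is added to a source. The helper name checkMetadataSpec and the sample entry are hypothetical and not part of the platform; the checks only restate the field types listed in the format block.

```javascript
// Hypothetical helper (not part of the platform): sanity-checks one
// "contentMetadata" entry against the format described above.
const SCRIPT_LANGS = ["javascript", "regex", "xpath", "stream"];

function checkMetadataSpec(spec) {
    const problems = [];
    if (typeof spec.fieldName !== "string") {
        problems.push("fieldName must be a string");
    }
    if (!SCRIPT_LANGS.includes(spec.scriptlang)) {
        problems.push("scriptlang must be one of: " + SCRIPT_LANGS.join(", "));
    }
    // "stream" allows an empty or null script (the top object is parsed)
    if (spec.scriptlang !== "stream" && typeof spec.script !== "string") {
        problems.push("script must be a string");
    }
    for (const key of ["store", "index"]) {
        if (key in spec && typeof spec[key] !== "boolean") {
            problems.push(key + " must be a Boolean");
        }
    }
    return problems;
}

// Invented sample entry, used only to exercise the helper
const exampleSpec = {
    fieldName: "organization",
    scriptlang: "regex",
    script: "believed the (.*?) (was|were) responsible",
    replace: "$1",
    store: true,
    index: false
};
console.log(checkMetadataSpec(exampleSpec)); // an empty list means no problems found
```

An entry with an unknown "scriptlang" (for example "perl") would come back with one problem message rather than an empty list.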

 

Description

The following table describes the parameters of the content metadata configuration.

    Parameter    Description    Note    Data Types

    fieldName

    Any string, the key for the generated array in "doc.metadata".

    scriptlang

    One of javascript, regex, xpath, or stream.

    "stream" provides an efficient parser for splitting potentially large XML/JSON objects into many smaller objects, returning the chunks either as text or as metadata (if flags is set to "o").

    script

    script that will generate the array - eg the regular expression, xpath string, or JS script.

    Warning

    If using xpath: the document is converted to a valid HTML document, i.e. "html" and "body" nodes are the outer nodes. Therefore, even when processing raw XML, the script needs to start with either "/html/body" (for the root XML node) or "//" (to get any nodes matching the subsequent expression).

    If using stream: the script can be "" or null, in which case the top object is parsed. Otherwise a comma-separated list of (top-level) fields is provided, and each object with that name, or in an array of that name, is selected (just like with the File extractor).

    flags

    For javascript (defaults to "t" if none specified): with "t" the script receives the doc fullText ("text"), with "d" the script receives the entire doc ("_doc"), and with "m" the script receives the doc.metadata ("_metadata").
     

    There are also a few flags that provide additional variables in the javascript:

    • "m" to get "_doc.metadata", written into the variable "_metadata"
      • (for example this flag can be used to copy a subset of the fields from one fieldname to another, before using the "metadataFields" field in the "structuredAnalysis" object to delete the larger field)
    • "d" to get "_doc", written into the variable "_doc",
    • "t" to return the full text of the document into "text". 
      • If the "flags" field is not specified, this is returned by default. If the "flags" field is specified, then "t" must be included or the "text" variable is not populated.
    • 'H': HTML-decodes the resulting field (same as for regex/xpath, except that it only applies to single-value returns). To decode HTML in more complex expressions/arrays, you can always call org.apache.commons.lang.StringEscapeUtils.unescapeHtml('' + js_string)
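
The javascript flag rules above can be summarised in a few lines. This is only a sketch of the documented behaviour: variablesForFlags is a hypothetical name, and the real variable binding happens inside the harvester.

```javascript
// Illustrative only: mimics the "t"/"d"/"m" rules described above.
function variablesForFlags(flags) {
    // With no "flags" field, the script behaves as if "t" were set
    const effective = (flags === undefined || flags === null) ? "t" : flags;
    const names = [];
    if (effective.includes("t")) names.push("text");
    if (effective.includes("d")) names.push("_doc");
    if (effective.includes("m")) names.push("_metadata");
    return names;
}

console.log(variablesForFlags(undefined)); // -> ["text"]
console.log(variablesForFlags("m"));       // -> ["_metadata"] (no "text", since "t" was not included)
console.log(variablesForFlags("tdm"));     // -> ["text", "_doc", "_metadata"]
```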
      

    For xpath: "o": if the XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array.

    For reference, here is the complete set of flags for xpath (and regex, except for "o"):

    • 'H': will HTML-decode resulting fields. (Eg "&amp;" -> "&")
    • 'o': if the stream/XPath expression points to an HTML (/XML) object, then this object is converted to JSON and stored as an object in the corresponding metadata field array. (Can also be done via the deprecated "groupNum":-1)
    • 'x': if the XPath expression points to an HTML (/XML) object, then the XML of the object is displayed with no decoding (eg stripping of fields)
    • 'D': allows duplicate strings (by default they are de-duplicated)
    • 'c': if set then fields with the same name are chained together (otherwise they will all append their results to the field within metadata)
      
    replace

    Replacement string for regex/xpath+regex matches, can include capturing groups as $1 etc.
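
As a minimal sketch of what a "replace" string with capturing groups does, the snippet below uses JavaScript's own String.prototype.replace, which understands the same $1, $2 references. The platform itself evaluates regex entries as Java regexes, and exactly how it applies "replace" to each match is an assumption here; the pattern and sample text are invented for illustration.

```javascript
// Stand-in for a regex entry with "replace": "$1 ($2)".
const pattern = /city: (\w+), state: (\w+)/;
const replaceSpec = "$1 ($2)";
const sample = "city: Norfolk, state: VA";

// Take the matched text and rewrite it using the capturing groups
const m = sample.match(pattern);
const value = m ? m[0].replace(pattern, replaceSpec) : null;
console.log(value); // Norfolk (VA)
```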

    store

    Whether this field should be stored in the DB or discarded after the harvest processing.

    index

    Whether this field should be full-text indexed or just stored in the DB.

    Examples

    Javascript

    For power users, metadata can be generated from the content using javascript. This gives a huge amount of flexibility to apply site/source-specific knowledge to pull out metadata that can be turned into entities or associations.

    Info

    If there are multiple "meta" objects with the same "fieldName", then they form a "pipeline", with each new object taking the old array, in the "_iterator" variable, and then overwriting the previous entry's result.
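
The chaining described above can be pictured with the following simulation. It is a sketch under stated assumptions: runStages and the use of new Function are stand-ins for the harvester's script engine, the stage scripts use "return" where real scripts end with a bare expression, and whether chaining needs the 'c' flag is left to the flags description.

```javascript
// Illustrative simulation of same-"fieldName" chaining.
function runStages(stages, text) {
    let current; // result of the previous stage, exposed as _iterator
    for (const stage of stages) {
        const script = new Function("text", "_iterator", stage.script);
        const result = script(text, current);
        // Each stage's result overwrites the previous entry's result
        current = Array.isArray(result) ? result : [result];
    }
    return current;
}

// Two invented stages sharing the same fieldName
const stages = [
    { fieldName: "words", script: "return text.split(' ');" },
    { fieldName: "words", script: "return _iterator.filter(function (w) { return w.length > 3; });" }
];
console.log(runStages(stages, "a quick brown fox")); // -> ["quick", "brown"]
```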

    XML

    In the following example, the "contentMetadata" block uses javascript to convert the XML file data into metadata. Normally a "docMetadata"/"entities"/"associations" block would then be used to set the per-document titles, descriptions, entities, etc.

    Code Block
    {
        "description": "wiy",
        "isPublic": true,
        "mediaType": "News",
        "tags": [
            "tag1"
        ],
        "title": "aaa xml test",
        "processingPipeline": [
            {
                "feed": {
                    "extraUrls": [
                        {
                            "url": "http://www.w3schools.com/xml/simple.xml"
                        }
                    ],
                    "updateCycle_secs": 86400
                }
            },
            {
                "links": {
                    "extraMeta": [
                        {
                            "context": "First",
                            "fieldName": "convert_to_json",
                            "flags": "o",
                            "script": "//breakfast_menu/food[*]",
                            "scriptlang": "xpath"
                        }
                    ],
                    "script": "function convert_to_docs(jsonarray, url)\n{\n    var docs = [];\n    for (var docIt in jsonarray) {\n        var predoc = jsonarray[docIt];\n        delete predoc.content;\n        var doc = {};\n        doc.url = _doc.url.replace(/[?].*/,\"\") + '#' + docIt;\n        doc.fullText = predoc;\n        doc.title = \"TBD\";\n        doc.description = \"TBD\";\n        docs.push(doc);\n    }\n    return docs;\n}\nvar docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);\ndocs;",
                    "scriptflags": "d"
                }
            },
            {
                "contentMetadata": [
                    {
                        "fieldName": "json",
                        "script": "var json = eval('('+text+')'); json;",
                        "scriptlang": "javascript"
                    }
                ]
            }
        ]
    }
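
For readability, here is the escaped "links" script from the source above written out as plain javascript, with a mock "_doc" so that it can be run outside the harvester. The mock object and its food entries are invented sample data, and the function uses its "url" argument where the embedded script reads "_doc.url" directly (the same value is passed in).

```javascript
// Mock of the "_doc" variable the harvester would supply (scriptflags "d").
var _doc = {
    url: "http://www.w3schools.com/xml/simple.xml?cache=1",
    metadata: {
        convert_to_json: [
            { name: "Belgian Waffles", price: "$5.95", content: "raw" },
            { name: "French Toast", price: "$4.50", content: "raw" }
        ]
    }
};

function convert_to_docs(jsonarray, url) {
    var docs = [];
    for (var docIt in jsonarray) {
        var predoc = jsonarray[docIt];
        delete predoc.content;            // drop the "content" field
        var doc = {};
        // Strip any query string, then make the URL unique per record
        doc.url = url.replace(/[?].*/, "") + '#' + docIt;
        doc.fullText = predoc;
        doc.title = "TBD";
        doc.description = "TBD";
        docs.push(doc);
    }
    return docs;
}

var docs = convert_to_docs(_doc.metadata['convert_to_json'], _doc.url);
console.log(docs.length, docs[1].url);
```

Each XML record becomes its own document, with a URL of the form "original URL" + "#" + index.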

    Office Document

    In the following example, the contentMetadata block has been configured with a javascript script that creates a metadata entity called "email_meta", which reports some metadata values for the "office" email type.

    Code Block
     },        {
                "contentMetadata": [
                    {
                        "fieldName": "email_meta",
                        "script": "var x=_metadata._FILE_METADATA_[0].metadata;x;",
                        "scriptlang": "javascript",
                        "flags": "m"
                    }
                ]
            },

     

    In the sample output, we can see the new metadata entity "email_meta" which has been created by the contentMetadata block.

    Code Block
     ],        "email_meta": [
                [
                    {
                        "Creation-Date": [
                            "2001-07-09T18:33:32Z"
                        ],
                        "Message-To": [
                            "will.smith@enron.com"
                        ],
                        "Content-Type": [
                            "message/rfc822"
                        ],
                        "subject": [
                            "RE: Testing Preschedule workspace"
                        ],
                        "date": [
                            "2001-07-09T18:33:32Z"
                        ],
                        "Author": [
                            "cara.semperger@enron.com"
                        ],
                        "Message-From": [
                            "cara.semperger@enron.com"
                        ]

    ...

    Regex

    XML

    In the code block below, regex is used in a script which will create a metadata field called "organization."  Organization can then be referenced in scripts to create entities and associations.

    Code Block
     },        {
                "contentMetadata": [
                    {
                        "fieldName": "organization",
                        "script": "believed the (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                        "scriptlang": "regex"
                    },
                    {
                        "fieldName": "organization",
                        "script": "believed (.*?)(?: \\([^)]*\\))? (was|were) responsible",
                        "scriptlang": "regex"
                    },
                    {
                        "fieldName": "organization",
                        "script": ".  ([^.]*?)(?: \\([^)]*\\))? claimed responsibility\\.$",
                        "scriptlang": "regex"
                    }
                ]
            },
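
To see what the three patterns above capture, they can be tried out in plain javascript. This is a simplified sketch: the platform evaluates them as Java regexes (these particular patterns use only syntax common to both), the sample sentences are invented, and taking the first capturing group of the first match is an assumption about how the metadata value is chosen.

```javascript
// The three "organization" patterns, unescaped from their JSON strings.
var patterns = [
    /believed the (.*?)(?: \([^)]*\))? (was|were) responsible/,
    /believed (.*?)(?: \([^)]*\))? (was|were) responsible/,
    /.  ([^.]*?)(?: \([^)]*\))? claimed responsibility\.$/
];

// Hypothetical helper: collects group 1 of each matching pattern,
// de-duplicating values (the documented default when "D" is not set).
function extractAll(texts) {
    var out = [];
    texts.forEach(function (t) {
        patterns.forEach(function (p) {
            var m = p.exec(t);
            if (m && out.indexOf(m[1]) === -1) {
                out.push(m[1]);
            }
        });
    });
    return out;
}

var organization = extractAll([
    "Police believed the Red Brigade (RB) was responsible",
    "An attack occurred.  The Blue Group claimed responsibility."
]);
console.log(organization);
```

Note that the first two patterns both match the first sentence, so the field ends up with both "Red Brigade" and "the Red Brigade"; downstream entity scripts may need to account for that.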

    In the code block below, an entity "Who" is created by referencing the metadata field "metadata.organization."

    Code Block
    },        {
                "entities": [
                    {
                        "creationCriteriaScript": "$FUNC( isOrganizationSpecified(); )",
                        "dimension": "Who",
                        "disambiguated_name": "$metadata.organization",
                        "type": "Organization",
                        "useDocGeo": false
                    },

     

    ...

    Xpath

    Neither regex nor javascript is well suited to extracting fields from HTML and XML (particularly since the current Javascript engine, the Java version of Rhino, does not support DOM).

    The following example shows how Xpath can be used to extract embedded HTML from an XML document for the creation of entities and associations.

    XML

    The example XML data contains some severe weather incident reports.  For each report, we would like to extract the embedded HTML to create entities.

    Code Block
             <![CDATA[
    <table>
    <tr><td>ztime: </td><td>2012-10-29T00:32:00Z</td></tr>
    <tr><td>id: </td><td>1553661</td></tr>
    <tr><td>event: </td><td>NON-TSTM WND GST</td></tr>
    <tr><td>magnitude: </td><td>53.0</td></tr>
    <tr><td>city: </td><td>NORFOLK NAS</td></tr>
    <tr><td>county: </td><td>CITY OF NORFOLK</td></tr>
    <tr><td>state: </td><td>VA</td></tr>
    <tr><td>source: </td><td>ASOS</td></tr>
    </table>
    <iframe src="http://www.ncdc.noaa.gov/swdiws/csv/plsr/id=1553661" />          ]]>
            </description>
          <Point>
                    <coordinates>-76.2800,36.9300,0        </coordinates>
          </Point>
            <TimeSpan>
              <begin>          2012-10-29T00:32:00Z          </begin>
              <end>          2012-10-29T02:56:00Z          </end>
            </TimeSpan>
          </Placemark>
          <Placemark>
            <styleUrl>#style_0</styleUrl>
            <description>
              <![CDATA[
    <table>
    <tr><td>ztime: </td><td>2012-10-29T00:32:00Z</td></tr>
    <tr><td>id: </td><td>1550634</td></tr>
    <tr><td>event: </td><td>NON-TSTM WND GST</td></tr>
    <tr><td>magnitude: </td><td>53.0</td></tr>
    <tr><td>city: </td><td>NORFOLK NAS</td></tr>
    <tr><td>county: </td><td>CITY OF NORFOLK</td></tr>
    <tr><td>state: </td><td>VA</td></tr>
    <tr><td>source: </td><td>ASOS</td></tr>
    </table>
    <iframe src="http://www.ncdc.noaa.gov/swdiws/csv/plsr/id=1550634" />          ]]>
            </description>
          <Point>
                    <coordinates>-76.2800,36.9300,0        </coordinates>
          </Point>
            <TimeSpan>
              <begin>          2012-10-29T00:32:00Z          </begin>
              <end>          2012-10-29T02:56:00Z          </end>
            </TimeSpan>
          </Placemark>
          <Placemark>
            <styleUrl>#style_0</styleUrl>
            <description>

     

    Source

    In the example source below, the contentMetadata block is configured to create two metadata fields: "url" and "info".

    Both "url" and "info" will be JSON objects, stored in the corresponding metadata field arrays.

    "url" and "info" can then be used as variables in scripting for entities and associations. For example, _doc.metadata.info and _doc.metadata.url can be included in scripts using the $SCRIPT convention, in order to create entities such as "Weather", "City", and "County".

     
    Code Block
     },
            {
                "contentMetadata": [
                    {
                        "fieldName": "url",
                        "flags": "o",
                        "index": false,
                        "script": "//iframe",
                        "scriptlang": "xpath",
                        "store": true
                    },
                    {
                        "fieldName": "info",
                        "flags": "o",
                        "index": false,
                        "script": "//table/*",
                        "scriptlang": "xpath",
                        "store": true
                    }
                ]
            },

     

    Example Entities:

    Code Block
     "display": "",
            "entities": [
                {
                    "creationCriteriaScript": "$SCRIPT( if(_doc.metadata.info[0].tbody.tr[2].td[1] == null) return false; else return true; )",
                    "dimension": "What",
                    "disambiguated_name": "$SCRIPT( return _doc.metadata.info[0].tbody.tr[2].td[1];)",
                    "type": "Weather",
                    "useDocGeo": true
                },
                {
                    "creationCriteriaScript": "$SCRIPT( if(_doc.metadata.info[0].tbody.tr[4].td[1] == null) return false; else return true; )",
                    "dimension": "What",
                    "disambiguated_name": "$SCRIPT( return _doc.metadata.info[0].tbody.tr[4].td[1]; )",
                    "type": "City",
                    "useDocGeo": false
                },
                {
                    "creationCriteriaScript": "$SCRIPT( if(_doc.metadata.info[0].tbody.tr[5].td[1] == null) return false; else return true; )",
                    "dimension": "What",
                    "disambiguated_name": "$SCRIPT( return _doc.metadata.info[0].tbody.tr[5].td[1];)",
                    "type": "County",
                    "useDocGeo": false
                }
            ]
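
The paths used in the entity scripts can be tried against a mock of the converted table. The JSON shape below is an assumption inferred from those paths (info[0].tbody.tr[n].td[1]) and filled with the first report from the sample XML; the real converter's output may differ, and the helper name cell is hypothetical.

```javascript
// Assumed JSON shape for the converted <table> (flag "o").
var _doc = {
    metadata: {
        info: [{
            tbody: {
                tr: [
                    { td: ["ztime: ", "2012-10-29T00:32:00Z"] },
                    { td: ["id: ", "1553661"] },
                    { td: ["event: ", "NON-TSTM WND GST"] },
                    { td: ["magnitude: ", "53.0"] },
                    { td: ["city: ", "NORFOLK NAS"] },
                    { td: ["county: ", "CITY OF NORFOLK"] },
                    { td: ["state: ", "VA"] },
                    { td: ["source: ", "ASOS"] }
                ]
            }
        }]
    }
};

// Hypothetical helper mirroring creationCriteriaScript (null check)
// and disambiguated_name (the second cell of the given row).
function cell(rowIndex) {
    var value = _doc.metadata.info[0].tbody.tr[rowIndex].td[1];
    return value == null ? null : value;
}

var entities = [
    { type: "Weather", name: cell(2) },
    { type: "City", name: cell(4) },
    { type: "County", name: cell(5) }
].filter(function (e) { return e.name !== null; });
console.log(entities);
```

Row 2 is the "event" row, row 4 the "city" row, and row 5 the "county" row, which is why the three entities use tr[2], tr[4], and tr[5].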

