Metadata

Document "metadata" contains source-specific information in a source-specific format, which can then be used in source-aware queries, aggregations and visualizations. It is a compromise: it takes advantage of Community Edition's "flat" data model (which allows entities and documents from diverse sources to be meaningfully compared) without losing the original structure (which likely would not have been there if it wasn't useful in the first place!).

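Metadata can be generated in one of the following ways (each is described later in this section):

From RSS feeds
From databases
From "Office" documents (PDF, doc, docx, ppt, pptx)
From XML
From JSON
From single line records
Using "Document metadata" in the source pipeline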

Metadata can be used in a few different ways:

To build entities and events, using the entities and associations elements of the source pipeline.
To perform source-specific queries and aggregations.
To use in domain-specific widgets, visualizations or code outside of the Community Edition platform.

Field Guide

The basic format of the "metadata" sub-object is a set of field/value pairs, where the value is always an array (often of size 1) of either atomic types or objects (arbitrarily nested).

Code Block

language	javascript
title	Generic metadata example

{
   // (rest of the document object)
   "metadata": {
      "field1__double": [ 1.0 ], // (single atomic type)
      "field2": [ "1", "2", "3", "4" ], // (array of atomic types)
      "field3": [ { "type": "simple" } ], // (single simple object)
      "field4": [ { "type": { "nested": true } } ], // (single nested object)
      "field5": [ { "type": "simple", "index": 1 }, { "type": { "nested": true }, "index": 2 } ], // (array of objects)
      // etc
   }
}
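
Because every value is an array, code that consumes metadata has to handle the "often of size 1" case explicitly. The following is a minimal sketch of two accessors that do so. The function names (metadataValues, firstMetadataValue) are illustrative assumptions and are not part of the platform API.

```javascript
// Illustrative helpers (assumed names, not part of the platform API).

// Returns every value stored under a metadata field, or [] if the field is absent.
function metadataValues(doc, field) {
  const values = doc && doc.metadata ? doc.metadata[field] : undefined;
  return Array.isArray(values) ? values : [];
}

// Returns the first value of a metadata field, or undefined if there is none.
function firstMetadataValue(doc, field) {
  const values = metadataValues(doc, field);
  return values.length > 0 ? values[0] : undefined;
}

// A cut-down version of the generic example above.
const exampleDoc = {
  metadata: {
    field1__double: [1.0],
    field2: ["1", "2", "3", "4"],
    field4: [{ type: { nested: true } }]
  }
};

console.log(firstMetadataValue(exampleDoc, "field1__double"));     // 1
console.log(metadataValues(exampleDoc, "field2").length);          // 4
console.log(firstMetadataValue(exampleDoc, "field4").type.nested); // true
console.log(metadataValues(exampleDoc, "missing").length);         // 0
```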

Info

There is one important subtlety of which to be aware: in the real-time (Lucene) index used for queries, metadata fields across all sources must share the same type. Since metadata field names can be specified on a per source basis, the following steps are taken:

Only two types are normally supported: objects (eg see XML metadata below) and strings (ie any "atomic" field).
In the index only, and therefore only applying to free text searches (and in the future custom aggregations), object field names have "__obj" appended. So to search on the "atomic_field" value of the "example" object in a free text query, you would use the Lucene syntax "example__obj.atomic_field: <value>".
Some special field name patterns are supported and if used allow different types to be stored:
- *__double, *__long, *__bool: force the values to the specified types.
- *__dateISO: errors unless the value is a date in ISO format; if it is, stores as a date.
- *__dateTimeJava: errors unless the value is a date in format "MM/dd/yy hh:mm a||MM/dd/yy||MMM dd, yyyy hh:mm:ss a||MMM dd, yyyy"; if it is, stores as a date.
- *__dateYYYMMDD: errors unless the value is a date in format "yyyyMMdd"; if it is, stores as a date.
- *__dateRFC822: errors unless the value is a date in format "EEE, dd MMM yyyy HH:mm:ss Z"; if it is, stores as a date.
- *__dateGMT: errors unless the value is a date in format "dd MMM yyyy HH:mm:ss 'GMT'"; if it is, stores as a date.
- *__discard: does not add the field to the index.
- *__term: indexes the field as a phrase (ie searching for "example" in "field__term":"this is an example" would return no matches; a match would only be returned if the entire field matched).
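
The naming rules above can be summarized in a small lookup. This is only a sketch of the conventions as listed, not the platform's indexing code: the function names (indexTreatment, luceneFieldName) and the treatment labels are assumptions made for illustration.

```javascript
// Field-name suffixes from the list above, mapped to an illustrative label.
const SUFFIX_TREATMENTS = [
  ["__double", "double"],
  ["__long", "long"],
  ["__bool", "boolean"],
  ["__dateISO", "date"],
  ["__dateTimeJava", "date"],
  ["__dateYYYMMDD", "date"],
  ["__dateRFC822", "date"],
  ["__dateGMT", "date"],
  ["__discard", "not indexed"],
  ["__term", "phrase"]
];

// Describes how a field would be treated in the index, given its name and one sample value.
function indexTreatment(fieldName, sampleValue) {
  for (const [suffix, treatment] of SUFFIX_TREATMENTS) {
    if (fieldName.endsWith(suffix)) {
      return treatment;
    }
  }
  // No special suffix: only objects and strings are supported.
  return sampleValue !== null && typeof sampleValue === "object" ? "object" : "string";
}

// Builds the field name to use in a free text (Lucene syntax) query on a nested value.
function luceneFieldName(objectField, nestedField) {
  return objectField + "__obj." + nestedField;
}

console.log(indexTreatment("field1__double", 1.0));       // double
console.log(indexTreatment("field2", "1"));                // string
console.log(indexTreatment("field3", { type: "simple" })); // object
console.log(luceneFieldName("example", "atomic_field"));   // example__obj.atomic_field
```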

Info

One other field-naming issue to be aware of is that the following characters are encoded in field names:

The "." character (which is forbidden in MongoDB field names) is encoded to "%2e"
The "%" character is encoded to "%25"
- (which guarantees that URLDecoder.decode(encoded_metadata_field_name) is the original pre-encoded name)
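
The two encodings above can be sketched as follows. The function names are illustrative assumptions, and decodeURIComponent is used here as a JavaScript stand-in for Java's URLDecoder.decode (the two differ in other respects, eg URLDecoder also turns "+" into a space). Note that "%" has to be encoded before ".", otherwise the "%" introduced by "%2e" would itself be re-encoded.

```javascript
// Encodes a metadata field name using the two rules above (assumed helper name).
function encodeMetadataFieldName(name) {
  return name
    .replace(/%/g, "%25")   // "%" first, so that the "%" in "%2e" is not double-encoded
    .replace(/\./g, "%2e"); // "." is forbidden in MongoDB field names
}

// Recovers the original name; every "%" in an encoded name starts a valid escape.
function decodeMetadataFieldName(encoded) {
  return decodeURIComponent(encoded);
}

console.log(encodeMetadataFieldName("geo.lat"));     // geo%2elat
console.log(encodeMetadataFieldName("growth%"));     // growth%25
console.log(decodeMetadataFieldName("geo%2elat"));   // geo.lat
```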

The remainder of this section describes the different ways in which the metadata can currently be constructed from the source data.

Metadata Generated from RSS

Any source-specific metadata in RSS is added under the "_FEED_METADATA" object. For example, the following Twitter-specific RSS item:

Code Block

language	html/xml

  <item>
    <title>(TITLE)</title>
    <description>(DESCRIPTION)</description>
    <pubDate>Thu, 26 Apr 2012 20:17:31 +0000</pubDate>
    <link>(URL)</link>
    <twitter:source>&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;</twitter:source>
  </item>

Is rendered like this:

Code Block

language	javascript

{
	"metadata": {		
		"_FEED_METADATA": [{
			"twitter:source": "&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;"
		}]
	}
}

Info
Object/array nesting inside XML is supported and mapped into JSON as you'd expect.

Metadata Generated from Databases

Data in (RDBMS) databases are organized into tables, eg:

rowA	rowB	...	rowN
valA1	valB1	...	valN1
valA2	valB2	...	valN2
...	...	...	...
valAm	valBm	...	valNm

The individual entries in the database are atomic values (integers, strings, floating point numbers), although they can also be arrays (this is rarely used).

In Community Edition, each row generates a separate document (ie record), as described in the Database extractor. Within these documents, the column names are the metadata fields, and the values are the entries.

If the entries are arrays then they generate multi-value arrays in the JSON; otherwise they generate single-value arrays as described above.

Code Block

language	javascript
title	Generic database metadata

// Document generated from row "n"
{
   // Rest of document, then
   "metadata": {
      "rowA": [ valAn ],
      "rowB": [ valBn ],
      //...
      "rowF": [ valFn_1, valFn_2, ..., valFn_q ], // Example array entry in database
      //...
      "rowN": [ valNn ]
   }
}
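
The row-to-metadata mapping described above can be sketched as a single function. This is an illustration of the mapping only, not the Database extractor's code; the function name rowToMetadata and the representation of a row as a plain object keyed by column name are assumptions.

```javascript
// Maps one database row (column name -> entry) onto a "metadata" object.
// Array entries become multi-value arrays; everything else becomes a single-value array.
function rowToMetadata(row) {
  const metadata = {};
  for (const [column, entry] of Object.entries(row)) {
    metadata[column] = Array.isArray(entry) ? entry : [entry];
  }
  return metadata;
}

// Hypothetical row: two atomic entries and one array entry.
const exampleRow = { rowA: 1, rowB: "text", rowF: [10, 20, 30] };

console.log(JSON.stringify(rowToMetadata(exampleRow)));
// {"rowA":[1],"rowB":["text"],"rowF":[10,20,30]}
```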

As a real example, the following database record:

nid	ccn	reportdatetime	shift	offense	method	blocksiteaddress	latitude	longitude	city	state	ward	anc	smd	district	psa
944913	11001478	"Jan 4, 2011 12:00:00 AM"	"UNK"	"BURGLARY"	2	"600 B/O 8TH ST NE"	"38.89812067433020"	"-76.99496375343240"	"WASHINGTON"	"DC"	6	"6A"	"6A02"	"FIRST"	102

Generates:

Code Block

language	javascript
title	Real metadata object generated from database entry

{
        "metadata" : {
                "nid" : [
                        944913
                ],
                "ccn" : [
                        11001478
                ],
                "reportdatetime" : [
                        "Jan 4, 2011 12:00:00 AM"
                ],
                "shift" : [
                        "UNK"
                ],
                "offense" : [
                        "BURGLARY"
                ],
                "method" : [
                        "2"
                ],
                "blocksiteaddress" : [
                        "600 B/O 8TH ST NE"
                ],
                "latitude" : [
                        "38.89812067433020"
                ],
                "longitude" : [
                        "-76.99496375343240"
                ],
                "city" : [
                        "WASHINGTON"
                ],
                "state" : [
                        "DC"
                ],
                "ward" : [
                        6
                ],
                "anc" : [
                        "6A"
                ],
                "smd" : [
                        "6A02"
                ],
                "district" : [
                        "FIRST"
                ],
                "psa" : [
                        102
                ]
        }
}

Metadata Generated from "Documents" (PDF, doc, docx, ppt, pptx)

"Office" documents can generate various metadata fields. They are contained in an object called "_FILE_METADATA". Examples include:

"title"
"Author"
"Creation-Date"
"Original-Date"
"Last-Modified"
"latitude"
"longitude"

See the Tika documentation (Tika is the underlying technology) for a more complete list.

Metadata Generated from XML

XML documents can be very complex, containing arbitrary levels of nesting. Also, without the specification for a given XML format, it is not possible to know what type each field is.

Aside from this field type issue, XML documents can always be converted into JSON objects, with repeated fields turned into arrays. The typing issue is worked around by treating everything as strings.

For example,

Code Block

language	xml

<root>
   <value1>1</value1>
   <object>
      <nested1>string</nested1>
      <nested1>-2</nested1>
      <nested_object>
           <nested11>1.0</nested11>
      </nested_object>
   </object>
</root>

Can be converted into the following JSON object:

Code Block

language	javascript

{
   "value1": "1",
   "object": {
      "nested1": [ "string", "-2" ],
      "nested_object": {
         "nested11": "1.0"
      }
   }
}

And then it is clear how this can be mapped into the document metadata:

Code Block

language	javascript

{
   // (Rest of document)
   "metadata": {
      "value1": [ "1" ],
      "object": [ {
         "nested1": [ "string", "-2" ],
         "nested_object": {
            "nested11": "1.0"
         }
      } ]
   }
}
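
The two steps above (XML to JSON, then JSON to metadata) can be sketched as follows. This is not the platform's XML extractor: the input is assumed to be an already-parsed tree of nodes shaped like { name, text, children }, and that shape and the function names are hypothetical choices made for illustration.

```javascript
// Converts a parsed XML node into a JSON value: leaves become strings,
// and repeated child elements are collected into arrays.
function xmlNodeToJson(node) {
  if (!node.children || node.children.length === 0) {
    return String(node.text); // typing issue worked around: everything is a string
  }
  const out = {};
  for (const child of node.children) {
    const value = xmlNodeToJson(child);
    if (!Object.prototype.hasOwnProperty.call(out, child.name)) {
      out[child.name] = value;
    } else if (Array.isArray(out[child.name])) {
      out[child.name].push(value);
    } else {
      out[child.name] = [out[child.name], value];
    }
  }
  return out;
}

// Wraps each top-level JSON value in an array, as required by the metadata format.
function jsonToMetadata(json) {
  const metadata = {};
  for (const [field, value] of Object.entries(json)) {
    metadata[field] = Array.isArray(value) ? value : [value];
  }
  return metadata;
}

// Hand-built tree equivalent to the <root> example above.
const exampleRoot = {
  name: "root",
  children: [
    { name: "value1", text: "1" },
    { name: "object", children: [
      { name: "nested1", text: "string" },
      { name: "nested1", text: "-2" },
      { name: "nested_object", children: [ { name: "nested11", text: "1.0" } ] }
    ] }
  ]
};

const exampleMetadata = jsonToMetadata(xmlNodeToJson(exampleRoot));
console.log(JSON.stringify(exampleMetadata));
// {"value1":["1"],"object":[{"nested1":["string","-2"],"nested_object":{"nested11":"1.0"}}]}
```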

There is one further subtlety worth noting. Often in XML documents, lists are nested, eg:

Code Block

language	xml

<root>
   <elementList>
      <element>value1</element>
      <element><nested>value2</nested></element>
      <element>value3</element>
   </elementList>
</root>

This would get converted into the following metadata object:

Code Block

language	javascript

{
   // (Rest of document)
   "metadata": {
      "elementList": [ {
         "element": [ "value1", { "nested": "value2" }, "value3" ]
      } ]
   }
}

Aside from the unnecessary extra level of nesting, the double array is ungainly. The XML extraction configuration allows specified XML elements to be ignored, eg nested lists such as "elementList" in the above example, resulting in the much more palatable:

Code Block

language	javascript

{
   // (Rest of document)
   "metadata": {
      "element": [ "value1", { "nested": "value2" }, "value3" ]
   }
}
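
The effect of ignoring an element can be illustrated by a post-processing step that hoists the children of the ignored element up one level. This is only a sketch of the resulting transformation: in the platform the behavior is set in the XML extraction configuration, and the function name hoistIgnoredElements is an assumption.

```javascript
// Returns a copy of "metadata" in which each ignored field is replaced by its children.
function hoistIgnoredElements(metadata, ignored) {
  const out = {};
  for (const [field, values] of Object.entries(metadata)) {
    if (!ignored.includes(field)) {
      out[field] = values;
      continue;
    }
    for (const value of values) {
      if (value === null || typeof value !== "object") {
        continue; // an atomic value directly under an ignored element has nothing to hoist
      }
      for (const [childField, childValue] of Object.entries(value)) {
        const childValues = Array.isArray(childValue) ? childValue : [childValue];
        out[childField] = (out[childField] || []).concat(childValues);
      }
    }
  }
  return out;
}

const nestedListMetadata = {
  elementList: [ { element: [ "value1", { nested: "value2" }, "value3" ] } ]
};

console.log(JSON.stringify(hoistIgnoredElements(nestedListMetadata, ["elementList"])));
// {"element":["value1",{"nested":"value2"},"value3"]}
```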

Here is a real-world example of the metadata generated by a complex XML object:

Code Block

language	javascript
title	Metadata generated from an XML object

{
        "metadata" : {
                "summary" : [
                        "On 7 May 2004, in Nicosia, Cyprus, three small bombs exploded at the facility of the Cyprus Media Group, causing only minor damage and no casualties.  No group claimed responsibility."
                ],
                "perpetrator" : [
                        {
                                "nationality" : "Unknown",
                                "characteristic" : "Unknown"
                        }
                ],
                "location" : [
                        {
                                "region" : "Europe",
                                "citystateprovince" : {
                                        "stateprovince" : "Nicosia",
                                        "city" : "Nicosia"
                                },
                                "country" : "Cyprus"
                        }
                ],
                "subject" : [
                        "Newspaper offices damaged in bombing in Nicosia, Cyprus"
                ],
                "icn" : [
                        "200460104"
                ],
                "multipledays" : [
                        "No"
                ],
                "incidentdate" : [
                        "05/07/2004"
                ],
                "ied" : [
                        "No"
                ],
                "facility" : [
                        {
                                "indicator" : "Targeted",
                                "nationality" : "Cyprus",
                                "targetedcharacteristic" : "Unknown",
                                "definingcharacteristic" : "Unknown",
                                "damage" : "Light",
                                "combatant" : "No",
                                "quantity" : "1",
                                "facilitytype" : "Business"
                        }
                ],
                "approximatedate" : [
                        "No"
                ],
                "assassination" : [
                        "No"
                ],
                "weapontype" : [
                        "Explosive"
                ],
                "eventtype" : [
                        "Bombing"
                ],
                "suicide" : [
                        "No"
                ]
        }
}

Metadata Generated from JSON

The JSON element is written to "metadata.json[0]".

Metadata Generated from Single Line Records

The fields from the single line records are written into "metadata.csv[0]". For more information, see File extractor.

Metadata Generated Using "Document Metadata"

You can use the source pipeline to specify regular expressions ("regex") or JavaScript that set the document metadata fields (eg title, description, publishedDate) by extracting strings from the text of documents.

For more information, see Document metadata and Content metadata.

Panel

Related Documents:

Source Pipeline Documentation

Tika
