Metadata JSON format

Metadata

Document "metadata" contains source-specific information in a source-specific format, which can then be used in source-aware queries, aggregations and visualizations. It is a compromise: it takes advantage of Community Edition's "flat" data model (which allows entities and documents from diverse sources to be meaningfully compared) without losing the original structure (which likely would not have been there if it were not useful in the first place).

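Metadata can be generated from any of the following, each described later in this section: RSS feeds, databases, "Office" documents (PDF, doc, docx, ppt, pptx), XML, JSON, single line records, and the "Document Metadata" settings of the source pipeline.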

Field Guide

The basic format of the "metadata" sub-object is a set of field/value pairs, where the value is always an array (often of size 1) of either atomic types or objects (arbitrarily nested).

Generic metadata example
{
   // (rest of the document object)
   "metadata": {
      "field1__double": [ 1.0 ], // (single atomic type)
      "field2": [ "1", "2", "3", "4" ], // (array of atomic types)
      "field3": [ { "type": "simple" } ], // (single simple object)
      "field4": [ { "type": { "nested": true } } ], // (single nested object)
      "field5": [ { "type": "simple", "index": 1 }, { "type": { "nested": true }, "index": 2 } ], // (array of objects)
      // etc
   }
}
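
As a minimal sketch of the shape described above (in Python, with a hypothetical helper name `to_metadata` that is not part of the platform's API), the following wraps each field value in an array unless it already is one:

```python
def to_metadata(fields):
    """Wrap every value in a list, leaving existing lists untouched."""
    return {
        name: value if isinstance(value, list) else [value]
        for name, value in fields.items()
    }

# Field names mirror the generic example above; values are illustrative.
doc = {
    "metadata": to_metadata({
        "field1__double": 1.0,                # single atomic value
        "field2": ["1", "2", "3", "4"],       # already an array
        "field3": {"type": "simple"},         # single object
    })
}
```

Every value under "metadata" is then a list, so a consumer can always iterate without first checking the type.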

There is one important subtlety to be aware of: in the real-time (Lucene) index used for queries, metadata fields across all sources must share the same type. Because metadata field names can be specified on a per-source basis, the following steps are taken:

  • Only 2 types are supported normally: objects (eg see XML metadata below), and strings (ie any "atomic" field)
  • In the index only, and therefore only applying to free text searches (and in the future custom aggregations), object field names have "__obj" appended. So to search on the "atomic_field" value of the "example" object in a free text query, you would use the Lucene syntax "example__obj.atomic_field: <value>".
  • Some special field name patterns are supported and if used allow different types to be stored:
    • *__double, *__long, *__bool: force the values to the specified types.
    • *__dateISO: errors unless the value is a date in ISO format; if it is, stores as a date.
    • *__dateTimeJava: errors unless the value is a date in format "MM/dd/yy hh:mm a||MM/dd/yy||MMM dd, yyyy hh:mm:ss a||MMM dd, yyyy"; if it is, stores as a date.
    • *__dateYYYMMDD: errors unless the value is a date in format "yyyyMMdd"; if it is, stores as a date.
    • *__dateRFC822: errors unless the value is a date in format "EEE, dd MMM yyyy HH:mm:ss Z"; if it is, stores as a date.
    • *__dateGMT: errors unless the value is a date in format "dd MMM yyyy HH:mm:ss 'GMT'"; if it is, stores as a date.
    • *__discard: does not add the field to the index.
    • *__term: indexes the field as a phrase (ie searching for "example" in "field__term":"this is an example" would return no matches; a match would only be returned if the entire field matched).
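
The naming rules above can be sketched as a small lookup. This is an illustration only: the function names and the returned type labels are hypothetical, the suffixes are copied as they appear in the list, and it assumes that only un-suffixed object fields receive the "__obj" suffix.

```python
# Suffixes copied from the list above; the type labels are illustrative only.
SUFFIX_TYPES = {
    "__double": "double",
    "__long": "long",
    "__bool": "boolean",
    "__dateISO": "date",
    "__dateTimeJava": "date",
    "__dateYYYMMDD": "date",
    "__dateRFC822": "date",
    "__dateGMT": "date",
    "__discard": None,   # field is not added to the index
    "__term": "term",    # matched only against the entire field value
}

def index_type(field_name, sample_value):
    """Return the index type implied by the field name, else object/string."""
    for suffix, type_name in SUFFIX_TYPES.items():
        if field_name.endswith(suffix):
            return type_name
    return "object" if isinstance(sample_value, dict) else "string"

def index_field_name(field_name, sample_value):
    """Object fields get "__obj" appended in the index (assumption: only these)."""
    if index_type(field_name, sample_value) == "object":
        return field_name + "__obj"
    return field_name

def lucene_term(object_field, nested_field, value):
    """Build a free text query term for a field nested inside an object."""
    return f"{object_field}__obj.{nested_field}: {value}"
```

For instance, `lucene_term("example", "atomic_field", "x")` produces a term in the same form as the Lucene syntax quoted above.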

Field names are also subject to the following encodings:

  • The "." character (which is forbidden in MongoDB field names) is encoded to "%2e"
  • The "%" character is encoded to "%25"
    • (which guarantees that URLDecoder.decode(encoded_metadata_field_name) is the original pre-encoded name)
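
A short sketch of this encoding, assuming Python's `urllib.parse.unquote` as a stand-in for `URLDecoder.decode` (the helper names are hypothetical). The "%" character has to be escaped first, otherwise the "%" introduced by "%2e" would itself be escaped:

```python
from urllib.parse import unquote

def encode_field_name(name):
    """Escape "%" first, then ".", so "%2e" is not double-encoded."""
    return name.replace("%", "%25").replace(".", "%2e")

def decode_field_name(encoded):
    """Percent-decoding restores the original field name."""
    return unquote(encoded)

original = "geo.lat%"            # illustrative field name
encoded = encode_field_name(original)
decoded = decode_field_name(encoded)
```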

The remainder of this section describes the different ways in which the metadata can currently be constructed from the source data.

Metadata Generated from RSS

Any source-specific metadata in RSS is added under the "_FEED_METADATA" object. For example, the following Twitter-specific RSS object:

  <item>
    <title>(TITLE)</title>
    <description>(DESCRIPTION)</description>
    <pubDate>Thu, 26 Apr 2012 20:17:31 +0000</pubDate>
    <link>(URL)</link>
    <twitter:source>&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;</twitter:source>
  </item>

is rendered like this:

{
	"metadata": {		
		"_FEED_METADATA": [{
			"twitter:source": "&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;"
		}]
	}
}

Object/array nesting inside XML is supported and mapped into JSON as you'd expect.
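
The following sketch shows one way such a mapping could work. It assumes that "source-specific" means namespaced child elements of the item, and it uses a made-up namespace URI for the "twitter" prefix (a real feed declares its own); neither assumption is confirmed by the platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "twitter" prefix.
TWITTER_NS = "http://example.com/twitter-ns"

ITEM_XML = f"""<item xmlns:twitter="{TWITTER_NS}">
  <title>(TITLE)</title>
  <link>(URL)</link>
  <twitter:source>Twitter for iPhone</twitter:source>
</item>"""

def feed_metadata(item_xml, prefixes):
    """Collect namespaced child elements of an RSS item under _FEED_METADATA."""
    extras = {}
    for child in ET.fromstring(item_xml):
        if child.tag.startswith("{"):
            # ElementTree expands "prefix:name" to "{uri}name"; map it back.
            uri, local = child.tag[1:].split("}", 1)
            extras[f"{prefixes.get(uri, uri)}:{local}"] = child.text
    return {"metadata": {"_FEED_METADATA": [extras]}}

result = feed_metadata(ITEM_XML, {TWITTER_NS: "twitter"})
```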

 

Metadata Generated from Databases

Data in (RDBMS) databases are organized into tables, eg:

rowA   | rowB   | ... | rowN
valA1  | valB1  | ... | valN1
valA2  | valB2  | ... | valN2
...    | ...    | ... | ...
valAm  | valBm  | ... | valNm

Individual values in the database are normally atomic (integers, strings, floating point numbers), although they can also be arrays (this is rarely used).

In Community Edition, each row generates a separate document (ie record), as described in the Database extractor. Within these documents, the column names are the metadata fields, and the values are the entries.

If the entries are arrays then they generate multi-value arrays in the JSON; otherwise they generate single-value arrays as described above.

Generic database metadata
// Document generated from row "n"
{
   // Rest of document, then
   "metadata": {
      "rowA": [ valAn ],
      "rowB": [ valBn ],
      //...
      "rowF": [ valFn_1, valFn_2, ..., valFn_q ], // Example array entry in database
      //...
      "rowN": [ valNn ]
  }
}

The next code block shows an example of a real metadata block, generated from the following database record:

Column            | Value
nid               | 944913
ccn               | 11001478
reportdatetime    | "Jan 4, 2011 12:00:00 AM"
shift             | "UNK"
offense           | "BURGLARY"
method            | 2
blocksiteaddress  | "600 B/O 8TH ST NE"
latitude          | "38.89812067433020"
longitude         | "-76.99496375343240"
city              | "WASHINGTON"
state             | "DC"
ward              | 6
anc               | "6A"
smd               | "6A02"
district          | "FIRST"
psa               | 102

Generates:

Real metadata object generated from database entry
{
        "metadata" : {
                "nid" : [
                        944913
                ],
                "ccn" : [
                        11001478
                ],
                "reportdatetime" : [
                        "Jan 4, 2011 12:00:00 AM"
                ],
                "shift" : [
                        "UNK"
                ],
                "offense" : [
                        "BURGLARY"
                ],
                "method" : [
                        "2"
                ],
                "blocksiteaddress" : [
                        "600 B/O 8TH ST NE"
                ],
                "latitude" : [
                        "38.89812067433020"
                ],
                "longitude" : [
                        "-76.99496375343240"
                ],
                "city" : [
                        "WASHINGTON"
                ],
                "state" : [
                        "DC"
                ],
                "ward" : [
                        6
                ],
                "anc" : [
                        "6A"
                ],
                "smd" : [
                        "6A02"
                ],
                "district" : [
                        "FIRST"
                ],
                "psa" : [
                        102
                ]
        }
}
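
The row-to-document mapping can be sketched as follows, using an in-memory SQLite table as a stand-in for the real database. The table name and the helper `row_to_metadata` are hypothetical, and array-valued entries are left out because SQLite has no array columns.

```python
import sqlite3

def row_to_metadata(row):
    """Map one sqlite3.Row to a metadata object: column name -> single-value array."""
    return {"metadata": {column: [row[column]] for column in row.keys()}}

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.execute("CREATE TABLE incidents (nid INTEGER, offense TEXT, ward INTEGER)")
conn.execute("INSERT INTO incidents VALUES (944913, 'BURGLARY', 6)")

# One document per row, as described above.
docs = [row_to_metadata(row) for row in conn.execute("SELECT * FROM incidents")]
conn.close()
```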

Metadata Generated from "Documents" (PDF, doc, docx, ppt, pptx)

"Office" documents can generate various metadata fields. They are contained in an object called "_FILE_METADATA". Examples include:

  • "title"
  • "Author"
  • "Creation-Date"
  • "Original-Date"
  • "Last-Modified"
  • "latitude"
  • "longitude"

See the documentation for Tika (the underlying technology) for a more complete list.

Metadata Generated from XML

XML documents can be very complex, containing arbitrary levels of nesting. Also, it is not possible without the XML specification to know what type the fields are.

Aside from this field type issue, XML documents can always be converted into JSON objects, with repeated fields turned into arrays. The typing issue is worked around by treating everything as strings.

For example,

<root>
   <value1>1</value1>
   <object>
      <nested1>string</nested1>
      <nested1>-2</nested1>
      <nested_object>
           <nested11>1.0</nested11>
      </nested_object>
   </object>
</root>

Can be converted into the following JSON object:

{
   "value1": "1",
   "object": {
      "nested1": [ "string", "-2" ],
      "nested_object": {
         "nested11": "1.0"
      }
   }
}

This maps into the document metadata as follows, with each top-level value wrapped in an array:

{
   // (Rest of document)
   "metadata": {
      "value1": [ "1" ],
      "object": [ {
         "nested1": [ "string", "-2" ],
         "nested_object": {
            "nested11": "1.0"
         }
      } ]
   }
}
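
The conversion can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration of the idea rather than the platform's converter: the function names are hypothetical, attributes are ignored, and every leaf is kept as a string.

```python
import xml.etree.ElementTree as ET

def element_to_json(element):
    """Convert an element to a string (leaf) or a dict (has children)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_json(child)
        if child.tag in result:
            # Repeated field: promote the existing value to a list and append.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

def to_metadata(root):
    """Wrap each top-level field in an array unless it already is one."""
    converted = element_to_json(root)
    return {k: v if isinstance(v, list) else [v] for k, v in converted.items()}

XML = """<root>
   <value1>1</value1>
   <object>
      <nested1>string</nested1>
      <nested1>-2</nested1>
      <nested_object>
           <nested11>1.0</nested11>
      </nested_object>
   </object>
</root>"""

metadata = to_metadata(ET.fromstring(XML))
```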

There is one further subtlety worth noting. Often in XML documents, lists are nested, eg:

<root>
   <elementList>
      <element>value1</element>
      <element><nested>value2</nested></element>
      <element>value3</element>
   </elementList>
</root>

This would get converted into the following metadata object:

{
   // (Rest of document)
   "metadata": {
      "elementList": [ {
         "element": [ "value1", { "nested": "value2" }, "value3" ]
      } ]
   }
}

Aside from the unnecessary extra level of nesting, the double array is ungainly. The XML extraction configuration allows specified XML elements to be ignored, eg nested lists such as "elementList" in the above example, resulting in the much more palatable:

{
   // (Rest of document)
   "metadata": {
      "element": [ "value1", { "nested": "value2" }, "value3" ]
   }
}
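
One way to picture the effect of ignoring an element is to promote its children into its parent before conversion. The sketch below does this on an `xml.etree.ElementTree` tree; the function name and the `ignored` parameter are hypothetical and do not reflect the actual XML extraction configuration.

```python
import xml.etree.ElementTree as ET

def promote_children(parent, ignored):
    """Replace each ignored child element with its own children, in place."""
    for child in list(parent):
        promote_children(child, ignored)  # handle deeper levels first
        if child.tag in ignored:
            position = list(parent).index(child)
            parent.remove(child)
            # Re-insert the grandchildren where the ignored element was.
            for offset, grandchild in enumerate(list(child)):
                parent.insert(position + offset, grandchild)

root = ET.fromstring("""<root>
   <elementList>
      <element>value1</element>
      <element><nested>value2</nested></element>
      <element>value3</element>
   </elementList>
</root>""")

promote_children(root, {"elementList"})
tags = [child.tag for child in root]
```

After the call, the three "element" nodes are direct children of the root, so a subsequent conversion yields a single "element" array.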

Here is a real-world example of the metadata generated by a complex XML object:

Metadata generated from an XML object
{
        "metadata" : {
                "summary" : [
                        "On 7 May 2004, in Nicosia, Cyprus, three small bombs exploded at the facility of the Cyprus Media Group, causing only minor damage and no casualties.  No group claimed responsibility."
                ],
                "perpetrator" : [
                        {
                                "nationality" : "Unknown",
                                "characteristic" : "Unknown"
                        }
                ],
                "location" : [
                        {
                                "region" : "Europe",
                                "citystateprovince" : {
                                        "stateprovince" : "Nicosia",
                                        "city" : "Nicosia"
                                },
                                "country" : "Cyprus"
                        }
                ],
                "subject" : [
                        "Newspaper offices damaged in bombing in Nicosia, Cyprus"
                ],
                "icn" : [
                        "200460104"
                ],
                "multipledays" : [
                        "No"
                ],
                "incidentdate" : [
                        "05/07/2004"
                ],
                "ied" : [
                        "No"
                ],
                "facility" : [
                        {
                                "indicator" : "Targeted",
                                "nationality" : "Cyprus",
                                "targetedcharacteristic" : "Unknown",
                                "definingcharacteristic" : "Unknown",
                                "damage" : "Light",
                                "combatant" : "No",
                                "quantity" : "1",
                                "facilitytype" : "Business"
                        }
                ],
                "approximatedate" : [
                        "No"
                ],
                "assassination" : [
                        "No"
                ],
                "weapontype" : [
                        "Explosive"
                ],
                "eventtype" : [
                        "Bombing"
                ],
                "suicide" : [
                        "No"
                ]
        }
}

Metadata Generated from JSON

The JSON element is written to "metadata.json[0]".

Metadata Generated from Single Line Records

The fields from the single line records are written into "metadata.csv[0]". For more information, see File extractor.
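
As a rough sketch of the resulting shape, assuming "metadata.csv[0]" is an object keyed by column name and that column names are supplied separately (both are assumptions; the helper name is hypothetical):

```python
import csv

def record_to_metadata(line, column_names):
    """Parse one delimited line and store its fields under metadata.csv[0]."""
    values = next(csv.reader([line]))
    return {"metadata": {"csv": [dict(zip(column_names, values))]}}

# Illustrative record; all values stay strings after parsing.
doc = record_to_metadata('944913,"BURGLARY",6', ["nid", "offense", "ward"])
```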

Metadata Generated Using "Document Metadata"

You can use the source pipeline to specify regular expressions ("regex") or JavaScript that set the document metadata fields (eg title, description, publishedDate) by extracting strings from the text of documents.

For more information, see Document metadata and Content metadata.