Metadata can be generated in one of the following ways:
- By importing RSS feeds
- By importing database records
- By importing "office style documents" (PDFs, Word files)
- By importing XML objects
- By importing JSON objects
- By importing single-line records
- By using contentMetadata
Metadata can be used in a few different ways:
- To build entities and events using the entities and associations elements of the source pipeline
- To perform source-specific queries and aggregations
- To use in domain-specific widgets, visualizations, or code outside of the Community Edition platform
Field Guide
The basic format of the "metadata" sub-object is a set of (field, value) pairs, where the value is always an array (often of size 1) of either atomic types or objects (arbitrarily nested).
```
{
    // (rest of the document object)
    "metadata": {
        "field1__double": [ 1.0 ], // (single atomic type)
        "field2": [ "1", "2", "3", "4" ], // (array of atomic types)
        "field3": [ { "type": "simple" } ], // (single simple object)
        "field4": [ { "type": { "nested": true } } ], // (single nested object)
        "field5": [ { "type": "simple", "index": 1 }, { "type": { "nested": true }, "index": 2 } ], // (array of objects)
        // etc
    }
}
```
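Because every value is an array, code that consumes the "metadata" object usually has to unwrap it. The following is a minimal Python sketch, assuming the document has already been loaded as a dict; the helper names `metadata_values` and `first_value` are hypothetical and not part of the platform.

```python
from typing import Any, List, Optional


def metadata_values(doc: dict, field: str) -> List[Any]:
    """Return the value array for a metadata field, or [] if it is absent."""
    return doc.get("metadata", {}).get(field, [])


def first_value(doc: dict, field: str) -> Optional[Any]:
    """Return the first entry of a metadata field, or None if absent or empty."""
    values = metadata_values(doc, field)
    return values[0] if values else None


# Example document in the format described above (hypothetical values)
doc = {
    "metadata": {
        "field1__double": [1.0],
        "field2": ["1", "2", "3", "4"],
        "field4": [{"type": {"nested": True}}],
    }
}

print(first_value(doc, "field1__double"))            # 1.0
print(metadata_values(doc, "field2"))                # ['1', '2', '3', '4']
print(first_value(doc, "field4")["type"]["nested"])  # True
print(first_value(doc, "missing"))                   # None
```

Returning an empty list or None for a missing field keeps callers from having to check for the field before every access.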
Info |
---|
There is one important subtlety to be aware of: in the real-time (Lucene) index used for queries, a metadata field must have the same type across all sources. Since metadata field names can be specified on a per-source basis, the following steps are taken:
|
Info |
---|
One other field naming issue to be aware of is the following encodings:
|
The remainder of this section describes the different ways in which the metadata can currently be constructed from the source data.
Metadata Generated from RSS
Any source-specific metadata in RSS is added under the "_FEED_METADATA" object. For example, the following Twitter-specific RSS object:
```xml
<item>
    <title>(TITLE)</title>
    <description>(DESCRIPTION)</description>
    <pubDate>Thu, 26 Apr 2012 20:17:31 +0000</pubDate>
    <link>(URL)</link>
    <twitter:source><a href="http://twitter.com/#!/download/iphone" rel="nofollow">Twitter for iPhone</a></twitter:source>
</item>
```
Is rendered like this:
```json
{
    "metadata": {
        "_FEED_METADATA": [{
            "twitter:source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
        }]
    }
}
```
Info |
---|
Object/array nesting inside XML is supported and mapped into JSON as you'd expect. |
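
Feed-specific fields can then be read back from the first "_FEED_METADATA" entry. The sketch below assumes the document is available as a Python dict; `feed_field` and `strip_tags` are hypothetical helpers, and the tag stripping is deliberately crude.

```python
import re

# Document fragment matching the rendered example above
doc = {
    "metadata": {
        "_FEED_METADATA": [{
            "twitter:source": '<a href="http://twitter.com/#!/download/iphone" '
                              'rel="nofollow">Twitter for iPhone</a>'
        }]
    }
}


def feed_field(doc, name):
    """Return a feed-specific field from the first _FEED_METADATA entry, or None."""
    entries = doc.get("metadata", {}).get("_FEED_METADATA", [])
    return entries[0].get(name) if entries else None


def strip_tags(html):
    """Remove anything that looks like a tag; adequate only for short snippets."""
    return re.sub(r"<[^>]+>", "", html)


source = feed_field(doc, "twitter:source")
print(strip_tags(source))  # Twitter for iPhone
```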
Metadata Generated From Databases
Data in (RDBMS) databases are organized into tables, eg:
rowA | rowB | ... | rowN |
---|---|---|---|
valA1 | valB1 | ... | valN1 |
valA2 | valB2 | ... | valN2 |
... | ... | ... | ... |
valAm | valBm | ... | valNm |
The individual entries in the database are atomic values (integers, strings, floating-point numbers), although they can also be arrays (this is rarely used).
In Community Edition, each row generates a separate document (ie record), as described in the Database extractor. Within these documents, the column names ("rowA" to "rowN" in the table above) are the metadata fields, and the row's values are the entries.
If an entry is an array, it generates a multi-value array in the JSON; otherwise it generates a single-value array, as described above.
```
// Document generated from row "n"
{
    // Rest of document, then
    "metadata": {
        "rowA": [ valAn ],
        "rowB": [ valBn ],
        //...
        "rowF": [ valFn_1, valFn_2, ..., valFn_q ], // Example array entry in database
        //...
        "rowN": [ valNn ]
    }
}
```
As a real example, the following database record:
nid | ccn | reportdatetime | shift | offense | method | blocksiteaddress | latitude | longitude | city | state | ward | anc | smd | district | psa |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
944913 | 11001478 | "Jan 4, 2011 12:00:00 AM" | "UNK" | "BURGLARY" | 2 | "600 B/O 8TH ST NE" | "38.89812067433020" | "-76.99496375343240" | "WASHINGTON" | "DC" | 6 | "6A" | "6A02" | "FIRST" | 102 |
Generates:
```json
{
    "metadata" : {
        "nid" : [ 944913 ],
        "ccn" : [ 11001478 ],
        "reportdatetime" : [ "Jan 4, 2011 12:00:00 AM" ],
        "shift" : [ "UNK" ],
        "offense" : [ "BURGLARY" ],
        "method" : [ "2" ],
        "blocksiteaddress" : [ "600 B/O 8TH ST NE" ],
        "latitude" : [ "38.89812067433020" ],
        "longitude" : [ "-76.99496375343240" ],
        "city" : [ "WASHINGTON" ],
        "state" : [ "DC" ],
        "ward" : [ 6 ],
        "anc" : [ "6A" ],
        "smd" : [ "6A02" ],
        "district" : [ "FIRST" ],
        "psa" : [ 102 ]
    }
}
```
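The row-to-metadata mapping described in this section can be sketched in a few lines of Python. This is an illustration of the rule (column name becomes the field, each entry becomes an array), not the Database extractor's actual code; `row_to_metadata` is a hypothetical name.

```python
from typing import Any, Dict, List, Sequence


def row_to_metadata(columns: Sequence[str], row: Sequence[Any]) -> Dict[str, List[Any]]:
    """Map one database row to a metadata object: column name -> array of values."""
    metadata: Dict[str, List[Any]] = {}
    for name, value in zip(columns, row):
        if isinstance(value, (list, tuple)):
            metadata[name] = list(value)  # array entry -> multi-value array
        else:
            metadata[name] = [value]      # atomic entry -> single-value array
    return metadata


# A subset of the columns from the record above
columns = ["nid", "ccn", "offense", "city", "ward"]
row = [944913, 11001478, "BURGLARY", "WASHINGTON", 6]

metadata = row_to_metadata(columns, row)
print(metadata["offense"])  # ['BURGLARY']
print(metadata["ward"])     # [6]
```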
Metadata Generated from "Documents" (PDF, doc, docx, ppt, pptx)
"Office" documents can generate various metadata fields. They are contained in an object called "_FILE_METADATA". Examples include:
- "title"
- "Author"
- "Creation-Date"
- "Original-Date"
- "Last-Modified"
- "latitude"
- "longitude"
See Tika (the underlying technology) for a more complete list.
Metadata Generated from XML
XML documents can be very complex, containing arbitrary levels of nesting. Also, without the XML specification it is not possible to know the types of the fields.
Aside from this field type issue, XML documents can always be converted into JSON objects, with repeated fields turned into arrays. The typing issue is worked around by treating everything as strings.
For example,
```xml
<root>
    <value1>1</value1>
    <object>
        <nested1>string</nested1>
        <nested1>-2</nested1>
        <nested_object>
            <nested11>1.0</nested11>
        </nested_object>
    </object>
</root>
```
Can be converted into the following JSON object:
```json
{
    "value1": "1",
    "object": {
        "nested1": [ "string", "-2" ],
        "nested_object": {
            "nested11": "1.0"
        }
    }
}
```
This object is then mapped into the document metadata by wrapping each top-level field in an array:
```
{
    // (Rest of document)
    "metadata": {
        "value1": [ "1" ],
        "object": [
            {
                "nested1": [ "string", "-2" ],
                "nested_object": {
                    "nested11": "1.0"
                }
            }
        ]
    }
}
```
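The conversion rules above (everything is a string, repeated fields become arrays, top-level fields are wrapped in arrays) can be sketched with Python's standard `xml.etree.ElementTree` module. This illustrates the rules only and is not the platform's XML extractor; `element_to_json` and `xml_to_metadata` are hypothetical names, attributes are ignored, and the root element is assumed to have child elements.

```python
import xml.etree.ElementTree as ET


def element_to_json(elem):
    """Convert an element to a string (leaf) or a dict (element with children)."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_json(child)
        if child.tag in result:
            # Repeated field: turn the existing value into an array, then append
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_metadata(xml_text):
    """Wrap each top-level field in an array, as the metadata format requires."""
    converted = element_to_json(ET.fromstring(xml_text.strip()))
    return {k: v if isinstance(v, list) else [v] for k, v in converted.items()}


xml_text = """
<root>
    <value1>1</value1>
    <object>
        <nested1>string</nested1>
        <nested1>-2</nested1>
        <nested_object><nested11>1.0</nested11></nested_object>
    </object>
</root>
"""

metadata = xml_to_metadata(xml_text)
print(metadata["value1"])                # ['1']
print(metadata["object"][0]["nested1"])  # ['string', '-2']
```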
There is one further subtlety worth noting. Often in XML documents, lists are nested, eg:
```xml
<root>
    <elementList>
        <element>value1</element>
        <element><nested>value2</nested></element>
        <element>value3</element>
    </elementList>
</root>
```
This would get converted into the following metadata object:
```
{
    // (Rest of document)
    "metadata": {
        "elementList": [
            {
                "element": [ "value1", { "nested": "value2" }, "value3" ]
            }
        ]
    }
}
```
Aside from the unnecessary extra level of nesting, the double array is ungainly. The XML extraction configuration allows specified XML elements to be ignored, eg nested lists such as "elementList" in the above example, resulting in the much more palatable:
```
{
    // (Rest of document)
    "metadata": {
        "element": [ "value1", { "nested": "value2" }, "value3" ]
    }
}
```
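The effect of ignoring an element can be pictured as removing that level and promoting its children into the parent. The sketch below applies this as a post-processing step on an already converted metadata dict; it is an assumption about the observable result, not the XML extractor's implementation, and `drop_levels` is a hypothetical name. If two promoted children share a name, the later one overwrites the earlier one in this sketch.

```python
def drop_levels(value, ignored):
    """Recursively remove ignored wrapper fields, promoting their children."""
    if isinstance(value, list):
        return [drop_levels(v, ignored) for v in value]
    if not isinstance(value, dict):
        return value
    result = {}
    for key, child in value.items():
        child = drop_levels(child, ignored)
        if key in ignored:
            # Promote the fields of each ignored object into the parent
            for item in (child if isinstance(child, list) else [child]):
                if isinstance(item, dict):
                    result.update(item)
        else:
            result[key] = child
    return result


metadata = {
    "elementList": [
        {"element": ["value1", {"nested": "value2"}, "value3"]}
    ]
}

flattened = drop_levels(metadata, {"elementList"})
print(flattened)  # {'element': ['value1', {'nested': 'value2'}, 'value3']}
```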
Here is a real-world example of the metadata generated by a complex XML object:
```json
{
    "metadata" : {
        "summary" : [ "On 7 May 2004, in Nicosia, Cyprus, three small bombs exploded at the facility of the Cyprus Media Group, causing only minor damage and no casualties. No group claimed responsibility." ],
        "perpetrator" : [
            { "nationality" : "Unknown", "characteristic" : "Unknown" }
        ],
        "location" : [
            {
                "region" : "Europe",
                "citystateprovince" : { "stateprovince" : "Nicosia", "city" : "Nicosia" },
                "country" : "Cyprus"
            }
        ],
        "subject" : [ "Newspaper offices damaged in bombing in Nicosia, Cyprus" ],
        "icn" : [ "200460104" ],
        "multipledays" : [ "No" ],
        "incidentdate" : [ "05/07/2004" ],
        "ied" : [ "No" ],
        "facility" : [
            {
                "indicator" : "Targeted",
                "nationality" : "Cyprus",
                "targetedcharacteristic" : "Unknown",
                "definingcharacteristic" : "Unknown",
                "damage" : "Light",
                "combatant" : "No",
                "quantity" : "1",
                "facilitytype" : "Business"
            }
        ],
        "approximatedate" : [ "No" ],
        "assassination" : [ "No" ],
        "weapontype" : [ "Explosive" ],
        "eventtype" : [ "Bombing" ],
        "suicide" : [ "No" ]
    }
}
```
Metadata Generated from JSON
The JSON element is written to "metadata.json[0]".
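In other words, the imported object is the single entry of the "json" metadata field. A minimal sketch of reading it back, assuming the document is available as a Python dict; the input record is hypothetical:

```python
import json

# Hypothetical source record, as it might appear in the imported JSON
source_record = json.loads('{"id": 42, "tags": ["a", "b"]}')

# Resulting document fragment: the record sits at metadata.json[0]
doc = {"metadata": {"json": [source_record]}}

record = doc["metadata"]["json"][0]
print(record["id"])    # 42
print(record["tags"])  # ['a', 'b']
```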
Metadata Generated from Single Line Records
The fields from single-line records are written into "metadata.csv[0]". For more information, see File extractor.
Metadata Generated Using "Document Metadata"
You can use the source pipeline to specify regular expressions ("regex") or JavaScript that set the document metadata fields (eg title, description, publishedDate) by extracting strings from the text of documents.
For more information, see Document metadata and Content metadata.
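To illustrate the idea of regex-based extraction (not the source pipeline's configuration syntax, which is covered in the pages referenced above), here is a Python sketch that pulls a title and a published date out of document text. The "Title:" and "Date:" line formats, the patterns, and the function name `extract_document_metadata` are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical patterns: one "Title: ..." line and one "Date: YYYY-MM-DD" line
TITLE_RE = re.compile(r"^Title:\s*(.+)$", re.MULTILINE)
DATE_RE = re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE)


def extract_document_metadata(text):
    """Return the document fields that could be extracted from the text."""
    fields = {}
    match = TITLE_RE.search(text)
    if match:
        fields["title"] = match.group(1).strip()
    match = DATE_RE.search(text)
    if match:
        fields["publishedDate"] = datetime.strptime(match.group(1), "%Y-%m-%d")
    return fields


text = "Title: Quarterly incident summary\nDate: 2012-04-26\n\nBody text..."
fields = extract_document_metadata(text)
print(fields["title"])  # Quarterly incident summary
```

Fields whose pattern does not match are simply left out, so a caller can fall back to whatever the extractor set by default.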