Document JSON format

Documents

The document is one of the key objects returned in response to a query against the Community Edition platform.

Documents contain their own fields as well as sub-objects such as entities, associations, metadata and query enrichment.

Document object format
{    
    "data": [        
        {
            // Administrative metadata:
            "_id": string, // A unique and immutable ID for the document (DB-side this changes with every update; "updateId" contains the invariant value)
            "updateId": string, // (DB-side: this is the immutable ID, copied to "_id" when accessed via the API)
            "created": string, // A Java format date representing when the document was harvested
            "modified": string, // A Java format date representing when the document was last modified (eg by an update)

            // Content metadata:
            "title": string, // The title of the document, as set by RSS (RSS feeds), or source-specified for files/databases
            "description": string, // A summary of the document, as set by RSS (RSS feeds), or source-specified for files/databases
            "url": string, // The location of the document. For feed types other than RSS, its specification is described below
            "displayUrl": string, // Optional: an alternative URL (playing no functional role), should be used instead of the url for display when present
            "sourceUrl": string, // For files containing large numbers of documents, this is the URL of the master file (see below)
            "publishedDate": string, // A Java format date representing when the content was published (not when it was added to Infinit.e)
            "docGeo": {
                "lat": number, // 0+, the "docGeo" object contains the lat and long in degrees of tagged documents. Docs can be tagged using GeoRSS or in source-specific ways
                "lon": number
            },

            // Content:
            "fullText": string, // Not normally present - see below under "Document Content" sub-heading

            // Source metadata:
            "source": [ string ], // The "description" field of the document's parents in "sources" (normally just one entry)
            "sourceKey": [ string ], // The "key" field of the document's parents in "sources" (normally just one entry)
            "sourceType": string, // The source type from which the document was harvested: "feed" (eg RSS), "file", or "db"
            "mediaType": [ string ], // The "mediaType" field of the document's parents in "sources" (normally just one entry)
            "tags": [ string ], // The "tags" field of the document's parent or parents in "sources" (combined if more than one) ... INDEXED AS LOWER CASE (stored in whatever case they were specified)
            "communityId": [ string ], // A list of "_id" fields from the communities for which this document has been harvested (see below, normally just one entry)
            "index": string, // (Ignore, for internal use only: the real-time (Lucene) index in which this document matched the query)

            // Sub-objects:
            "entities": [
                { ... } // 0+ Entity objects, see link below
            ],
            "associations": [
                { ... } // 0+ Association objects, see link below
            ],
            "metadata": { ... }, // 0-1, The metadata object, see link below

            // Query enrichment:
            "aggregateSignif": number, // Per-query normalized significance, see below for link to basic scoring documentation
            "queryRelevance": number, // Per-query normalized (Lucene) relevance, see below
            "score": number // Per-query combined normalized significance/relevance scores, see below (this is the field used to rank documents)
        }
    ]
}

Field Guide

Administrative Metadata

There is one non-obvious characteristic of the "created" and "modified" fields, applying only to documents generated from file shares. In these cases, "created" is the date when the harvest occurred, and "modified" is the file time (which can obviously be earlier).
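
The example documents later on this page show these dates as strings such as "Wed May 25 07:04:00 EDT 2011". The following is a minimal Python sketch for parsing them, assuming that layout holds for all documents. The helper name parse_doc_date is hypothetical, and the timezone abbreviation is discarded rather than interpreted, so the result is a naive datetime.

```python
from datetime import datetime


def parse_doc_date(value: str) -> datetime:
    """Parse a date such as 'Wed May 25 07:04:00 EDT 2011'.

    ASSUMPTION: the layout is inferred from the examples on this page.
    The timezone abbreviation (5th token) is dropped, because abbreviations
    like 'EDT' are ambiguous and not portably handled by strptime.
    """
    parts = value.split()
    if len(parts) != 6:
        raise ValueError(f"unexpected date layout: {value!r}")
    without_tz = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_tz, "%a %b %d %H:%M:%S %Y")


# For a file-share document, "modified" (the file time) may precede
# "created" (the harvest time). Values taken from the fileshare example.
created = parse_doc_date("Wed Sep 17 19:39:00 EDT 2010")
modified = parse_doc_date("Wed Sep 16 19:39:00 EDT 2010")
file_predates_harvest = modified < created
```

Because the timezone is dropped, comparisons are only meaningful between dates reported in the same zone.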

Source Metadata

The source configuration mentioned in the description of the "source", "sourceUrl", "sourceKey", "mediaType", and "tags" fields is here. To summarize the discussion there about "url" and "sourceUrl":

  • "sourceUrl" is only used when a single file in a share pointed to by the File extractor can contain multiple documents.
  • In this case, the "sourceUrl" points to the file itself (ie many documents will have the same "sourceUrl").
  • In this case, the "url" is by default set to "sourceUrl" + "/" + md5sum("metadata") - ie an approximately unique "url" that has no locational meaning and is just present for deduplication purposes. It is far preferable to use the "XmlSourceName" and "XmlPrimaryKey" fields in the File extractor sub-object of the source pipeline configuration.
    • "XmlSourceName" and "XmlPrimaryKey" can also be used for JSON, in exactly the same way.
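
The default "url" rule above can be sketched in Python as follows. This is an illustration only: the document does not specify how the "metadata" object is serialized before hashing, so the JSON-with-sorted-keys serialization, the function name default_url, and the sample metadata are all assumptions.

```python
import hashlib
import json


def default_url(source_url: str, metadata: dict) -> str:
    """Build sourceUrl + "/" + md5(metadata), following the rule described above.

    ASSUMPTION: the metadata is hashed as UTF-8 JSON with sorted keys; the
    platform's actual serialization is not specified on this page.
    """
    serialized = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    return source_url + "/" + digest


url = default_url(
    "smb://fileshare.local:139/wits/allfiles/WITS_2006_10.xml",
    {"ICN": ["200695603"]},  # hypothetical metadata, for illustration only
)
```

The result is stable for identical metadata, which is what makes it usable for deduplication, but it does not point to a retrievable location.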

The following fields are stored as strings and only converted to arrays when retrieved from queries:

  • source, sourceKey, mediaType, communityId
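
Because these fields can appear either as a bare string or as an array (the examples below show "mediaType" as a string), client code may want to normalize them. A minimal Python sketch, assuming that a bare string represents a single value; the helper names as_list and normalize_doc are hypothetical, and how multiple values are encoded DB-side is not described here.

```python
from typing import Any, List

# Fields listed above as stored as strings but returned as arrays.
ARRAY_FIELDS = ("source", "sourceKey", "mediaType", "communityId")


def as_list(value: Any) -> List[str]:
    """Wrap a bare string in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def normalize_doc(doc: dict) -> dict:
    """Return a shallow copy of doc in which the array fields are always lists."""
    normalized = dict(doc)
    for field in ARRAY_FIELDS:
        if field in normalized:
            normalized[field] = as_list(normalized[field])
    return normalized
```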

The "communityId" field is a list of community IDs, taken from the "communityIds" field of the "source" object. Normally it consists of a single ID, but it can contain several when:

  • The document has been harvested by multiple sources in different communities
  • The sources have the same configuration, ie they generate the same document (they might, for example, use different entity extractors or post-processing, which generates very different document metadata)

The following "mediaType" values are currently used: "News", "Video", "Imagery", "Social", "Discussion", "Blog", "Record" (ie database record), "Report", "Intel" (essentially the same as "Report"). In practice "mediaType" is a freeform string, but it is recommended to try to restrict and manage possible values.
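
One way to manage the possible values is to check them against the list above before a source is published. A minimal Python sketch; the function name unknown_media_types is hypothetical, and nothing here is enforced by the platform.

```python
# The "mediaType" values listed above as currently in use.
KNOWN_MEDIA_TYPES = frozenset({
    "News", "Video", "Imagery", "Social", "Discussion",
    "Blog", "Record", "Report", "Intel",
})


def unknown_media_types(values):
    """Return the values that are not in the recommended set, in input order."""
    return [v for v in values if v not in KNOWN_MEDIA_TYPES]
```

For example, unknown_media_types(["News", "Podcast"]) flags "Podcast" as a value to review.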

Document Content

Normally, the "fullText" field is discarded from the document metadata and instead stored in the "gzip_content" collection of the "doc_content" database in the format described here. There are some exceptions to this (eg database records) - this is also described here.

The "Knowledge - Document - Get" API call has a "return FullText" option that will re-inject the full text where available.

Content Metadata

"displayUrl" (which can be set in the Document metadata element of the source pipeline) is guaranteed not to be used by the CE platform itself. It is therefore useful for linking documents to external content. For reference, the Community Edition GUI uses it as follows:

  • If it starts with "http://" then it is treated as a web link
  • Otherwise, it is assumed to be a file path relative to the fileshare specified in the source "url" field (eg you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
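
The GUI rule above can be sketched in Python as follows. The function name resolve_display_url is hypothetical, and the concatenation order (the "sourceKey" first, then the "displayUrl") is an assumption. The sketch follows the rule literally, so only the "http://" prefix is treated as a web link.

```python
from typing import Tuple


def resolve_display_url(display_url: str, source_key: str) -> Tuple[str, str]:
    """Classify a "displayUrl" the way the GUI rule above describes.

    Returns ("web", url) for web links, otherwise ("file", path), where path
    is the value that would be passed to the file retrieval call.
    ASSUMPTION: the path is sourceKey followed directly by displayUrl.
    """
    if display_url.startswith("http://"):
        return ("web", display_url)
    return ("file", source_key + display_url)
```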

Query enrichment

The scoring algorithms used to generate the significance, relevance, and aggregate scores for documents are discussed here. An average score is 100.
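
Since "score" is the field used to rank documents, a client holding a response can reproduce the ordering locally. A minimal Python sketch, assuming the response has the "data" array shown in the object format above; the function name rank_documents is hypothetical, and the sample values are taken from the examples on this page.

```python
def rank_documents(response: dict) -> list:
    """Return the documents in "data" sorted by "score", highest first.

    Documents without a "score" are treated as 0.0 (an assumption).
    """
    docs = response.get("data", [])
    return sorted(docs, key=lambda d: d.get("score", 0.0), reverse=True)


sample = {
    "data": [
        {"_id": "4ddfd53f2ba07f612a69260d", "score": 101.69357226562605},
        {"_id": "4e028d22a0ec7d6ef60da760", "score": 150.73734148607468},
        {"_id": "4dcd39d96ab97f61f6dab9a6", "score": 99.47491180174625},
    ]
}
ranked_ids = [d["_id"] for d in rank_documents(sample)]
```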

Example Documents

HTML document from RSS - entities and events disabled, no metadata
{
      "_id": "4ddfd53f2ba07f612a69260d",
      "communityId": [ "4c927585d591d31d7b37097a" ],
      "created": "Wed May 25 07:04:00 EDT 2011",
      "description": "Testimony of National Cybersecurity and Communications Integration Center Director Sean McGurk, NPPD, before the House Committee on Oversight and Government Reform, Subcommittee on National Security, Homeland Defense and Foreign Operations, \"Cybersecurity: Assessing the Immediate Threat to The United States\"",
      "index": "doc_4c927585d591d31d7b37097a",
      "mediaType": "News",
      "modified": "Wed May 25 07:04:00 EDT 2011",
      "publishedDate": "Wed May 25 07:00:00 EDT 2011",
      "source": [ "DHS: National Cyber Security Division" ],
      "sourceKey": [ "http.www.dhs.gov.feeds.press_room.xml" ],
      "tags": [
          "industry:technology",
          "news",
          "topic:technology"
      ],
      "title": "Testimony of National Cybersecurity and Communications Integration Center Director Sean McGurk, NPPD, before the House Committee on Oversight and Government Reform, Subcommittee on National Security, Homeland Defense and Foreign Operations, \"Cybersecurity: Assessing the Immediate Threat to The U.S.",
      "url": "http://www.dhs.gov/ynews/testimony/testimony_1306421842051.shtm",
      "aggregateSignif": 102.87293273096262,
      "queryRelevance": 100.0012728510481,
      "score": 101.69357226562605
}

Entities and Associations

Example entities and associations are shown in the linked pages.

Document from fileshare - entities and events disabled, no metadata
{
    "_id": "4e028d22a0ec7d6ef60da760",
    "associations": [],
    "communityId": [ "4db5c05fb246d25364aceca0" ],
    "created": "Wed Sep 17 19:39:00 EDT 2010",
    "description": "171939Z SEP 08\r\nBAGHDAD: CAR BOMBS ACROSS IRAQ, INCLUDING AT LEAST ONE NEAR THE POLISH EMBASSY IN BAGHDAD, KILLED [...]",
    "docGeo": {
        "lat": 33.37586144623292,
        "lon": 44.47143713123417
    },
    "index": "doc_4db5c05fb246d25364aceca0",
    "mediaType": "Intel",
    "modified": "Wed Sep 16 19:39:00 EDT 2010",
    "publishedDate": "Wed Sep 17 19:39:00 EDT 2008",
    "sourceKey": [ "smb.fileshare.local.139.modus_input." ],
    "tags": [
        "Modus",
        "IED",
        "Intel",
        "extraction",
        "Iraq"
    ],
    "title": "vt123.kl",
    "url": "smb://fileshare.local:139/modus_input/inprocess/vt123.kl",
    "aggregateSignif": 176.07324737430665,
    "queryRelevance": 99.99991570637258,
    "score": 150.73734148607468
}

Example entities and associations are shown in the linked pages.

Generated document from database - entities, associations, and metadata disabled
{
    "_id": "4dcd39d96ab97f61f6dab9a6",
    "communityId": [ "4c927585d591d31d7b37097a" ],
    "created": "Sun Jan 30 00:01:00 EST 2011",
    "description": "Jan 30, 2011 12:00:00 AM: ADW OTHER was reported at the 1200 Block of 1st NW",
    "index": "doc_4c927585d591d31d7b37097a",
    "mediaType": "Record",
    "modified": "Sun Jan 30 00:01:00 EST 2011",
    "publishedDate": "Sun Jan 30 00:00:00 EST 2011",
    "source": [ "data.dc.gov - Crime Incidents (ASP)" ],
    "sourceKey": [ "Bergen.washingtondc.IncidentReport" ],
    "title": "11012990",
    "url": "jdbc:mysql://mysqlserver.local:3306/washingtondc/IncidentReport/956013",
    "aggregateSignif": 99.5614992027838,
    "queryRelevance": 100.0012728510481,
    "score": 99.47491180174625
}

Metadata

Example metadata is shown here.

Generated document from XML file - entities, associations, and metadata disabled
{
    "_id": "4de67aad24757d6e99258a3c",
    "associations": [],
    "communityId": [ "4dd53fb4e40d93afb096c484" ],
    "created": "Wed Sep 17 19:39:00 EDT 2010",
    "description": "On 4 October 2006, late in the morning, near the al-Massudi School in the Camp Sarah neighborhood of central Baghdad, Iraq, assailants detonated two improvised explosive devices (IED) [...]",
    "docGeo": {
        "lat": 33.3386111,
        "lon": 44.3938889
    },
    "index": "doc_4dd53fb4e40d93afb096c484",
    "mediaType": "Report",
    "modified": "Wed Sep 16 19:39:00 EDT 2010",
    "publishedDate": "Wed Oct 04 00:00:00 EDT 2006",
    "source": [ "NCTC WITS Data" ],
    "sourceKey": [ "smb.fileshare.local.139.wits." ],
    "sourceUrl": "smb://fileshare.local:139/wits/allfiles/WITS_2006_10.xml",
    "tags": [
        "incidents",
        "nctc",
        "ied",
        "terrorism",
        "wits",
        "events",
        "worldwide",
        "incident"
    ],
    "title": "3 bodyguards, 18 civilians killed, 15 police officers, 11 bodyguards, 63 civilians wounded in IED and VBIED attacks in Baghdad, Iraq",
    "url": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=200695603",
    "aggregateSignif": 181.3723329799639,
    "queryRelevance": 100.0012728510481,
    "score": 154.28817043245692
}

Note the "url" field, which was constructed from the XML: it is guaranteed to be unique and it points to a location hosting the document (the default "url" does neither).