...

  • "sourceUrl" is only used when a single file in a share pointed to by the File extractor can contain multiple documents.
  • In this case, the "sourceUrl" points to the file itself (ie many documents will have the same "sourceUrl").
  • In this case, the "url" is by default set to "sourceUrl" + "/" + md5sum("metadata") - ie an approximately unique "url" that has no locational meaning and is just present for deduplication purposes. It is far preferable to use the "XmlSourceName" and "XmlPrimaryKey" fields in the File extractor sub-object of the source pipeline configuration.
    • "XmlSourceName" and "XmlPrimaryKey" can also be used for JSON, in exactly the same way.
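
As a rough illustration of the default "url" described above, the following sketch appends an MD5 digest of the metadata to the "sourceUrl". The helper name `defaultUrl`, the `icn` metadata field, and the use of `JSON.stringify` to serialize the metadata before hashing are all assumptions made for this example; the platform's actual input to md5sum may differ.

```javascript
const crypto = require('crypto');

// Hypothetical helper (not part of the platform API): builds a default "url"
// of the form sourceUrl + "/" + md5sum(metadata).
// Assumption: the metadata is serialized with JSON.stringify before hashing.
function defaultUrl(sourceUrl, metadata) {
  const digest = crypto
    .createHash('md5')
    .update(JSON.stringify(metadata))
    .digest('hex');
  return sourceUrl + '/' + digest;
}

// Two documents extracted from the same file share a "sourceUrl" but get
// different "url" values because their metadata differs.
const sharedSourceUrl = 'smb://fileshare.local:139/wits/allfiles/WITS_2006_10.xml';
console.log(defaultUrl(sharedSourceUrl, { icn: ['200695603'] }));
console.log(defaultUrl(sharedSourceUrl, { icn: ['200695604'] }));
```

The resulting value is only useful for deduplication: it does not point at anything retrievable, which is why setting "XmlSourceName" and "XmlPrimaryKey" is preferable.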

...

Conversely, "displayUrl" (which can be set in the Document metadata element of the source pipeline) is guaranteed not to be used by the Infinit.e CE platform. It is therefore useful for linking documents to external content. For reference, the Infinit.e Community Edition GUI uses it as follows:

  • If it starts with "http://" then it is treated as a web link
  • Otherwise, it is assumed to be a file path relative to the fileshare specified in the source "url" field (eg you can use the "Document - File - Get" call with the "sourceKey" concatenated to the "displayUrl" to retrieve the file directly from the fileshare).
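
The two rules above can be sketched as a small function. The name `classifyDisplayUrl`, the returned object shape, and the sample values are hypothetical and only illustrate the logic as described; the GUI's actual implementation is not shown in this page.

```javascript
// Hypothetical helper mirroring the two GUI rules described above.
function classifyDisplayUrl(displayUrl, sourceKey) {
  // Rule 1: anything starting with "http://" is treated as a web link.
  if (displayUrl.startsWith('http://')) {
    return { kind: 'web', target: displayUrl };
  }
  // Rule 2: otherwise it is a path relative to the source's fileshare.
  // The "sourceKey" is concatenated to the "displayUrl", which is the value
  // the text suggests passing to the "Document - File - Get" call.
  return { kind: 'file', target: sourceKey + displayUrl };
}

// Sample values are made up for illustration.
console.log(classifyDisplayUrl('http://www.example.com/report.html', 'smb.fileshare.local.139.wits.'));
console.log(classifyDisplayUrl('allfiles/WITS_2006_10.xml', 'smb.fileshare.local.139.wits.'));
```

As stated, only the "http://" prefix is checked, so under this sketch any other value falls through to the file-path branch.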

The "communityId" field is a list of community IDs, taken from the "communityIds" field of the "source" object. Normally it will just consist of a single ID, but it can also be a list, where:

  • The document has been harvested by multiple sources in different communities
  • The sources have the same configuration, ie generate the same document (they might, for example, use different entity extractors or post-processing and therefore generate very different document metadata)

...

The scoring algorithms used to generate the significance, relevance, and aggregate scores for documents are discussed here. An average score is 100.

Example

...

Documents

Code Block (javascript): HTML document from RSS - entities and events disabled, no metadata
{
      "_id": "4ddfd53f2ba07f612a69260d",
      "communityId": [ "4c927585d591d31d7b37097a" ],
      "created": "Wed May 25 07:04:00 EDT 2011",
      "description": "Testimony of National Cybersecurity and Communications Integration Center Director Sean McGurk, NPPD, before the House Committee on Oversight and Government Reform, Subcommittee on National Security, Homeland Defense and Foreign Operations, \"Cybersecurity: Assessing the Immediate Threat to The United States\"",
      "index": "doc_4c927585d591d31d7b37097a",
      "mediaType": "News",
      "modified": "Wed May 25 07:04:00 EDT 2011",
      "publishedDate": "Wed May 25 07:00:00 EDT 2011",
      "source": [ "DHS: National Cyber Security Division" ],
      "sourceKey": [ "http.www.dhs.gov.feeds.press_room.xml" ],
      "tags": [
          "industry:technology",
          "news",
          "topic:technology"
      ],
      "title": "Testimony of National Cybersecurity and Communications Integration Center Director Sean McGurk, NPPD, before the House Committee on Oversight and Government Reform, Subcommittee on National Security, Homeland Defense and Foreign Operations, \"Cybersecurity: Assessing the Immediate Threat to The U.S.",
      "url": "http://www.dhs.gov/ynews/testimony/testimony_1306421842051.shtm",
      "aggregateSignif": 102.87293273096262,
      "queryRelevance": 100.0012728510481,
      "score": 101.69357226562605
}

Entities and Associations

Example entities and associations are shown in the linked pages.

...

Code Block (javascript): Generated document from database - entities, associations, and metadata disabled
{
    "_id": "4dcd39d96ab97f61f6dab9a6",
    "communityId": [ "4c927585d591d31d7b37097a" ],
    "created": "Sun Jan 30 00:01:00 EST 2011",
    "description": "Jan 30, 2011 12:00:00 AM: ADW OTHER was reported at the 1200 Block of 1st NW",
    "index": "doc_4c927585d591d31d7b37097a",
    "mediaType": "Record",
    "modified": "Sun Jan 30 00:01:00 EST 2011",
    "publishedDate": "Sun Jan 30 00:00:00 EST 2011",
    "source": [ "data.dc.gov - Crime Incidents (ASP)" ],
    "sourceKey": [ "Bergen.washingtondc.IncidentReport" ],
    "title": "11012990",
    "url": "jdbc:mysql://mysqlserver.local:3306/washingtondc/IncidentReport/956013",
    "aggregateSignif": 99.5614992027838,
    "queryRelevance": 100.0012728510481,
    "score": 99.47491180174625
}

Metadata

Example metadata is shown here.

Code Block (javascript): Generated document from XML file - entities, associations, and metadata disabled
{
    "_id": "4de67aad24757d6e99258a3c",
    "associations": [],
    "communityId": [ "4dd53fb4e40d93afb096c484" ],
    "created": "Wed Sep 17 19:39:00 EDT 2010",
    "description": "On 4 October 2006, late in the morning, near the al-Massudi School in the Camp Sarah neighborhood of central Baghdad, Iraq, assailants detonated two improvised explosive devices (IED) [...]",
    "docGeo": {
        "lat": 33.3386111,
        "lon": 44.3938889
    },
    "index": "doc_4dd53fb4e40d93afb096c484",
    "mediaType": "Report",
    "modified": "Wed Sep 16 19:39:00 EDT 2010",
    "publishedDate": "Wed Oct 04 00:00:00 EDT 2006",
    "source": "NCTC WITS Data",
    "sourceKey": [ "smb.fileshare.local.139.wits." ],
    "sourceUrl": [ "smb://fileshare.local:139/wits/allfiles/WITS_2006_10.xml" ],
    "tags": [
        "incidents",
        "nctc",
        "ied",
        "terrorism",
        "wits",
        "events",
        "worldwide",
        "incident"
    ],
    "title": "3 bodyguards, 18 civilians killed, 15 police officers, 11 bodyguards, 63 civilians wounded in IED and VBIED attacks in Baghdad, Iraq",
    "url": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=200695603",
    "aggregateSignif": 181.3723329799639,
    "queryRelevance": 100.0012728510481,
    "score": 154.28817043245692
}

...

Info

Note the "url" field, which was constructed from the XML: it is guaranteed to be unique and it points to a location hosting the document (the default "url" provides neither of those things).

Document Content

Normally, the "fullText" field is discarded from the document metadata and instead stored in the "gzip_content" collection of the "doc_content" database in the format described here. There are some exceptions to this (eg database records) - this is also described here.

...