Entity JSON format

Entity Format

As noted in the parent section, the entity format is essentially the same whether entities appear as sub-objects of the document JSON or as aggregated objects in their own right (i.e. the "entities" array in the query reply object). Aggregated entities differ only as follows:

  • No "actual_name" field, since many different values can occur across the documents containing that entity (the Knowledge - Feature - Alias Suggest query call can be used to obtain these "aliases").
  • No "relevance" or "sentiment" statistics, since these are specific to the mentions of an entity in a single document.
  • The "significance" and "frequency" fields are the maximum values occurring in the most relevant subset of matching results (normally the top 1000).
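
The three differences above can be illustrated with a small client-side sketch. It is not the server's aggregation code: it assumes each document carries its entities in an "entities" array (a name not confirmed here for document objects), and that the documents passed in are already the most relevant subset. The function name aggregate_entities is hypothetical.

```python
def aggregate_entities(documents):
    """Merge per-document entity objects into one aggregated object per "index".

    Assumption: each document is a dict with an "entities" list.
    """
    per_document_only = ("actual_name", "relevance", "sentiment")
    aggregated = {}
    for doc in documents:
        for entity in doc.get("entities", []):
            key = entity["index"]
            if key not in aggregated:
                # First sighting: copy everything except per-document-only fields.
                aggregated[key] = {
                    k: v for k, v in entity.items() if k not in per_document_only
                }
                continue
            merged = aggregated[key]
            # "significance" and "frequency" keep the maximum value seen.
            for field in ("significance", "frequency"):
                if field in entity:
                    merged[field] = max(merged.get(field, entity[field]), entity[field])
    return list(aggregated.values())
```

Passing two documents that both mention "england/country" would yield a single object with no "actual_name", "relevance" or "sentiment", and with the larger of the two "frequency" and "significance" values.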

See examples below and the following diagram that helps to clarify the distinction.

 

Entity object format
{
     // Basic metadata:

     "index": string, // the entity "primary key" within Infinit.e, of the form disambiguated_name.toLowerCase() + "/" + type.toLowerCase()
     "disambiguated_name": string, // for a given "type", this is (aside from case) a unique identifier for the entity
     "actual_name": string, // the most common string for which the entity was seen in the document
     "type": string, // The entity type (see below)
     "dimension": string, // One of "Who" (people, organizations), "Where" (places), or "What" (everything else)

     // Statistics:

     // Statistics - per document
     "relevance": number, // A value between 0 and 1, indicating the entity extraction engine's "opinion" on the entity's relevance within the document
     "frequency": integer, // The number of times the entity occurs in the document
     "sentiment": number, // 0-1 (optional), a value between -1.0 and 1.0: the entity extraction engine's "opinion" on whether the document refers to the entity approvingly (positive, up to 1.0) or disapprovingly (negative, down to -1.0)

     // Statistics - global
     "totalfrequency": long, // The number of times the entity occurs in all documents in the Infinit.e database (currently across all communities, see below)
     "doccount": long, // The number of documents in which the entity occurs in the Infinit.e database (currently across all communities, see below)

     // Statistics - per query, global
     "datasetSignificance": number, // The (approximate) significance of the entity aggregated across all matching documents (see below for link to scoring algorithms)
     "queryCoverage": number, // The (approximate) % of all matching documents in which the entity appears
     "averageFreq": number, // The (approximate) average frequency (including documents in which the entity doesn't appear) across all documents that match the query
     "positiveSentiment": number, // The sum of the positive sentiment counts across the matching documents
     "negativeSentiment": number, // The sum of the negative sentiment counts across the matching documents
     "sentimentCount": long, // The total number of sentiment counts (positive or negative) across the matching documents

     // Statistics - per query, per document
     "significance": number, // The significance of the entity in this document (see below for link to scoring algorithms)

     // Other enrichment:

     "geotag": { // 0-1, only if entity has been geotagged
          "lat": number, // (floating point)
          "lon": number // (floating point)
     },
     "ontology_type": string, // 0-1, only if entity has been geotagged - an OpenCyc type mapped from the "type" field, see below under discussion about types

     "linkdata": [ "string" ] // 0+, A list of useful links relating to the entity (eg Wikipedia entries)
}
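
As a quick illustration of the "index" rule in the format above (lower-cased "disambiguated_name", a "/", then lower-cased "type"), here is a minimal sketch. The helper name entity_index is hypothetical, and Python's str.lower() is assumed to behave like the toLowerCase() calls in the comment for the names involved.

```python
def entity_index(disambiguated_name, entity_type):
    """Build the entity "primary key" from its disambiguated name and type."""
    return disambiguated_name.lower() + "/" + entity_type.lower()


# Values taken from the OpenCalais example further down this page.
print(entity_index("ATHEROS COMMUNICATIONS, INC.", "Company"))
```

This reproduces the "index" values shown in the examples, for instance "atheros communications, inc./company".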

Field Guide

Type

The set of values permitted by the "type" field depends on how the entity was extracted:

  • "Commercial" third party entity extractors have a fixed set of types they generate, for example OpenCalais or AlchemyAPI.
  • Many other entity extractors (NetOwl, ModusOperandi) are customizable, allowing deployers to add their own entity types.
  • Similarly, using Manual entities, new entity types can easily be added.
    • In general, where custom entities are created using Manual entities, source developers should preferably take entity types from the OpenCyc repository. There are plans to use OpenCyc more formally in the future.

Ontology Type

Although, as noted above, there are plans to integrate OpenCyc fully into Infinit.e, currently it is only used formally in one place: the "ontology_type" field that accompanies geotag fields. The purpose of this field is to map different geotagged entity types into a single hierarchy that can be interpreted both internally (by searches) and externally (by visualization widgets and other follow-on analytics).

Ontological type is discussed further in the Geo JSON format section.

Scoring

The scoring algorithms used to generate the significance, relevance, and aggregate scores for documents are discussed here.

Significance is a % (0 indicating there is no correlation between the entity appearing in the document and the document matching the query, 100 indicating the entity only appears in matching documents). 

The statistics and scoring for entities that span multiple communities follow these rules:

  • Only entity instances from documents in communities over which a search is run (and hence to which a user belongs) are counted (for "doccount", "totalfrequency", etc. and the derived fields such as "significance"). Therefore there is no leakage of either numeric or textual data across community boundaries except when desired, e.g. where a user belongs to both communities and is searching across both.
  • Occasionally, for implementation reasons, statistics for a community will not be available: e.g. when no instances of a particular entity matched in documents from some community. In these cases, the statistics returned will be estimates.

Note that sentiment is currently only available when AlchemyAPI is used for feature extraction.

Geo Tag

Entity geo-tags are intended to be used to identify the permanent location of an entity; associations' geo-tags should be used to indicate the transient location of an entity. This may not always be the case however (and in fact nothing internal prevents entities from having different geo-tags in each document).
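
Because nothing prevents an entity from carrying a different geo-tag in each document, a client that relies on a single "permanent" location may want to check for this. The sketch below is one way to do so; it assumes documents expose their entities in an "entities" array and that entities are keyed by "index", and both function names are hypothetical.

```python
from collections import defaultdict


def geotags_by_entity(documents):
    """Collect the distinct (lat, lon) pairs seen for each entity index."""
    seen = defaultdict(set)
    for doc in documents:
        for entity in doc.get("entities", []):
            geotag = entity.get("geotag")  # optional: only present if geotagged
            if geotag:
                seen[entity["index"]].add((geotag["lat"], geotag["lon"]))
    return dict(seen)


def entities_with_varying_geotags(documents):
    """Return the indexes of entities whose geo-tag differs between documents."""
    tags = geotags_by_entity(documents)
    return sorted(index for index, points in tags.items() if len(points) > 1)
```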

linkdata

The "linkdata" field is an array of HTTP links from The Linked Data Project that provide additional information about common entities, where that information is publicly available on the Internet. Currently "linkdata" is only populated when AlchemyAPI or OpenCalais is used as the entity extractor:

  • OpenCalais links to a single web-page which then links to different resources such as Wikipedia, CIA Factbook, etc.
  • AlchemyAPI generates a number of different links, to its different available resources (basically the same set as OpenCalais).
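
Since the number and targets of the links depend on the extractor, a consumer may find it convenient to group "linkdata" entries by host before using them. The following is a minimal sketch using only the Python standard library; the function name links_by_host is hypothetical.

```python
from urllib.parse import urlparse


def links_by_host(entity):
    """Group an entity's "linkdata" URLs by host name."""
    grouped = {}
    for link in entity.get("linkdata", []):  # "linkdata" may be absent
        host = urlparse(link).netloc
        grouped.setdefault(host, []).append(link)
    return grouped


# Links taken from the AlchemyAPI "England" example on this page.
england = {
    "linkdata": [
        "http://dbpedia.org/resource/England",
        "http://umbel.org/umbel/ne/wikipedia/England",
        "http://mpii.de/yago/resource/England",
    ]
}
print(sorted(links_by_host(england)))
```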

Examples

Examples - as sub-object of document

Entity example, generated by OpenCalais
{
    "actual_name": "Atheros Communications Inc.",
    "dimension": "Who",
    "disambiguated_name": "ATHEROS COMMUNICATIONS, INC.",
    "doccount": 2,
    "frequency": 2,
    "index": "atheros communications, inc./company",
    "linkdata": [
        "http://d.opencalais.com/er/company/ralg-tr1r/e704df00-837e-3722-9c85-8537d37871d7"
    ],
    "relevance": 0.326,
    "totalfrequency": 3,
    "type": "Company",
    "significance": 0.5527277770637631,
    "datasetSignificance": 0.5527277770637631,
    "queryCoverage": 0.004475719993688294,
    "averageFreq": 0.0023094688221709007
}
Entity example, generated by AlchemyAPI
{
    "actual_name": "England",
    "dimension": "Where",
    "disambiguated_name": "England",
    "doccount": 32,
    "frequency": 57,
    "index": "england/country",
    "linkdata": [
        "http://dbpedia.org/resource/England",
        "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000014381",
        "http://sw.opencyc.org/concept/Mx4rvViWaZwpEbGdrcN5Y29ycA",
        "http://umbel.org/umbel/ne/wikipedia/England",
        "http://mpii.de/yago/resource/England"
    ],
    "relevance": 0.810086,
    "sentiment": 0.0143614,
    "totalfrequency": 148,
    "type": "Country",
    "significance": 31.537547455550165,
    "datasetSignificance": 19.636774116648777,
    "queryCoverage": 6.067441361627046,
    "averageFreq": 0.8454545454545455,
    "positiveSentiment": 0.19462859999999998,
    "negativeSentiment": -0.43113379999999996,
    "sentimentCount": 9
},
{
    "actual_name": "Cambridge",
    "dimension": "Where",
    "disambiguated_name": "Cambridge",
    "doccount": 274,
    "frequency": 1,
    "gazateer_index": "cambridge/city",
    "geotag": {
        "lat": 52.20805555555555,
        "lon": 0.1225
    },
    "ontology_type": "city",
    "linkdata": [
        "http://dbpedia.org/resource/Cambridge",
        "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000049d17",
        "http://mpii.de/yago/resource/Cambridge"
    ],
    "relevance": 0.223275,
    "sentiment": -0.285151,
    "totalfrequency": 325,
    "type": "City",
    "significance": 0.3884393421493454,
    "datasetSignificance": 1.1061519850829167,
    "queryCoverage": 1.5071120430337133,
    "averageFreq": 0.01818181818181818,
    "positiveSentiment": 0,
    "negativeSentiment": -0.285151,
    "sentimentCount": 2
}
Entity example, as generated from XML
{
    "actual_name": "Indiscriminate/Incidental, Civilian, Adult from Afghanistan",
    "dimension": "Who",
    "disambiguated_name": "Indiscriminate/Incidental, Civilian, Adult from Afghanistan",
    "doccount": 548,
    "frequency": 41,
    "index": "indiscriminate/incidental, civilian, adult from afghanistan/victimtype",
    "totalfrequency": 2838,
    "type": "VictimType",
    "significance": 9.419613148150061,
    "datasetSignificance": 6.211531263035471,
    "queryCoverage": 0.9106727792752375,
    "averageFreq": 0.131
}

Examples - as standalone aggregation

Entity example (aggregation)
{
	"entities": [
		{
			"index": "linkedin/company",
			"datasetSignificance": 49.24796071489491,
			"sentimentCount": 357,
			"type": "Company",
			"sentiment": 0.155566,
			"totalfrequency": 954,
			"queryCoverage": 99.44289693593316,
			"doccount": 242,
			"dimension": "Who",
			"frequency": 13,
			"negativeSentiment": -0.7647487,
			"positiveSentiment": 51.84659600000009,
			"averageFreq": 4.6155988857938715,
			"significance": 61.000762242212744,
			"actual_name": "LinkedIn",
			"disambiguated_name": "LinkedIn",
			"linkdata": [
				"http://dbpedia.org/resource/LinkedIn",
				"http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000003d3af7",
				"http://umbel.org/umbel/ne/wikipedia/LinkedIn",
				"http://mpii.de/yago/resource/LinkedIn"
			]
		},
		// etc
	],
	//etc
}
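
A client consuming the aggregated "entities" array will often want to rank it. The sketch below sorts entities by "datasetSignificance"; the function name top_entities and the trimmed reply are illustrative only (the second entity is made up for the demonstration). Note that the "// etc" comments in the example above are not valid JSON, so they are omitted from the text being parsed.

```python
import json

# A trimmed, hypothetical reply in the shape shown above.
reply_text = """
{
    "entities": [
        {"index": "example/company", "datasetSignificance": 12.5, "doccount": 10},
        {"index": "linkedin/company", "datasetSignificance": 49.24796071489491, "doccount": 242}
    ]
}
"""


def top_entities(reply, limit=10):
    """Return up to `limit` aggregated entities, highest datasetSignificance first."""
    entities = reply.get("entities", [])
    ranked = sorted(
        entities,
        key=lambda entity: entity.get("datasetSignificance", 0.0),
        reverse=True,
    )
    return ranked[:limit]


reply = json.loads(reply_text)
for entity in top_entities(reply, limit=5):
    print(entity["index"], entity["datasetSignificance"])
```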