Entity JSON format
{
    // Basic metadata:
    "index": string, // the entity "primary key" within Infinit.e, of the form disambiguated_name.toLowerCase() + "/" + type.toLowerCase()
    "disambiguated_name": string, // for a given "type", this is (aside from case) a unique identifier for the entity
    "actual_name": string, // the most common string for which the entity was seen in the document
    "type": string, // The entity type (see below)
    "dimension": string, // One of "Who" (people, organizations), "Where" (places), or "What" (everything else)

    // Statistics:
    // Statistics - per document
    "relevance": number, // A value between 0 and 1, indicating the entity extraction engine's "opinion" on the entity's relevance within the document
    "frequency": integer, // The number of times the entity occurs in the document
    "sentiment": number, // 0-1, the entity extraction engine's "opinion" on whether the document refers to the entity approvingly (positive, <= 1.0) or disapprovingly (negative, >= -1.0)

    // Statistics - global
    "totalfrequency": long, // The number of times the entity occurs in all documents in the Infinit.e database (currently across all communities, see below)
    "doccount": long, // The number of documents in which the entity occurs in the Infinit.e database (currently across all communities, see below)

    // Statistics - per query, global
    "datasetSignificance": number, // The (approximate) significance of the entity aggregated across all matching documents (see below for link to scoring algorithms)
    "queryCoverage": number, // The (approximate) % of all matching documents in which the entity appears
    "averageFreq": number, // The (approximate) average frequency (including documents in which the entity doesn't appear) across all documents that match the query
    "positiveSentiment": number, // The sum of the positive sentiment counts across the matching documents
    "negativeSentiment": number, // The sum of the negative sentiment counts across the matching documents
    "sentimentCount": long, // The total number of sentiment counts (positive or negative) across the matching documents

    // Statistics - per query, per document
    "significance": number, // The significance of the entity in this document (see below for link to scoring algorithms)

    // Other enrichment:
    "geotag": { // 0-1, only if entity has been geotagged
        "lat": number, // (floating point)
        "lon": number // (floating point)
    },
    "ontology_type": string, // 0-1, only if entity has been geotagged - an OpenCyc type mapped from the "type" field, see below under discussion about types
    "linkdata": [ "string" ] // 0+, A list of useful links relating to the entity (eg Wikipedia entries)
}
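The rule given for the "index" field (lower-cased disambiguated name, a slash, lower-cased type) can be sketched in a few lines of Python. This is only an illustration: the function name entity_index is hypothetical, and Python's str.lower() may differ from the platform's toLowerCase() for some non-ASCII characters.

```python
def entity_index(disambiguated_name: str, entity_type: str) -> str:
    """Build an entity "index" key: lower-cased name + "/" + lower-cased type."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


# Values taken from the first example further down this page.
print(entity_index("ATHEROS COMMUNICATIONS, INC.", "Company"))
# atheros communications, inc./company
```

Because the key is built from lower-cased values, two entities of the same type whose disambiguated names differ only in case map to the same index.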
Field Guide
Type
The set of values permitted by the "type" field depends on how the entity was extracted:
- "Commercial" third-party entity extractors, for example OpenCalais or AlchemyAPI, generate a fixed set of types.
- Many other entity extractors (NetOwl, ModusOperandi) are customizable, allowing deployers to add their own entity types.
- Similarly, using Manual entities, new entity types can easily be added.
- In general, where custom entities are being created using Manual entities, source developers should preferably take entity types from the OpenCyc repository. There are plans to use OpenCyc more formally in the future.
Ontology Type
Although, as noted above, there are plans to integrate OpenCyc fully into Infinit.e, currently it is only used formally in one place: the "ontology_type" field that accompanies geotag fields. The purpose of this field is to map from different geotagged entity types into a single hierarchy that can be interpreted both internally (by searches) and externally (by visualization widgets and other follow-on analytics).
Ontological type is discussed further in the Geo JSON format section.
Scoring
The scoring algorithms used to generate the significance, relevance, and aggregate scores for documents are discussed here.
Significance is a % (0 indicating there is no correlation between the entity appearing in the document and the document matching the query, 100 indicating the entity only appears in matching documents).
The statistics and scoring for entities that span multiple communities follow these rules:
- Only entity instances from documents in communities over which a search is run (and hence to which a user belongs) are counted (for "doccount", "totalfrequency", etc, and the derived fields "significance" etc). Therefore there is no leakage of either numeric or textual data across community boundaries except when desired, eg where a user belongs to both and is searching across both.
- Occasionally, for implementation reasons, statistics for a community will not be available: eg no instances of a particular entity matched in documents from some community. In these cases, the statistics returned will be estimates.
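The first rule above (count only entity instances from documents in the searched communities) can be illustrated with a minimal sketch. It is not Infinit.e's implementation: the document layout (a "community" key and an "entities" map from entity index to per-document frequency) and the function name scoped_counts are assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple


def scoped_counts(
    docs: Iterable[Dict],
    entity_index: str,
    searched_communities: Set[str],
) -> Tuple[int, int]:
    """Return (doccount, totalfrequency) for one entity, restricted to
    documents that belong to the communities being searched."""
    doccount = 0
    totalfrequency = 0
    for doc in docs:
        # Documents outside the searched communities never contribute.
        if doc["community"] not in searched_communities:
            continue
        freq = doc["entities"].get(entity_index, 0)
        if freq > 0:
            doccount += 1
            totalfrequency += freq
    return doccount, totalfrequency


# Hypothetical documents spread over two communities, "A" and "B".
docs = [
    {"community": "A", "entities": {"england/country": 3}},
    {"community": "B", "entities": {"england/country": 5}},
    {"community": "A", "entities": {"cambridge/city": 1}},
]
print(scoped_counts(docs, "england/country", {"A"}))       # (1, 3)
print(scoped_counts(docs, "england/country", {"A", "B"}))  # (2, 8)
```

A user searching only community "A" sees no contribution from the document in "B"; the larger counts appear only when both communities are searched.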
Note that sentiment is currently only available when AlchemyAPI is used for feature extraction.
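As a rough sketch of how the per-query fields relate to per-document values, the Python below derives "queryCoverage", "averageFreq", and the three sentiment aggregates from a list of matching documents. The input layout and the function name aggregate_entity are hypothetical, the real values are described as approximate, and the sketch assumes "sentimentCount" counts the matching documents that carry a sentiment value for the entity, which this page does not state explicitly.

```python
from typing import Dict, Optional, Sequence


def aggregate_entity(per_doc: Sequence[Optional[Dict[str, float]]]) -> Dict[str, float]:
    """Aggregate one entity over the documents matching a query.

    Each item is None when the entity is absent from that document, otherwise
    a dict with "frequency" and, optionally, "sentiment" (hypothetical layout).
    """
    matching = len(per_doc)
    present = [d for d in per_doc if d is not None]
    sentiments = [d["sentiment"] for d in present if "sentiment" in d]
    if matching == 0:
        coverage = average = 0.0
    else:
        # Percentage of matching documents in which the entity appears.
        coverage = 100.0 * len(present) / matching
        # Average over all matching documents, including those without the entity.
        average = sum(d["frequency"] for d in present) / matching
    return {
        "queryCoverage": coverage,
        "averageFreq": average,
        "positiveSentiment": sum(s for s in sentiments if s > 0),
        "negativeSentiment": sum(s for s in sentiments if s < 0),
        # Assumption: one count per document that has a sentiment value.
        "sentimentCount": len(sentiments),
    }


result = aggregate_entity([
    {"frequency": 2, "sentiment": 0.5},
    None,  # a matching document in which the entity does not appear
    {"frequency": 1, "sentiment": -0.25},
    {"frequency": 1},  # no sentiment, eg extracted without AlchemyAPI
])
print(result)
```

Here the entity appears in 3 of 4 matching documents with a total frequency of 4, so the sketch reports a coverage of 75.0 and an average frequency of 1.0.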
Geo Tag
Entity geo-tags are intended to identify the permanent location of an entity; associations' geo-tags should be used to indicate the transient location of an entity. This may not always be the case, however (in fact, nothing internal prevents entities from having different geo-tags in each document).
linkdata
The "linkdata" field is an array of HTTP links from The Linked Data Project that provide additional information about common entities, where information is publicly available on the Internet. Currently "linkdata" is only populated when AlchemyAPI or OpenCalais is used as the entity extractor:
- OpenCalais links to a single web-page which then links to different resources such as Wikipedia, CIA Factbook, etc.
- AlchemyAPI generates a number of different links, to its different available resources (basically the same set as OpenCalais).
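Since "linkdata" can hold several links to different resources, a consumer may want to pick out the link for one particular site. A minimal sketch, assuming the caller matches on the URL's host name; the helper first_link_for_host is hypothetical and not part of any Infinit.e API:

```python
from typing import List, Optional
from urllib.parse import urlparse


def first_link_for_host(linkdata: List[str], host: str) -> Optional[str]:
    """Return the first link whose host name equals `host`, or None."""
    for url in linkdata:
        if urlparse(url).hostname == host:
            return url
    return None


# Links copied from the "England" example on this page.
links = [
    "http://dbpedia.org/resource/England",
    "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000014381",
    "http://sw.opencyc.org/concept/Mx4rvViWaZwpEbGdrcN5Y29ycA",
    "http://umbel.org/umbel/ne/wikipedia/England",
    "http://mpii.de/yago/resource/England",
]
print(first_link_for_host(links, "dbpedia.org"))
# http://dbpedia.org/resource/England
```

The field is optional (0+ entries), so callers should expect None when no link matches.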
Examples
Examples - as sub-object of document
{
    actual_name: "Atheros Communications Inc."
    dimension: "Who"
    disambiguated_name: "ATHEROS COMMUNICATIONS, INC."
    doccount: 2
    frequency: 2
    index: "atheros communications, inc./company"
    linkdata: [
        "http://d.opencalais.com/er/company/ralg-tr1r/e704df00-837e-3722-9c85-8537d37871d7"
    ]
    relevance: 0.326
    totalfrequency: 3
    type: "Company"
    significance: 0.5527277770637631
    datasetSignificance: 0.5527277770637631
    queryCoverage: 0.004475719993688294
    averageFreq: 0.0023094688221709007
}
{
    actual_name: "England"
    dimension: "Where"
    disambiguated_name: "England"
    doccount: 32
    frequency: 57
    index: "england/country"
    linkdata: [
        "http://dbpedia.org/resource/England"
        "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000014381"
        "http://sw.opencyc.org/concept/Mx4rvViWaZwpEbGdrcN5Y29ycA"
        "http://umbel.org/umbel/ne/wikipedia/England"
        "http://mpii.de/yago/resource/England"
    ]
    relevance: 0.810086
    sentiment: 0.0143614
    totalfrequency: 148
    type: "Country"
    significance: 31.537547455550165
    datasetSignificance: 19.636774116648777
    queryCoverage: 6.067441361627046
    averageFreq: 0.8454545454545455
    positiveSentiment: 0.19462859999999998
    negativeSentiment: -0.43113379999999996
    sentimentCount: 9
},
{
    actual_name: "Cambridge"
    dimension: "Where"
    disambiguated_name: "Cambridge"
    doccount: 274
    frequency: 1
    gazateer_index: "cambridge/city"
    geotag: {
        lat: 52.20805555555555
        lon: 0.1225
    }
    ontology_type: "city"
    linkdata: [
        "http://dbpedia.org/resource/Cambridge"
        "http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000049d17"
        "http://mpii.de/yago/resource/Cambridge"
    ]
    relevance: 0.223275
    sentiment: -0.285151
    totalfrequency: 325
    type: "City"
    significance: 0.3884393421493454
    datasetSignificance: 1.1061519850829167
    queryCoverage: 1.5071120430337133
    averageFreq: 0.01818181818181818
    positiveSentiment: 0
    negativeSentiment: -0.285151
    sentimentCount: 2
}
{
    actual_name: "Indiscriminate/Incidental, Civilian, Adult from Afghanistan"
    dimension: "Who"
    disambiguated_name: "Indiscriminate/Incidental, Civilian, Adult from Afghanistan"
    doccount: 548
    frequency: 41
    index: "indiscriminate/incidental, civilian, adult from afghanistan/victimtype"
    totalfrequency: 2838
    type: "VictimType"
    significance: 9.419613148150061
    datasetSignificance: 6.211531263035471
    queryCoverage: 0.9106727792752375
    averageFreq: 0.131
}
Examples - as standalone aggregation
{
    "entities": [
        {
            "index": "linkedin/company",
            "datasetSignificance": 49.24796071489491,
            "sentimentCount": 357,
            "type": "Company",
            "sentiment": 0.155566,
            "totalfrequency": 954,
            "queryCoverage": 99.44289693593316,
            "doccount": 242,
            "dimension": "Who",
            "frequency": 13,
            "negativeSentiment": -0.7647487,
            "positiveSentiment": 51.84659600000009,
            "averageFreq": 4.6155988857938715,
            "significance": 61.000762242212744,
            "actual_name": "LinkedIn",
            "disambiguated_name": "LinkedIn",
            "linkdata": [
                "http://dbpedia.org/resource/LinkedIn",
                "http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000003d3af7",
                "http://umbel.org/umbel/ne/wikipedia/LinkedIn",
                "http://mpii.de/yago/resource/LinkedIn"
            ]
        },
        // etc
    ],
    //etc
}