Entities

Overview

Entities are the "who", "what" and "where" of the source data ingested into Community Edition (CE). Each entity comprises basic metadata describing the information architecture of the source documents, together with statistics and other enrichment.

In the JSON format, each Entities object has a specific "dimension" (who, where or what) and an Entity Type. Entity Types are described below.

The Entities format also includes statistics generated from CE's scoring algorithm, such as relevance, frequency and sentiment calculations. For more information, see the Scoring section.
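As a rough illustration of the fields mentioned above, the sketch below builds one document-level entity as a Python dictionary and checks its dimension. The "type", "dimension", "relevance", "frequency" and "sentiment" names come from this page. The "name" key, the lowercase spelling of the dimension values and all of the numbers are assumptions made for the example, not the definitive CE schema.

```python
import json

# Dimensions named in the overview: who, what, where.
# The exact spelling/capitalisation used by CE is an assumption here.
VALID_DIMENSIONS = {"who", "what", "where"}


def make_entity(name, entity_type, dimension, **stats):
    """Build an illustrative entity dict, rejecting unknown dimensions."""
    if dimension.lower() not in VALID_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension!r}")
    # "name" is a hypothetical key used only for this sketch.
    entity = {"name": name, "type": entity_type, "dimension": dimension.lower()}
    entity.update(stats)  # e.g. relevance, frequency, sentiment
    return entity


example = make_entity(
    "Barack Obama", "Person", "who",
    relevance=0.8, frequency=3, sentiment=0.1,  # made-up values
)
print(json.dumps(example, indent=2))
```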

Entity Types

The set of values permitted by the "type" field depends on how the entity was extracted: commercial third-party extractors have fixed types, but some other entity extractors allow users to set custom entity types.

Some of the common entity types on the CE platform are defined below.  Sources are also indicated.

| Entity Type | Example | Description | Source |
|---|---|---|---|
| Topic | Business, Sports, Politics, Health, War, Law, Crime, Automotive, Investing, Weather, "Software and Internet", Economics, Food, Science, Aviation, Education, "Video Games", Technology, Labor, Art, Travel etc. | A high-level topic inferred from the content by Natural Language Processing. | Salience |
| Gender | Male, Mostly_Male, Female, Mostly_Female, Unisex | Obtained from the Datasift "gender" augmentation, an estimate of the gender of the document's author. | Datasift |
| FacebookUser | "Mark Zuckerberg", IKANOW, "Facebook Birdwatching Group" | For Facebook documents, any of the people/companies/groups with Facebook accounts mentioned in a post (including the author). | Datasift |
| TwitterUser | twitterHandle (i.e. without the leading '@') | For tweets, any of the people/companies/groups with Twitter accounts mentioned in a post (including the author). | Datasift |
| RedditUser | witty_handle_here | For reddit posts, the author's account name. | Datasift |
| Person | "John Stewart" | For any other document type (blogs, news, forums) the author is categorized as a Person. Note that the Person type is also used for names extracted from the content using NLP. | Datasift (or Salience) |
| Hashtag | iwanttotrend (i.e. without the leading '#') | Hashtags in tweets. | Datasift |
| City | "New York, NY, United States" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a city, then this type is used. | Datasift/Salience |
| Region | "Maryland, United States" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a state or similar administrative partition, then this type is used. | Datasift/Salience |
| Place | "White House", "Arizona", "US" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place cannot be geolocated by CE then this "catch all" is used. | Datasift/Salience |
| URL | http://www.ikanow.com/downloads | Links in Facebook posts and tweets. | Datasift |
| Person | "Barack Obama" | A name extracted from the content by Salience and believed to be the name of a person. | Salience (or Datasift) |
| Job Title | President, CEO | A job title extracted from the content using Natural Language Processing. | Salience |
| Company/Organization | Microsoft, UN | A name extracted from the content by Salience and believed to be the name of a company or organization. | |
| Quote | "Ask not for whom the bell tolls" | An unattributed quote extracted from the content. | Salience |
| Keyword | "american history", "domestic spying program" | A word or phrase from the content that is statistically significant to the meaning of the post. | Salience |

Aggregations

Entities exist in their own right as sub-objects of the Document JSON format, and also as aggregations, as part of the query output parameters. The Entities format is the same in both cases, apart from a few differences that relate to their role within aggregations.

When entities are aggregations, the only differences are as follows:

  • No "actual_name" field.
  • No "relevance" or "sentiment" statistics, since these are specific to the mentions of an entity in a single document.
  • The "significance" and "frequency" fields are the maximum values occurring in the most relevant subset of matching results (normally the top 1000).
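The three differences listed above can be sketched as a small transformation from document-level entities to aggregation-style entities. This is a minimal sketch, not CE's actual aggregation code: the "name" key used to group entities and the sample values are hypothetical, and the function assumes the documents passed in are already the most relevant subset of matching results.

```python
def aggregate_entities(documents):
    """Collapse per-document entities into aggregation-style entities."""
    # Fields that only make sense for mentions within a single document.
    per_document_only = {"actual_name", "relevance", "sentiment"}
    aggregated = {}
    for doc in documents:
        for ent in doc.get("entities", []):
            key = (ent["name"], ent["type"])  # "name" is a hypothetical key
            agg = aggregated.get(key)
            if agg is None:
                # First sighting: copy everything except per-document fields.
                aggregated[key] = {
                    k: v for k, v in ent.items() if k not in per_document_only
                }
            else:
                # Later sightings: keep the maximum of each statistic.
                for field in ("significance", "frequency"):
                    agg[field] = max(agg.get(field, 0), ent.get(field, 0))
    return list(aggregated.values())


# Made-up sample data: the same entity mentioned in two documents.
docs = [
    {"entities": [{"name": "Barack Obama", "actual_name": "Obama",
                   "type": "Person", "dimension": "who",
                   "relevance": 0.9, "sentiment": -0.2,
                   "significance": 12.5, "frequency": 3}]},
    {"entities": [{"name": "Barack Obama", "actual_name": "President Obama",
                   "type": "Person", "dimension": "who",
                   "relevance": 0.4, "sentiment": 0.1,
                   "significance": 30.0, "frequency": 1}]},
]
result = aggregate_entities(docs)
print(result)
```

With this sample input the single aggregated entity keeps "significance" 30.0 and "frequency" 3, and carries no "actual_name", "relevance" or "sentiment" fields.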

The following diagram exemplifies the difference between entities as sub-objects and as aggregations.

Aliasing

Aliasing enables you to specify duplicate entities and indicate which entity should be considered the "master" entity.  Duplicate entities can also be discarded.

Aliases are JSON objects that can be configured to determine the behaviour of alias sets across communities and the source data within them. The alias configuration objects are uploaded via the File Uploader and shared with the specific communities. The aliases then apply to queries that are run against data within those communities.
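To make the master/duplicate/discard idea concrete, here is a minimal sketch of how an alias set might be applied to a list of entity names. The alias configuration format is not shown in this section, so the "master", "duplicates" and "discard" keys below are hypothetical, chosen only to illustrate the behaviour described above.

```python
def apply_aliases(entity_names, alias_sets):
    """Map duplicates to their master entity and drop discarded entities."""
    lookup = {}        # duplicate name -> master name
    discarded = set()  # names to drop entirely
    for alias_set in alias_sets:
        for duplicate in alias_set.get("duplicates", []):
            lookup[duplicate] = alias_set["master"]
        discarded.update(alias_set.get("discard", []))

    resolved = []
    for name in entity_names:
        if name in discarded:
            continue
        master = lookup.get(name, name)  # unaliased names pass through
        if master not in resolved:
            resolved.append(master)
    return resolved


# Hypothetical alias set; the key names are assumptions, not the CE schema.
alias_sets = [
    {"master": "Barack Obama",
     "duplicates": ["Obama", "President Obama"],
     "discard": ["Barack"]},
]
names = apply_aliases(["Obama", "Barack Obama", "Barack", "Microsoft"], alias_sets)
print(names)  # ['Barack Obama', 'Microsoft']
```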

 
