Entities

Overview

Entities are the "who", "what" and "where" of the source data ingested into Community Edition (CE). An entity consists of basic metadata that describes the information architecture of the source documents, together with statistics and other enrichment.

In JSON, each entity is an object with a specific "dimension" (who, where or what) and an Entity Type. Entity Types are described below.

The Entities format also includes statistics generated by CE's scoring algorithm, such as relevance, frequency and sentiment calculations. For more information, see section Scoring.
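As a rough illustration of the shape described above, the sketch below builds an entity as a Python dict and runs a basic sanity check on it. The field names "type", "dimension", "actual_name", "relevance", "sentiment", "significance" and "frequency" are the ones named on this page; every value, the capitalization of the dimension values, and the `check_entity` helper are assumptions made for the example, not part of the CE API.

```python
import json

# Field names come from this page; all values are invented for illustration.
entity = {
    "actual_name": "Obama",
    "type": "Person",
    "dimension": "Who",       # assumed spelling of the who/what/where dimension
    "frequency": 3,
    "relevance": 0.82,
    "sentiment": -0.1,
    "significance": 41.7,
}

# Assumed set of permitted dimension values.
DIMENSIONS = {"Who", "What", "Where"}


def check_entity(e):
    """Hypothetical helper: return a list of problems found in an entity dict."""
    problems = []
    if e.get("dimension") not in DIMENSIONS:
        problems.append("dimension must be one of Who/What/Where")
    if not e.get("type"):
        problems.append("type is required")
    return problems


print(json.dumps(entity, indent=2))
print(check_entity(entity))
```

The check is deliberately minimal: it only covers the two fields this page says every entity carries.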

Entity Types

The set of values permitted by the "type" field depends on how the entity was extracted: commercial third-party extractors have fixed types, while some other entity extractors let users define custom entity types.

Some of the common entity types on the CE platform are defined below. Sources are also indicated.

Entity Type	Example	Description	Source
Topic	Business, Sports, Politics, Health, War, Law, Crime, Automotive, Investing, Weather, "Software and Internet", Economics, Food, Science, Aviation, Education, "Video Games", Technology, Labor, Art, Travel, etc.	A high-level topic inferred from the content using Natural Language Processing.	Salience
Gender	Male, Mostly_Male, Female, Mostly_Female, Unisex	Obtained from the Datasift "gender" augmentation, an estimate of the gender of the document's author.	Datasift
FacebookUser	"Mark Zuckerberg", IKANOW, "Facebook Birdwatching Group"	For Facebook documents, any of the people/companies/groups with Facebook accounts mentioned in a post (including the author).	Datasift
TwitterUser	twitterHandle (i.e. without the leading '@')	For tweets, any of the people/companies/groups with Twitter accounts mentioned in a post (including the author).	Datasift
RedditUser	witty_handle_here	For Reddit posts, the author's account name.	Datasift
Person	"John Stewart"	For any other document type (blogs, news, forums) the author is categorized as a Person. Note that the Person type is also used for names extracted from the content using NLP.	Datasift (or Salience)
Hashtag	iwanttotrend (i.e. without the leading '#')	Hashtags in tweets.	Datasift
City	"New York, NY, United States"	Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a city, then this type is used.	Datasift/Salience
Region	"Maryland, United States"	Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a state or similar administrative partition, then this type is used.	Datasift/Salience
Place	"White House", "Arizona", "US"	Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place cannot be geolocated by CE then this "catch all" is used.	Datasift/Salience
URL	http://www.ikanow.com/downloads	Links in Facebook posts and tweets.	Datasift
Person	"Barack Obama"	A name extracted from the content by Salience and believed to be the name of a person.	Salience (or Datasift)
Job Title	President, CEO	A job title extracted from the content using Natural Language Processing.	Salience
Company/Organization	Microsoft, UN	A name extracted from the content by Salience and believed to be the name of a company or organization.	Salience
Quote	"Ask not for whom the bell tolls"	An unattributed quote extracted from the content.	Salience
Keyword	"american history", "domestic spying program"	A word or phrase from the content that is statistically significant to the meaning of the post.	Salience
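The City, Region and Place rows in the table differ only in how precisely CE could geolocate the place. A minimal sketch of that rule follows; the `geo_level` argument and its values are hypothetical and are not a documented CE field.

```python
def location_entity_type(geo_level):
    """Pick the entity type for a location.

    `geo_level` is a hypothetical summary of the geolocation result:
    "city", "region" (a state or similar administrative partition),
    or None when the place could not be geolocated.
    """
    if geo_level == "city":
        return "City"
    if geo_level == "region":
        return "Region"
    # Anything that cannot be geolocated falls into the catch-all type.
    return "Place"


print(location_entity_type("city"))
print(location_entity_type(None))
```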

Aggregations

Entities appear in two places: as sub-objects of the Document JSON format, and as aggregations in the query output. The Entities format is nearly identical in both cases, apart from a few differences that follow from the role entities play within aggregations.

When entities appear as aggregations, the only differences are as follows:

No "actual_name" field.
No "relevance" or "sentiment" statistics, since these are specific to the mentions of an entity in a single document.
The "significance" and "frequency" fields are the maximum values occurring in the most relevant subset of matching results (normally the top 1000).
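The differences listed above can be sketched as a small reduction over per-document entities. This is an illustration only, not CE's implementation: the `name` key used to group mentions of the same entity, the `entities` list on each document, and the assumption that `documents` arrives sorted most-relevant-first are all hypothetical.

```python
# Fields that only make sense for a single document's mentions.
DOCUMENT_ONLY_FIELDS = ("actual_name", "relevance", "sentiment")


def aggregate_entities(documents, top_n=1000):
    """Collapse per-document entities into one aggregated record per entity.

    `documents` is assumed to be sorted most-relevant-first; each document is
    a dict with an "entities" list. Entities are grouped by a hypothetical
    "name" field plus "type".
    """
    aggregated = {}
    for doc in documents[:top_n]:
        for ent in doc.get("entities", []):
            key = (ent["name"], ent["type"])
            agg = aggregated.get(key)
            if agg is None:
                # First sighting: copy everything except document-only fields.
                aggregated[key] = {
                    k: v for k, v in ent.items() if k not in DOCUMENT_ONLY_FIELDS
                }
            else:
                # Later sightings: keep the maximum of each statistic.
                for field in ("significance", "frequency"):
                    agg[field] = max(agg.get(field, 0), ent.get(field, 0))
    return list(aggregated.values())


docs = [
    {"entities": [{"name": "barack obama", "actual_name": "Obama", "type": "Person",
                   "relevance": 0.9, "sentiment": 0.2,
                   "significance": 40.0, "frequency": 2}]},
    {"entities": [{"name": "barack obama", "actual_name": "Barack Obama", "type": "Person",
                   "relevance": 0.5, "sentiment": -0.1,
                   "significance": 25.0, "frequency": 5}]},
]
print(aggregate_entities(docs))
```

In this example the aggregated record keeps a significance of 40.0 from the first document and a frequency of 5 from the second, and carries no "actual_name", "relevance" or "sentiment".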

The following diagram exemplifies the difference between entities as sub-objects and as aggregations.

Aliasing

Aliasing enables you to specify duplicate entities and indicate which entity should be considered the "master" entity. Duplicate entities can also be discarded.

Aliases are JSON objects that determine the behaviour of alias sets across communities and the source data within them. The alias configuration objects are uploaded via the File Uploader and shared with specific communities. The aliases then apply to queries that are run against data within those communities.
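To make the master/duplicate/discard idea concrete, here is a minimal sketch of alias resolution. The structure of `ALIASES`, the "name/type" key format and the `apply_aliases` function are all hypothetical; this page does not specify the real alias configuration format, so consult the Alias Manager and File Uploader documentation for that.

```python
# Hypothetical alias set: NOT the real CE alias configuration format.
ALIASES = {
    # master entity -> duplicates that should be folded into it
    "master": {
        "barack obama/person": ["obama/person", "president obama/person"],
    },
    # entities to drop from results entirely
    "discard": ["he/person"],
}


def apply_aliases(entity_keys, aliases):
    """Map duplicates to their master entity and drop discarded entities."""
    lookup = {
        dup: master
        for master, dups in aliases["master"].items()
        for dup in dups
    }
    discard = set(aliases["discard"])
    resolved = []
    for key in entity_keys:
        if key in discard:
            continue
        master = lookup.get(key, key)  # unaliased entities map to themselves
        if master not in resolved:     # keep first-seen order, no repeats
            resolved.append(master)
    return resolved


print(apply_aliases(
    ["obama/person", "he/person", "barack obama/person", "microsoft/company"],
    ALIASES,
))
```

Applying the mapping at query time, rather than rewriting stored documents, matches the statement above that aliases apply to queries run against a community's data.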

Related Documentation:

Alias Manager (Enterprise)

File Uploader
