Entities
- Josh Liss
- andrew johnston
Overview
Entities are the "who", "what" and "where" of the source data ingested into Community Edition (CE). Each entity consists of basic metadata describing the information architecture of the source documents, together with statistics and other enrichment.
In JSON, each entity is an object with a specific "dimension" (who, what or where) and an Entity Type. Entity Types are described below.
The Entities format also includes statistics generated by CE's scoring algorithm, such as relevance, frequency and sentiment calculations. For more information, see section Scoring.
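As a rough illustration of the description above, the following Python sketch builds one entity as a dictionary and checks its dimension. The field names "type", "actual_name", "relevance", "sentiment", "frequency" and "significance" appear on this page; the "dimension" key, its capitalised values, all sample values and the `check_entity` helper are assumptions made for illustration, not the definitive CE format.

```python
import json

# Sample entity. Values are invented; the "dimension" key and its
# capitalised values ("Who", "What", "Where") are assumptions.
entity = {
    "actual_name": "Obama",
    "type": "Person",
    "dimension": "Who",
    "frequency": 3,
    "relevance": 0.82,
    "sentiment": -0.1,
    "significance": 41.5,
}

VALID_DIMENSIONS = {"Who", "What", "Where"}


def check_entity(e):
    """Return a list of problems found in an entity dict (empty if none).

    Hypothetical helper: it only checks the two properties this page
    says every entity has, a dimension and an entity type.
    """
    problems = []
    if e.get("dimension") not in VALID_DIMENSIONS:
        problems.append("dimension must be one of Who/What/Where")
    if not e.get("type"):
        problems.append("type is missing")
    return problems


if __name__ == "__main__":
    print(json.dumps(entity, indent=2))
    print(check_entity(entity))
```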
Entity Types
The set of values permitted by the "type" field depends on how the entity was extracted: commercial third-party extractors have fixed sets of types, whereas some other entity extractors let users define custom entity types.
Some of the common entity types on the CE platform are defined below, together with the source of each.
Entity Type | Example | Description | Source |
---|---|---|---|
Topic | Business, Sports, Politics, Health, War, Law, Crime, Automotive, Investing, Weather, "Software and Internet", Economics, Food, Science, Aviation, Education, "Video Games", Technology, Labor, Art, Travel, etc. | A high-level topic inferred from the content using Natural Language Processing (NLP). | Salience |
Gender | Male, Mostly_Male, Female, Mostly_Female, Unisex | Obtained from the Datasift "gender" augmentation, an estimate of the gender of the document's author. | Datasift |
FacebookUser | "Mark Zuckerberg", IKANOW, "Facebook Birdwatching Group" | For Facebook documents, any of the people/companies/groups with Facebook accounts mentioned in a post (including the author). | Datasift |
TwitterUser | twitterHandle (i.e. without the leading '@') | For tweets, any of the people/companies/groups with Twitter accounts mentioned in a post (including the author). | Datasift |
RedditUser | witty_handle_here | For Reddit posts, the author's account name. | Datasift |
Person | "John Stewart" | For any other document type (blogs, news, forums) the author is categorized as a Person. Note that the Person type is also used for names extracted from the content using NLP. | Datasift (or Salience) |
Hashtag | iwanttotrend (i.e. without the leading '#') | Hashtags in tweets. | Datasift |
City | "New York, NY, United States" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a city, then this type is used. | Datasift/Salience |
Region | "Maryland, United States" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place can be geolocated by CE to a state or similar administrative partition, then this type is used. | Datasift/Salience |
Place | "White House", "Arizona", "US" | Locations can be obtained in one of two ways: the registered location of the author (from Datasift), or places mentioned in the content (extracted using NLP). If the place cannot be geolocated by CE then this "catch all" is used. | Datasift/Salience |
URL | http://www.ikanow.com/downloads | Links in Facebook posts and tweets. | Datasift |
Person | "Barack Obama" | A name extracted from the content by Salience and believed to be the name of a person. | Salience (or Datasift) |
Job Title | President, CEO | A job title extracted from the content using Natural Language Processing. | Salience |
Company/Organization | Microsoft, UN | A name extracted from the content by Salience and believed to be the name of a company or organization. | Salience |
Quote | "Ask not for whom the bell tolls" | An unattributed quote extracted from the content. | Salience |
Keyword | "american history", "domestic spying program" | A word or phrase from the content that is statistically significant to the meaning of the post. | Salience |
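The table's rules for location types and social-media handles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `geo_level` argument and both function names are hypothetical, and the real geolocation logic in CE is not shown on this page.

```python
def location_type(geo_level):
    """Choose a location entity type from a geolocation result.

    geo_level is a hypothetical value: "city" if the place was geolocated
    to a city, "region" if it resolved to a state or similar administrative
    partition, and anything else (e.g. None) if it could not be geolocated.
    """
    if geo_level == "city":
        return "City"
    if geo_level == "region":
        return "Region"
    return "Place"  # the "catch all" type


def strip_prefix(text, prefix):
    """Remove one leading prefix such as '@' or '#', if present."""
    return text[len(prefix):] if text.startswith(prefix) else text


if __name__ == "__main__":
    print(location_type("city"))
    print(location_type(None))
    print(strip_prefix("#iwanttotrend", "#"))
```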
Aggregations
Entities exist in their own right as sub-objects of the Document JSON format, and also as aggregations within the query output. The Entities format is virtually the same in both cases.
The only differences when they are aggregations are as follows:
- No "actual_name" field.
- No "relevance" or "sentiment" statistics, since these are specific to the mentions of an entity in a single document.
- The "significance" and "frequency" fields are the maximum values occurring in the most relevant subset of matching results (normally the top 1000).
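The differences listed above can be sketched in Python as follows. This is a minimal illustration, not CE's implementation: it assumes the documents are already ordered from most to least relevant, that each carries an "entities" list, and that entities are identified by a hypothetical "name" key plus their "type".

```python
# Fields that only make sense for mentions within a single document.
PER_DOCUMENT_ONLY = ("actual_name", "relevance", "sentiment")


def aggregate_entities(documents, top_n=1000):
    """Build aggregated entities from the top_n most relevant documents.

    Assumes `documents` is sorted most relevant first and that every
    entity has "significance" and "frequency" values.
    """
    aggregated = {}
    for doc in documents[:top_n]:
        for ent in doc.get("entities", []):
            key = (ent["name"], ent["type"])  # "name" is a hypothetical key
            if key not in aggregated:
                # Copy the entity without its per-document fields.
                aggregated[key] = {
                    k: v for k, v in ent.items() if k not in PER_DOCUMENT_ONLY
                }
            else:
                agg = aggregated[key]
                # Keep the maximum value seen across the subset.
                for stat in ("significance", "frequency"):
                    agg[stat] = max(agg[stat], ent[stat])
    return list(aggregated.values())


# Invented sample data: the same entity mentioned in two documents.
sample_documents = [
    {"entities": [{"name": "barack obama", "actual_name": "Obama",
                   "type": "Person", "relevance": 0.9, "sentiment": 0.2,
                   "significance": 10.0, "frequency": 2}]},
    {"entities": [{"name": "barack obama", "actual_name": "Barack Obama",
                   "type": "Person", "relevance": 0.5, "sentiment": -0.1,
                   "significance": 25.0, "frequency": 1}]},
]

if __name__ == "__main__":
    print(aggregate_entities(sample_documents))
```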
The following diagram exemplifies the difference between entities as sub-objects and as aggregations.
Aliasing
Aliasing enables you to specify duplicate entities and indicate which entity should be considered the "master" entity. Duplicate entities can also be discarded.
Aliases are JSON objects that can be configured to determine the behaviour of alias sets across Communities and the source data within them. The alias configuration objects are uploaded via the File Uploader and shared with the specific Communities. The aliases then apply to queries that are run against data within those Communities.
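To make the idea of master and duplicate entities concrete, here is a small Python sketch. The actual alias JSON format is not shown in this section, so the `ALIASES` mapping, the `DISCARD` set, the "name/type" key convention and the `apply_aliases` function are all hypothetical stand-ins chosen for illustration.

```python
# Hypothetical alias configuration: master entity -> its duplicates.
ALIASES = {
    "barack obama/person": ["obama/person", "president obama/person"],
}

# Hypothetical set of entities to discard entirely.
DISCARD = {"rt/keyword"}


def apply_aliases(entity_keys, aliases, discard):
    """Resolve duplicates to their master entity and drop discarded ones.

    Returns the resolved keys in first-seen order, without repeats.
    """
    # Invert the configuration: duplicate -> master.
    lookup = {dup: master for master, dups in aliases.items() for dup in dups}
    resolved = []
    for key in entity_keys:
        if key in discard:
            continue
        master = lookup.get(key, key)  # unaliased entities map to themselves
        if master not in resolved:
            resolved.append(master)
    return resolved


if __name__ == "__main__":
    found = ["obama/person", "rt/keyword",
             "barack obama/person", "microsoft/company"]
    print(apply_aliases(found, ALIASES, DISCARD))
```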
In this section: