Specifying Entities
What is an Entity?
Entities are the who, what, and where contained within a record (i.e. people, things, and places).
Basic Entity Specification
The following code demonstrates how to specify a basic Where (Place) entity (Note: The sample entity specification and sample entity output below are extracted from a sample MySQL Database Source.):
{
    //...
    "entities" : [
        {
            "disambiguated_name" : "$metadata.blocksiteaddress,$metadata.city,$metadata.state",
            "dimension" : "Where",
            "type" : "Place",
            "geotag" : {
                "lat" : "$metadata.latitude",
                "lon" : "$metadata.longitude"
            },
            "ontology_type": "point"
        }
    ],
    //...
}
In the basic example above, the following fields have been specified:
- disambiguated_name: For a given "type", this is (aside from case) a unique identifier for the entity.
- dimension: One of "Who" (people, organizations), "Where" (places), or "What" (everything else).
- type: The entity type, e.g. if dimension is equal to What, type might be equal to Automobile, Airplane, Ship, etc.
- geotag
  - lat: String containing a floating point representation of latitude.
  - lon: String containing a floating point representation of longitude.
Data is extracted from the source using the $ operator. For example, in the case of the geotag.lat field the data is extracted from the metadata.latitude field using the following definition:
"lat" : "$metadata.latitude"
The $ operator can also be used to combine multiple source data fields into more complex literal strings, as in the following specification of the document's description field:
"description" : "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress"
This is converted into the following literal string:
"description" : "Mar 10, 2011 12:00:00 AM: ROBBERY,GUN was reported at: 1100 B/O 1ST ST NW"
Note: More advanced data transformations can be performed within the Structured Analysis Harvester using JavaScript as documented here: Transforming data with JavaScript.
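The $ substitution described above can be sketched in a few lines of JavaScript. This is an illustrative model only, not the harvester's implementation: the function name expandTemplate, the sample record, and the choice to leave unresolved fields untouched are all assumptions made for the example.

```javascript
// Illustrative sketch only: NOT the harvester's actual implementation.
// Replaces each "$path.to.field" token with the value found at that path.
function expandTemplate(template, doc) {
  // A token is "$" followed by dot-separated identifiers, e.g. $metadata.latitude
  const token = /\$([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)/g;
  return template.replace(token, (match, path) => {
    let value = doc;
    for (const key of path.split(".")) {
      if (value == null) break;
      value = value[key];
    }
    // Assumption: unresolved fields are left as-is rather than raising an error.
    return value == null ? match : String(value);
  });
}

// Hypothetical record using the field names from the examples above.
const sampleDoc = {
  metadata: {
    reportdatetime: "Mar 10, 2011 12:00:00 AM",
    offense: "ROBBERY",
    method: "GUN",
    blocksiteaddress: "1100 B/O 1ST ST NW",
    latitude: "38.9051666534795"
  }
};

const latValue = expandTemplate("$metadata.latitude", sampleDoc);
const descriptionValue = expandTemplate(
  "$metadata.reportdatetime: $metadata.offense,$metadata.method was reported at: $metadata.blocksiteaddress",
  sampleDoc
);
console.log(latValue);
console.log(descriptionValue);
```

Everything outside a $ token (punctuation, spaces, words) is copied through unchanged, which is why the separators in the specification string appear verbatim in the result.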
The result of the entity specification above can be seen in the sample output below:
{
    //...
    "entities" : [
        {
            "actual_name" : "1100 B/O 1ST ST NW WASHINGTON DC",
            "dimension" : "Where",
            "disambiguated_name" : "1100 B/O 1ST ST NW WASHINGTON DC",
            "doccount" : 3,
            "frequency" : 1,
            "index" : "1100 b/o 1st st nw washington dc/place",
            "geotag" : {
                "lat" : "38.9051666534795",
                "lon" : "-77.0121735726172"
            },
            "ontology_type": "point",
            "relevance" : "0",
            "totalfrequency" : 3,
            "type" : "Place"
        }
    ],
    //...
}
Note that in the sample output above, the Infinit.e harvester automatically generates the following fields as appropriate:
- doccount: The number of documents in the Infinit.e database in which the entity occurs.
- frequency: The number of times the entity occurs in the document (Note: the system defaults the frequency count to 1; however, it is possible to specify a frequency count within a source document).
- totalfrequency: The number of times the entity occurs across all documents in the Infinit.e database.
- relevance: A value between 0 and 1 (in the form of a string containing a floating point number), indicating the entity extraction engine's "opinion" of the entity's relevance within the document.
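As a rough illustration of how doccount and totalfrequency relate to per-document frequency, the sketch below tallies them over a small in-memory list of documents. It is not how Infinit.e computes these values internally. The helper names (entityIndex, entityStats) are hypothetical, and the rule used to build the key (lower-cased disambiguated_name, a slash, then lower-cased type) is only inferred from the "index" values in the sample outputs on this page.

```javascript
// Assumption: inferred from the sample outputs, where "index" looks like
// lower-cased disambiguated_name + "/" + lower-cased type.
function entityIndex(entity) {
  return (entity.disambiguated_name + "/" + entity.type).toLowerCase();
}

// Tallies doccount and totalfrequency per entity over a list of documents.
function entityStats(docs) {
  const stats = new Map();
  for (const doc of docs) {
    const seenInThisDoc = new Set();
    for (const entity of doc.entities) {
      const key = entityIndex(entity);
      const s = stats.get(key) || { doccount: 0, totalfrequency: 0 };
      // doccount counts each document at most once per entity.
      if (!seenInThisDoc.has(key)) {
        s.doccount += 1;
        seenInThisDoc.add(key);
      }
      // frequency defaults to 1 when the source does not specify it.
      s.totalfrequency += entity.frequency === undefined ? 1 : entity.frequency;
      stats.set(key, s);
    }
  }
  return stats;
}

// Three hypothetical documents that each mention the same place once.
const place = { disambiguated_name: "1100 B/O 1ST ST NW WASHINGTON DC", type: "Place" };
const sampleDocs = [{ entities: [place] }, { entities: [place] }, { entities: [place] }];
const placeStats = entityStats(sampleDocs).get(entityIndex(place));
console.log(entityIndex(place), placeStats);
```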
Create Entities from Arrays of Items - Basic Example
The Structured Analysis Harvester is capable of extracting data contained within JSON arrays (i.e. _doc.metadata.someTypeOfEntity[]) using the iterateOver field of the entity object as shown below:
"entities" : [
    //...
    {
        "iterateOver" : "location",
        "disambiguated_name" : "$SCRIPT( return _iterator.citystateprovince.city+','+_iterator.citystateprovince.stateprovince+','+_iterator.country; )",
        // (use _value if the iterating field is a primitive type, e.g. a string;
        // _iterator if it is an object; _iterator.X to access the field X of the object, etc.)
        "actual_name": "$citystateprovince.city,$citystateprovince.stateprovince,$country",
        // (or use the $ format - note that when using iterateOver,
        // you can't access $metadata.FIELD any more; the $ is offset from the last clause of "iterateOver")
        "useDocGeo" : true,
        "dimension" : "Where",
        "type" : "Location"
    }
    //...
]
In the example above the iterateOver value is set to "location" meaning that the Structured Analysis Harvester will iterate (or loop) over the metadata.location[] objects and create an entity for each object in the array. The source example below shows a location array with one location object:
"metadata" : {
    //...
    "location" : [
        {
            "region" : "East Asia-Pacific",
            "citystateprovince" : {
                "stateprovince" : "Narathiwat",
                "city" : "Sungai Padi"
            },
            "country" : "Thailand"
        }
    ],
    //...
}
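To make the per-item behaviour concrete, here is a minimal sketch of what iterating over metadata.location amounts to. It only mimics the effect of the specification above in plain JavaScript. The function locationName is a hypothetical stand-in for the body of the $SCRIPT expression, and the harvester itself does considerably more (for example, the geotag handling is not modelled).

```javascript
// Hypothetical source record with the same shape as the sample above.
const locationDoc = {
  metadata: {
    location: [
      {
        region: "East Asia-Pacific",
        citystateprovince: { stateprovince: "Narathiwat", city: "Sungai Padi" },
        country: "Thailand"
      }
    ]
  }
};

// Stand-in for the $SCRIPT body: _iterator is the current array element.
function locationName(_iterator) {
  return _iterator.citystateprovince.city + "," +
         _iterator.citystateprovince.stateprovince + "," +
         _iterator.country;
}

// One entity is produced for each object in metadata.location.
const locationEntities = locationDoc.metadata.location.map((item) => ({
  disambiguated_name: locationName(item),
  actual_name: locationName(item),
  dimension: "Where",
  type: "Location"
}));

console.log(locationEntities);
```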
Nested fields are reached using "dot notation": e.g. if, in the above instance, the location were inside an object (or array of objects) called "more_information", then the "iterateOver" field would be set to "more_information.location".
(This is equivalent to the less tidy technique of nesting Entity Specification JSON objects: the first has "iterateOver": "more_information" and contains a second Entity Specification JSON object identical to the original example.)
A few useful tips for using "iterateOver":
- Arrays and objects are treated equally in the dot notation (i.e. an object is just treated like an array of size 1).
  - E.g. for both "{ A: { B: { C: "value" } } }" and "{ A: [ { B: [ { C: [ "value" ] } ] } ] }", you would use "iterateOver": "A.B.C" to get to "value".
- If any of the fields point to primitives (e.g. B: [ "val1", "val2" ] in the example above), then an error is thrown unless the "creation criteria" script for the nested object (C in this example) is specified.
  - (You can still throw errors from the script, e.g. by checking "(_iterator==null)".) This makes it possible to write specification objects that handle fields being either primitives or objects (e.g. by checking _iterator and _value).
- For non-nested entity specification objects, the first field in "iterateOver" refers to the metadata object, e.g. "iterateOver": "location" refers to "_doc.metadata.location".
  - (For nested objects, the first field refers to the "parent" object, but you shouldn't need nesting now that dot notation is available!)
- Reminder: if you are iterating over:
  - An object, then use "_iterator.FIELD" in scripts, "$FIELD" for normal strings.
    - (Note that "$metadata.X" won't work inside "iterateOver" clauses; you have to use constructs like "$SCRIPT( return _doc.metadata.X[0]; )" to get at the top-level fields. We will probably fix this at some point.)
  - A value, then use "_value" in scripts, "$" for normal strings.
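The dot-notation rules in the tips above can be sketched as a small path resolver. This is a simplified model under stated assumptions, not the harvester's code: the function name resolveIterateOver is hypothetical, and the "creation criteria" escape hatch is not modelled, so walking through a primitive always throws here.

```javascript
// Returns the items that "iterateOver": path would visit, starting from metadata.
// Objects are treated like arrays of size 1 at every step of the path.
function resolveIterateOver(metadata, path) {
  let current = [metadata];
  for (const key of path.split(".")) {
    const next = [];
    for (const item of current) {
      // A primitive in the middle of the path cannot be descended into.
      if (item === null || typeof item !== "object") {
        throw new Error('cannot read "' + key + '" from a primitive value');
      }
      const field = item[key];
      if (field === undefined || field === null) continue; // nothing to iterate
      if (Array.isArray(field)) next.push(...field);
      else next.push(field);
    }
    current = next;
  }
  return current;
}

// Both shapes from the tips resolve to the same single item.
const asObjects = { A: { B: { C: "value" } } };
const asArrays = { A: [{ B: [{ C: ["value"] }] }] };
console.log(resolveIterateOver(asObjects, "A.B.C"));
console.log(resolveIterateOver(asArrays, "A.B.C"));
```

In this model the items returned for a path ending in objects would be exposed to scripts as _iterator, and items that are primitives (as in both calls above) as _value.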
The example below demonstrates how a location entity is created from source data (Note: The full source and sample output created from the source can be found here: XML File Source).
"entities" : [
    //...
    {
        "actual_name" : "Sungai Padi,Narathiwat,Thailand",
        "dimension" : "Where",
        "disambiguated_name" : "Sungai Padi,Narathiwat,Thailand",
        "doccount" : NumberLong(51),
        "frequency" : 1,
        "index" : "sungai padi,narathiwat,thailand/location",
        "geotag" : {
            "lat" : "6.085833",
            "lon" : "101.881389"
        },
        "ontology_type": "city",
        "totalfrequency" : NumberLong(51),
        "type" : "Location"
    },
    //...
],
Create Entities from Arrays of Items - Advanced Example
The following example is similar to the basic example above, but shows how to extract multiple entities from each array item by nesting them using the "entities" field:
"entities" : [
    //...
    {
        "iterateOver" : "victim",
        "entities" : [
            {
                "disambiguated_name" : "$SCRIPT( getVictim( _iterator ); )",
                "frequency" : "$FUNC( getVictimCount(); )",
                "dimension" : "Who",
                "type" : "VictimType"
            },
            {
                "disambiguated_name" : "$SCRIPT( getVictim( _iterator ); )",
                "frequency" : "$hostagecount",
                "dimension" : "Who",
                "type" : "HostageType"
            }
        ]
    }
    //...
]
The entity specification above is designed to extract two entities from each victim object in the example source below. Note that each victim object in the source carries three counts: wounded count, dead count, and hostage count. The two entities created from each victim object are identical except for the following two differences:
- frequency
  - Entity 1: the frequency count will be equal to the sum of "woundedcount" + "deadcount"
  - Entity 2: the frequency count will be equal to "hostagecount"
- type
  - Entity 1: type will equal "VictimType"
  - Entity 2: type will equal "HostageType"
"metadata" : {
    //...
    "victim" : [
        {
            "child" : "No",
            "indicator" : "Targeted",
            "nationality" : "Thailand",
            "targetedcharacteristic" : "Unknown",
            "woundedcount" : "1",
            "deadcount" : "1",
            "definingcharacteristic" : "None",
            "combatant" : "No",
            "hostagecount" : "2",
            "victimtype" : "Police"
        }
    ],
    //...
}
The example output below demonstrates how two victim entities are created from the source data and entity specification found above:
"entities" : [
    //...
    {
        "actual_name" : "Targeted, Police, Adult from Thailand",
        "dimension" : "Who",
        "disambiguated_name" : "Targeted, Police, Adult from Thailand",
        "doccount" : NumberLong(186),
        "frequency" : 2,
        "index" : "targeted, police, adult from thailand/victimtype",
        "totalfrequency" : NumberLong(478),
        "type" : "VictimType"
    },
    {
        "actual_name" : "Targeted, Police, Adult from Thailand",
        "dimension" : "Who",
        "disambiguated_name" : "Targeted, Police, Adult from Thailand",
        "doccount" : NumberLong(40),
        "frequency" : 2,
        "index" : "targeted, police, adult from thailand/hostagetype",
        "totalfrequency" : NumberLong(53),
        "type" : "HostageType"
    },
    //...
],
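The specification in this section calls getVictim and getVictimCount, whose definitions are not shown on this page. The sketch below is one hypothetical pair of implementations that would be consistent with the sample source and output above. The "Adult"/"Child" wording, the use of a module-level _iterator variable to stand in for the one the harvester exposes to scripts, and the driver loop are all assumptions made for illustration.

```javascript
// Stands in for the _iterator variable that scripts see while iterating.
let _iterator = null;

// Hypothetical: builds a name such as "Targeted, Police, Adult from Thailand".
function getVictim(victim) {
  // Assumption: "child": "No" maps to "Adult", anything else to "Child".
  const age = victim.child === "No" ? "Adult" : "Child";
  return victim.indicator + ", " + victim.victimtype + ", " + age +
         " from " + victim.nationality;
}

// Hypothetical: wounded + dead for the current victim.
// The counts are strings in the source, so they are parsed before adding.
function getVictimCount() {
  return parseInt(_iterator.woundedcount, 10) + parseInt(_iterator.deadcount, 10);
}

// Same shape as the sample "victim" array above (unused fields omitted).
const victims = [
  {
    child: "No", indicator: "Targeted", nationality: "Thailand",
    woundedcount: "1", deadcount: "1", hostagecount: "2", victimtype: "Police"
  }
];

// Mimics the nested specification: two entities per victim object.
const victimEntities = [];
for (const victim of victims) {
  _iterator = victim;
  victimEntities.push({
    disambiguated_name: getVictim(_iterator),
    frequency: getVictimCount(),
    dimension: "Who",
    type: "VictimType"
  });
  victimEntities.push({
    disambiguated_name: getVictim(_iterator),
    frequency: parseInt(_iterator.hostagecount, 10), // "$hostagecount"
    dimension: "Who",
    type: "HostageType"
  });
}
console.log(victimEntities);
```

Parsing the counts matters: adding the raw strings "1" and "1" in JavaScript would concatenate them into "11" rather than produce 2.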
Nesting is not recommended except in cases where nested fields can refer to either primitives or objects. Just use the "dot notation" described above where possible.
You can nest entities an arbitrary number of times, provided there is a valid "iterateOver" field specified at every level. An entity object can also both specify an entity and contain a nested "entities" array, although this is not recommended.
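Arbitrary nesting can be pictured as a simple recursion over the specification. The following is a toy model, not the harvester's algorithm: expandSpec is a hypothetical name, and the "make" callback is an invented stand-in for the "$SCRIPT(...)" and "$FIELD" strings of a real specification. It reuses the "more_information.location" scenario described earlier on this page.

```javascript
// Toy model of a (possibly nested) entity specification.
function expandSpec(spec, current) {
  // A specification without "iterateOver" builds one entity from the current object.
  if (!spec.iterateOver) {
    return spec.make ? [spec.make(current)] : [];
  }
  const field = current[spec.iterateOver];
  // Objects are treated like arrays of size 1; missing fields yield nothing.
  const items = field == null ? [] : Array.isArray(field) ? field : [field];
  const entities = [];
  for (const item of items) {
    // An object may both specify an entity and contain nested specifications.
    if (spec.make) entities.push(spec.make(item));
    for (const child of spec.entities || []) {
      entities.push(...expandSpec(child, item));
    }
  }
  return entities;
}

// Hypothetical metadata where the location array sits inside "more_information".
const nestedMetadata = {
  more_information: {
    location: [
      { citystateprovince: { stateprovince: "Narathiwat", city: "Sungai Padi" }, country: "Thailand" }
    ]
  }
};

// Nested form: one level per field, each with its own "iterateOver".
const nestedSpec = {
  iterateOver: "more_information",
  entities: [
    {
      iterateOver: "location",
      make: (loc) => ({
        disambiguated_name: loc.citystateprovince.city + "," +
          loc.citystateprovince.stateprovince + "," + loc.country,
        dimension: "Where",
        type: "Location"
      })
    }
  ]
};

const nestedEntities = expandSpec(nestedSpec, nestedMetadata);
console.log(nestedEntities);
```

The nested form yields the same entity that "iterateOver": "more_information.location" would, which is why the dot notation is usually the tidier choice.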
Further Reading