Infinit.e - developing visualizations and plugins - custom entity extractor

Overview of entity extraction

Infinit.e's entity extractors take harvested documents, ie URLs (RSS/HTML), text (files), or metadata objects (XML, databases), and add meaning in the form of entities and associations between entities.

Examples of the built-in entity extractors include OpenCalais, AlchemyAPI, and Salience, all of which are referenced later on this page.

In addition to the above entity extractors, Infinit.e has three options for "text extractors", which convert URLs into text (eg advert removal, HTML tag cleansing etc); the Boilerpipe and Tika extractors discussed below are two examples.

Developing custom entity extractors

This site will not go into much detail on how to develop a custom entity extractor, because the intention is to move to an "open standards" interface. For examples of wrapping text and entity extractors in the IEntityExtractor/ITextExtractor interfaces, check out the embedded extractors like TextExtractorBoilerpipe, TextExtractorTika and TextRankExtractor: other than being packaged in a separate JAR, custom extractors are constructed the same way.

But in brief:

  • Create a JAR file comprising the following:
    • A class that implements IEntityExtractor (overriding all of its functions, see below)
    • Classes from the following library (available from the artifacts directory of the "Infinit.e OSS Gold" project of the JIRA build site):
      • infinit.e.data_model
        • (note unlike the other "core" libraries, the data model is Apache-licensed, so can be linked to from proprietary - or differently licensed - code).
  • Either: (recommended for production)
    • Copy the JAR file into "/opt/infinite-home/lib/extractors/"
    • Add the following line to the "infinite.api.properties" and "infinite.service.properties" files in "/opt/infinite-home"
      • extractor.entity.custom=<fully qualified class name of the extractor>
        • eg "extractor.entity.custom=com.ikanow.infinit.e.harvest.custom.BuiltInKeywordExtractor"
        • (note multiple extractor classes can be specified like this, comma-separated on a single line)
    • To use the "config/source/test" API call the Interface Engine must be restarted ("service tomcat6-interface-engine restart")
  • Or: (recommended for system development and testing)
    • Upload the JAR via the file uploader, ensure it is shared across all communities for which you will be ingesting sources
    • In the source, in the textEngine or featureEngine objects (or useTextExtractor/useExtractor for legacy sources), just specify the "_id" of the uploaded share (just the bit after "api/social/share/get/", not the entire URL).
      • (Note that once an extractor binary has been used in a source, it is cached until the API is restarted, so to upload a different version as a developer you must delete/recreate the share each time - this issue should be fixed at some point)

Finally, a quick description of what to do in each of the functions (the built-in versions can be used as examples, eg OpenCalais and Boilerpipe):

getName

Return a globally unique string - this is the string (case insensitive) that should be specified in the "useExtractor" or "useTextExtractor" fields of the source specification.

extractEntities

A DocumentPojo (see JSON specification) called "partialDoc" is passed in, with its metadata and fullText fields populated. Develop code to create entities and associations, and append them to the entities and associations fields of the document. See the "Entity extraction" section below.

(This also applies to "extractEntitiesAndText".) As described below under "Entity extractor batching", extractor modules should accept calls with partialDoc==null. In most cases they can simply return immediately in that case.

extractEntitiesAndText

Like "extractEntities" but "partialDoc" has its "url" field populated but not "fullText", ie this function needs to extract the text first. See the "Entity extraction" section below.

Note that this function is only called when an IEntityExtractor is also an ITextExtractor (ITextExtractors have only one interface function apart from getName, "extractText").

getCapability

This is currently unused (and is unlikely to become used now that the plan is to move to UIMA); just return null.
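
Putting this together, a minimal skeleton might look as follows. This is a sketch only: the exact IEntityExtractor method signatures and import paths are assumptions (check the interface in the infinit.e.data_model JAR), and the package/class names are illustrative.

	package com.example.extractor; // (illustrative)

	import com.ikanow.infinit.e.data_model.interfaces.harvest.IEntityExtractor; // (import paths assumed)
	import com.ikanow.infinit.e.data_model.store.document.DocumentPojo;

	public class MyCustomExtractor implements IEntityExtractor {

		public String getName() {
			return "mycustomextractor"; // globally unique, matched case-insensitively in source specifications
		}

		public void extractEntities(DocumentPojo partialDoc) {
			if (null == partialDoc) {
				return; // (see "Entity extractor batching" below - non-batching modules can just return)
			}
			// metadata and fullText are populated - create entities/associations and append them here
		}

		public void extractEntitiesAndText(DocumentPojo partialDoc) {
			if (null == partialDoc) {
				return;
			}
			// url is populated but fullText is not - extract the text first, then proceed as above
		}

		public String getCapability() { // (signature assumed)
			return null; // currently unused
		}
	}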

Entity extraction - more details and examples

In most cases, the entity extractor will take the full text of the document and return a list of entities, for example:

	String text = partialDoc.getFullText();
	List<ThirdPartyEntityObj> ents = thirdPartyEntityExtractor.process(text);
	if (null == partialDoc.getEntities()) {
		partialDoc.setEntities(new ArrayList<EntityPojo>(ents.size()));
	}
	for (ThirdPartyEntityObj entToConvert: ents) {
		// Mandatory fields
		EntityPojo ent = new EntityPojo();
		ent.setDisambiguatedName(entToConvert.getValue());
		// (can leave the actual name alone - if present it should be an alias of the disambiguated name)
		ent.setType(entToConvert.getType()); // (any string value can be used as type)
		ent.setDimension(EntityPojo.Dimension.What); // (or Who/When/Where - mostly just used for iconography in the GUI)
		ent.setRelevance(0.5); // double in [0.0, 1.0], lets the extractor set the relative relevance of entities in a doc
		ent.setFrequency(1L); // the number of times the entity occurs within the document

		// Other fields are optional, eg
		//
		//ent.setGeotag(GeoPojo); // (GeoPojo), if this is set then the ontological type should also be set (defaults to "point")
		//ent.setOntologicalType(String); // one of "continent", "country", "countrysubsidiary", "city", "point", or "geographicalregion"
		//
		//ent.setSentiment(double); // double between -1.0 and 1.0
		//ent.setLinkdata(List<String>); // list of URLs

		partialDoc.getEntities().add(ent); // (don't forget to append each new entity to the document)
	}

The entity JSON is described further here. The data model class "DimensionUtility" maps many standard type strings onto the Dimension (soon this will be a user-extendable list).
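
A hypothetical usage sketch (the method name is an assumption - check the DimensionUtility class in the data model for its actual API):

	ent.setDimension(DimensionUtility.getDimensionByType(ent.getType())); // (hypothetical method name)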

The entity extractor is free to perform whatever transforms it wants, however - any fields can be modified, and any fields can be used as the input.

  • Example: the IP geolocator will scan the existing entities for IP-address-like strings, and tag them with geo information.
  • Example: the Salience entity extractor will also fill in the "description" field.

Although types can be any arbitrary string, in practice they should be chosen to be part of a consistent ontology. Future versions of Infinit.e will allow administrators to place restrictions on the entity types allowed within a given community. The current platform has a number of "special entity types":

  • Keyword (typically a word or phrase that is not "understood" by the NLP, but is statistically significant within the document)
  • Topic, SubTopic: these can be hand crafted by source developers, or link to Wiki topics, etc.

One other very standard entity extractor operation is taking entities and generating associations from them. The basic idea is the same as for entities (the code should check whether "partialDoc.getAssociations()" is null and insert an "ArrayList<AssociationPojo>" if so); the following fields are mandatory (a code sketch follows the notes below):

  • One of:
    • "entity1_index"
    • "entity2_index"
    • "geo_index"
  • "verb_category"
  • "assoc_type"  - possible values described below

The association should be viewed as "<entity1_index> does <verb_category> to <entity2_index> at <geo_index> between <time_start> and <time_end>".

The following points should be noted:

  • Any field ending in "_index" must either be null or filled with the "entity.index" of an entity in the "partialDoc.getEntities()" list.
  • If only one entity is filled in, then the "assoc_type" must be "Summary", otherwise:
    • If the association is transient (eg "obama visits europe") then it should be an "Event"
    • If the association has some degree of permanence ("england is a country" vs "chris is the CEO") then it should be a "Fact"
  • If "geo_index" is populated then the "geotag" object should also be populated
  • "time_start" and "time_end" are both optional, with:
    • "time_end" is not allowed unless "time_start" is populated
    • "time_end" can be left blank if "time_start" is populated - this represents a "point in time event"
  • "entity1", "entity2", "verb" are optional fields that can represent "aliases"
    • "entity2" can also be used to attach blocks of text to entity1_index entities - eg a "quotation" association.

Entity extractors can remove existing entities and associations, though this is not standard behavior.

Text extraction

The text extractor will typically take a URL from either "partialDoc.getUrl()" or "partialDoc.getDisplayUrl()" and use it to fill in the "fullText" field ("partialDoc.setFullText(contents)"). In some cases it will instead be used to transform the "fullText" field - eg to pull out the relevant/interesting text from an HTML page that also contains adverts, navigation links etc (this is what Boilerpipe does).

Note that text extractors that support local URL fetching (vs sending URLs to 3rd party services) are responsible for ensuring that the security sandbox is not breached (eg "localhost:9200" gives unfettered access to the elasticsearch index). Examples of how to do this can be found in the Tika and Boilerpipe source code.
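
For the local-fetch case, here is a minimal sketch, assuming an "extractText(DocumentPojo)" signature (check ITextExtractor for the exact one); the sandbox check is illustrative only - the Tika and Boilerpipe sources remain the authoritative examples:

	// (imports needed: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader,
	//  java.net.InetAddress, java.net.URL)
	public void extractText(DocumentPojo partialDoc) throws IOException {
		URL url = new URL(partialDoc.getUrl());
		// Illustrative sandbox check - refuse URLs that resolve to local/internal addresses
		// (eg "localhost:9200" would otherwise expose the elasticsearch index)
		InetAddress addr = InetAddress.getByName(url.getHost());
		if (addr.isLoopbackAddress() || addr.isSiteLocalAddress() || addr.isAnyLocalAddress()) {
			throw new IOException("URL blocked by security sandbox: " + url);
		}
		StringBuilder contents = new StringBuilder();
		BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
		try {
			String line;
			while (null != (line = reader.readLine())) {
				contents.append(line).append('\n');
			}
		}
		finally {
			reader.close();
		}
		partialDoc.setFullText(contents.toString());
	}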

Extractor options

The "engineConfig" parameter from the Automated text extraction and Feature extraction elements of the source pipeline is visible in each document passed to custom extractors as:

Map<String, String> options = partialDoc.getTempSource().getExtractorOptions(); // (can be null, if no engineConfig specified)

Note that the map is guaranteed not to change between documents in a given source, so you can safely cache the map for the first document in a given instance of an extractor.
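
For example, a minimal caching sketch (the member names and the "app.mykey" option key are illustrative):

	private Map<String, String> _cachedOptions = null;
	private boolean _optionsCached = false; // (separate flag, since the map itself can legitimately be null)

	private void cacheOptions(DocumentPojo partialDoc) {
		if (!_optionsCached) { // only needs doing for the first document seen by this instance
			_cachedOptions = partialDoc.getTempSource().getExtractorOptions();
			_optionsCached = true;
		}
	}

	// Usage, eg: String apiKey = (null != _cachedOptions) ? _cachedOptions.get("app.mykey") : null;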

Entity extractor batching

In most cases the entity extractor should synchronously handle each document (there is a separate threading model, see below).

Entity extractors can batch documents up - the framework will call "extractEntities()" with null, which indicates that the module should flush its batch, ie populate the entities of all buffered documents.

Note that the other components in the source pipeline do not support batching and therefore this function should only be used in very simple pipelines (or right at the end of more complex pipelines).

Example: this can be used with AlchemyAPI to batch up tweets to save on API key costs.
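
A minimal batching sketch (the threshold and the third-party "processBatch" call are hypothetical):

	private List<DocumentPojo> _batch = new ArrayList<DocumentPojo>();
	private static final int MAX_BATCH_SIZE = 25; // (illustrative)

	public void extractEntities(DocumentPojo partialDoc) {
		if (null != partialDoc) {
			_batch.add(partialDoc);
		}
		// Flush when the batch is full, or when the framework signals a flush by passing null
		if (!_batch.isEmpty() && ((null == partialDoc) || (_batch.size() >= MAX_BATCH_SIZE))) {
			thirdPartyEntityExtractor.processBatch(_batch); // (hypothetical batched call)
			// ... then convert the results and append entities to each buffered document ...
			_batch.clear();
		}
	}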

Module persistence and threading considerations

When run from the harvester, each thread (there are a configurable number of threads) will instantiate a separate instance of the module, and will call the module once per document, synchronously and from the same thread. Therefore, if the extractor module uses large amounts of memory or is very slow to instantiate, shared memory should be used (or run the extractor in a separate persistent process and use some form of remote method invocation).
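
For example, a heavyweight model can be shared across the per-thread instances via a static singleton (a sketch - "ThirdPartyModel" is illustrative):

	// One heavyweight object shared across all extractor instances in the JVM
	private static volatile ThirdPartyModel _sharedModel = null;

	private static ThirdPartyModel getSharedModel() {
		if (null == _sharedModel) { // double-checked locking - safe because the field is volatile
			synchronized (MyCustomExtractor.class) {
				if (null == _sharedModel) {
					_sharedModel = ThirdPartyModel.loadFromDisk(); // slow - now happens only once per JVM
				}
			}
		}
		return _sharedModel;
	}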

Compiling and testing extractors

Custom extractors should be compiled against a fairly recent data model JAR. This is available from our open source binary repository (or can be compiled by hand from our github repository).

When building a JAR to upload, you should not include the data model JAR (since it wastes space in the JAR, and could potentially cause conflicts across the - rare - non-backwards-compatible releases of the data model JAR).

To test a JAR, it can be uploaded via the file uploader, with the title being the fully qualified class path of the class implementing the IEntityExtractor interface (eg "com.ikanow.infinit.e.extractor.TestExtractor"). The object id of the uploaded file can then be used in the engineName property of a featureEngine element from the source editor.

Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons