...

Infinit.e's entity extractors take harvested documents, i.e. URLs (RSS/HTML), text (files), or metadata objects (XML, databases), and add meaning in the form of entities and associations between entities.

...

  • Create a JAR file comprising the following:
    • A source file derived from IEntityExtractor (overriding all the functions, see below)
    • Classes from the following library (available from the artifacts directory of the "Infinit.e OSS Gold" project of the JIRA build site):
      • infinit.e.data_model
        • (Note: unlike the other "core" libraries, the data model is Apache-licensed, so it can be linked to from proprietary, or differently licensed, code.)
  • Either: (recommended for production)
    • Copy the JAR file into "/opt/infinite-home/lib/extractors/"
    • Add the following line to the "infinite.api.properties" and "infinite.service.properties" files in "/opt/infinite-home"
      • extractor.entity.custom=<fully qualified class name of the extractor in the JAR>
        • e.g. "extractor.entity.custom=com.ikanow.infinit.e.harvest.custom.BuiltInKeywordExtractor"
        • (Note: multiple extractors can be specified like this, comma-separated on a single line.)
    • To use the "config/source/test" API call, the Interface Engine must be restarted ("service tomcat6-interface-engine restart")
  • Or: (recommended for system development and testing)
    • Upload the JAR via the file uploader, ensuring it is shared across all communities for which you will be ingesting sources
    • In the source, in the textEngine or featureEngine objects (or useTextExtractor/useExtractor for legacy sources), specify the "_id" of the uploaded JAR (only the part after "api/social/share/get", not the entire URL).
      • (Note: once an extractor has been used in a source, its binary is cached until the API is restarted. To upload a different version as a developer you must therefore delete and recreate the share each time. This issue should be fixed at some point.)

...

Return a globally unique string - this is the string (case insensitive) that should be specified in the "useExtractor" or "useTextExtractor" fields of the source specification.

extractEntities

A DocumentPojo (see JSON specification) called "partialDoc" is passed in, with its metadata and fullText fields populated. Develop code to create entities and associations, and append them to the entities and associations fields of the document. See the "Entity extraction" section below.
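
As a rough illustration of the two functions described above, the following minimal sketch shows a keyword-matching extractor. It is deliberately self-contained so it compiles without the platform libraries: DocStandIn, EntityStandIn and ExtractorStandIn are hypothetical, simplified stand-ins for DocumentPojo, the entity objects and IEntityExtractor, and their field and method names are assumptions rather than the real infinit.e.data_model signatures. A real extractor would implement IEntityExtractor itself and override all of its functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class KeywordExtractorSketch {

    // Hypothetical stand-in for DocumentPojo: only the two fields this sketch needs.
    static class DocStandIn {
        String fullText;
        List<EntityStandIn> entities = new ArrayList<>();
    }

    // Hypothetical stand-in for an entity; the real entity object has its own fields.
    static class EntityStandIn {
        final String name;
        final String type;
        final int frequency;

        EntityStandIn(String name, String type, int frequency) {
            this.name = name;
            this.type = type;
            this.frequency = frequency;
        }
    }

    // Hypothetical stand-in for IEntityExtractor, reduced to two functions.
    interface ExtractorStandIn {
        String getName();
        void extractEntities(DocStandIn partialDoc);
    }

    static class KeywordExtractor implements ExtractorStandIn {
        private final List<String> keywords;

        KeywordExtractor(List<String> keywords) {
            this.keywords = keywords;
        }

        // The unique string that a source would use to select this extractor.
        @Override
        public String getName() {
            return "SketchKeywordExtractor";
        }

        // Appends one entity per keyword found in the document's full text.
        @Override
        public void extractEntities(DocStandIn partialDoc) {
            if (partialDoc.fullText == null) {
                return; // nothing to extract from
            }
            String text = partialDoc.fullText.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                int count = countOccurrences(text, keyword.toLowerCase(Locale.ROOT));
                if (count > 0) {
                    partialDoc.entities.add(new EntityStandIn(keyword, "Keyword", count));
                }
            }
        }

        // Counts non-overlapping occurrences of needle in text.
        private static int countOccurrences(String text, String needle) {
            if (needle.isEmpty()) {
                return 0;
            }
            int count = 0;
            int from = 0;
            while ((from = text.indexOf(needle, from)) >= 0) {
                count++;
                from += needle.length();
            }
            return count;
        }
    }

    public static void main(String[] args) {
        DocStandIn doc = new DocStandIn();
        doc.fullText = "Storm warning: the storm reached the coast.";
        new KeywordExtractor(Arrays.asList("storm", "coast", "flood")).extractEntities(doc);
        for (EntityStandIn e : doc.entities) {
            System.out.println(e.name + " x" + e.frequency);
        }
    }
}
```

The sketch mutates the document it is given instead of returning a new one, which mirrors the "append them to the document" instruction above. Associations are omitted here and would be appended in the same way.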

...

When building a JAR to upload, do not include the data model JAR: it wastes space in the JAR and could cause conflicts across the (rare) non-backwards-compatible data model releases.

To test a JAR, upload it via the file uploader, setting the title to the fully qualified name of the class implementing the IEntityExtractor interface (e.g. "com.ikanow.infinit.e.extractor.TestExtractor"). The object id of the uploaded file can then be used in the engineName property of a featureEngine element from the source editor.