...

  1. Extract data from the source and turn it into documents, extracting metadata from formats such as XML and PDF (harvesting).
  2. Enrich the source data by extracting entities, events, geographic/location data, etc. (enrichment). This is broken down into the following phases (note: the roadmap is to move this to a completely user-defined UIMA chain):
    1. Structured Analysis Handler, phase 1: fill in unstructured document-level fields (title, description, full text) from metadata, if needed.
    2. Unstructured Analysis Handler, phase 1: use regexes and JavaScript to pull out new metadata fields from the unstructured document-level fields.
      1. (Special case: if Tika is specified as the text extraction engine, then text extraction is performed before any Unstructured Analysis Handler phase.)
    3. Unstructured Analysis Handler, phase 2: use regex replaces to transform the source text, if needed.
    4. Unstructured Analysis Handler, phase 3: use regexes and JavaScript to pull out new metadata fields from the cleansed unstructured document-level fields.
    5. Standard extraction, phase 1 (text extraction): if needed, use a "text extractor" to create the text that is submitted to the entity extraction service in the next phase (the entity extraction service often combines the two phases).
    6. Standard extraction, phase 2 (entity extraction): use an "entity extractor" (e.g. AlchemyAPI) to pull out entities and associations from the submitted text/URL.
    7. Structured Analysis Handler, phase 2: fill in the remaining document-level fields (URL, published date, document geo ...), plus the title and description if these returned null before (i.e. in case the UAH has since filled in the required fields).
    8. Structured Analysis Handler, phase 3: create new entities from the metadata, and combine entities from all phases into associations.
  3. Update entity counts/aggregates (generic processing - statistics)
  4. Store the finished documents within Infinit.e's MongoDB data store and Elasticsearch index (generic processing - aggregation)
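
The ordering of the enrichment phases can be easier to follow as code. The following is a minimal Python sketch, not Infinit.e's actual implementation: every function name, the document field names (fullText, metadata, entities, phases) and the regexes are hypothetical, and the entity extractor is a trivial stand-in for a real service. It covers only a subset of the phases (SAH phase 1, UAH phases 1 to 3, entity extraction, SAH phase 2) and shows that each phase sees the output of the one before it.

```python
import re

# All names below are illustrative assumptions, not part of Infinit.e.

def sah_fill_text_fields(doc):
    """SAH phase 1: fill title/description/full text from metadata if missing."""
    for field in ("title", "description", "fullText"):
        if not doc.get(field):
            doc[field] = doc["metadata"].get(field)
    return doc

def uah_extract_raw(doc):
    """UAH phase 1: regex over the raw text to create a new metadata field."""
    doc["metadata"]["rawTags"] = re.findall(r"<(\w+)>", doc["fullText"] or "")
    return doc

def uah_clean_text(doc):
    """UAH phase 2: regex replace to transform (here: strip tags from) the text."""
    doc["fullText"] = re.sub(r"<[^>]+>", "", doc["fullText"] or "")
    return doc

def uah_extract_clean(doc):
    """UAH phase 3: regex over the cleansed text to create more metadata."""
    doc["metadata"]["years"] = re.findall(r"\b(?:19|20)\d{2}\b", doc["fullText"])
    return doc

def extract_entities(doc):
    """Standard extraction (stand-in): treat capitalised words as entities."""
    doc["entities"] = sorted(set(re.findall(r"\b[A-Z][a-z]+\b", doc["fullText"])))
    return doc

def sah_fill_remaining(doc):
    """SAH phase 2: remaining fields, plus the title if it was still empty."""
    doc["url"] = doc["metadata"].get("url")
    if not doc.get("title"):
        doc["title"] = doc["fullText"][:20]
    return doc

# Phases run strictly in this order.
ENRICHMENT_CHAIN = [
    sah_fill_text_fields,
    uah_extract_raw,
    uah_clean_text,
    uah_extract_clean,
    extract_entities,
    sah_fill_remaining,
]

def enrich(doc):
    """Apply each phase in turn and record the order in doc['phases']."""
    doc.setdefault("metadata", {})
    doc["phases"] = []
    for phase in ENRICHMENT_CHAIN:
        doc = phase(doc)
        doc["phases"].append(phase.__name__)
    return doc

if __name__ == "__main__":
    sample = {
        "metadata": {"title": "Report", "url": "http://example.com/1"},
        "fullText": "<p>Alice met Bob in 2012</p>",
    }
    print(enrich(sample))
```

Keeping the chain as a plain list of functions mirrors the stated roadmap of a user-defined chain: reordering or dropping a phase is a change to the list rather than to the phases themselves.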

...