...

  • Web-based content accessible via URL, including:
    • Static HTML content;
    • RSS and Atom news feeds;
    • RESTful web service interfaces.
  • Traditional relational database management systems (RDBMS) via Java Database Connectivity (JDBC) drivers;
  • Files located on local and network-attached storage devices.

Each source is processed in the following steps:

  1. Extract data from the source and turn it into documents, extracting metadata from sources such as XML and PDF (harvesting).
  2. Enrich source data by extracting entities, events, geographic/location data, etc. This is broken down into the following phases (note: the roadmap is to move this to a completely user-defined UIMA chain):
    1. Structured Analysis Handler, phase 1: fill in unstructured document-level fields (title, description, full text) from metadata, if needed.
    2. Unstructured Analysis Handler, phase 1: use regexes and JavaScript to pull out new metadata fields from the unstructured document-level fields.
    3. Unstructured Analysis Handler, phase 2: use regex replaces to transform the source text, if needed.
    4. Unstructured Analysis Handler, phase 3: use regexes and JavaScript to pull out new metadata fields from the cleansed unstructured document-level fields.
    5. Standard extraction, phase 1 (text extraction): if needed, use a "text extractor" to create the text that is submitted to the entity extraction service in the next phase. Often the entity extraction service combines the two phases.
    6. Standard extraction, phase 2 (entity extraction): use an "entity extractor" (e.g. AlchemyAPI) to pull out entities and associations from the submitted text/URL.
    7. Structured Analysis Handler, phase 2: create new entities from the metadata, combine entities from all phases into associations.
  3. Update entity counts/aggregates (generic processing - statistics)
  4. Store the finished documents in Infinit.e's MongoDB data store and Elasticsearch index (generic processing - aggregation)
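
The ordering of the enrichment phases above can be illustrated with a small, self-contained Python sketch. It is not Infinit.e code: the Document class, the function names, the regular expressions, and the in-memory counts and store are all hypothetical stand-ins chosen for illustration. Harvesting (step 1) is assumed to have already produced the metadata, the entity extractor is a toy regex rather than a real service, and associations are left out.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical in-memory document; not Infinit.e's actual schema."""
    url: str
    metadata: dict = field(default_factory=dict)
    title: str = ""
    full_text: str = ""
    entities: list = field(default_factory=list)


def structured_phase1(doc):
    # Fill document-level fields from metadata, only where they are empty.
    doc.title = doc.title or doc.metadata.get("title", "")
    doc.full_text = doc.full_text or doc.metadata.get("body", "")


def unstructured_extract(doc, name, pattern):
    # Unstructured phases 1 and 3: regex matches become a new metadata field.
    doc.metadata[name] = re.findall(pattern, doc.full_text)


def unstructured_cleanse(doc, pattern, replacement):
    # Unstructured phase 2: a regex replace transforms the source text.
    doc.full_text = re.sub(pattern, replacement, doc.full_text).strip()


def standard_extraction(doc):
    # Toy stand-in for an external entity extractor: capitalised words.
    doc.entities = sorted(set(re.findall(r"\b[A-Z][a-z]+\b", doc.full_text)))


def structured_phase2(doc):
    # Create additional entities from metadata (associations are omitted here).
    doc.entities.extend(doc.metadata.get("emails", []))


def process(doc, counts, store):
    structured_phase1(doc)
    unstructured_extract(doc, "dates", r"\d{4}-\d{2}-\d{2}")
    unstructured_cleanse(doc, r"<[^>]+>", " ")
    unstructured_extract(doc, "emails", r"[\w.]+@[\w.]+\w")
    standard_extraction(doc)
    structured_phase2(doc)
    counts.update(doc.entities)  # step 3: update entity counts
    store.append(doc)            # step 4: stand-in for the data store and index
    return doc


counts, store = Counter(), []
doc = process(
    Document(
        url="http://example.com/a",
        metadata={
            "title": "Report",
            "body": "<p>Alice met Bob on 2012-03-04; contact alice@example.com</p>",
        },
    ),
    counts,
    store,
)
print(doc.entities)
```

In this sketch the first regex pass runs on the raw text and the second runs only after the cleansing replace, which mirrors why the Unstructured Analysis Handler has separate phases before and after the text transformation.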

Creating a Source

The following wiki pages detail the steps involved in creating sources:

...