
Overview of the Infinit.e Data Harvesting Process

The Infinit.e platform features a robust set of data harvesters that give it powerful data extraction and transformation (enrichment) capabilities. Infinit.e's harvesters are designed to consume data from a variety of sources and media types, including:

  • Web based content accessible via URL including:
    • Static HTML content;
    • RSS and ATOM based news feeds;
    • RESTful web service interfaces.
  • Traditional relational database management systems (RDBMS) via Java Database Connectivity (JDBC) drivers;
  • Files located on local and network attached storage devices.
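
The feed case above can be sketched with the Python standard library. This is an illustrative toy only: the real Feed Harvester is part of the Infinit.e platform, and the document field names here (title, url, description) mirror the concepts in this page rather than the platform's actual API.

```python
# Minimal sketch of an RSS "harvest" step: parse a feed and turn each
# <item> into a document record. Illustrative only, not Infinit.e code.
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item>
    <title>First story</title>
    <link>http://example.com/1</link>
    <description>Something happened.</description>
  </item>
</channel></rss>"""

def harvest_rss(rss_xml):
    """Turn each RSS <item> into a document dict (title, url, description)."""
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title"),
            "url": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return docs

docs = harvest_rss(RSS_SAMPLE)
print(docs[0]["title"])  # First story
```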

The harvesting process follows these steps:

  1. Extract data from the source, turn it into documents, and extract metadata from sources such as XML, PDF, etc. (harvesting)
  2. Enrich the source data by extracting entities, events, geographic/location data, etc. This is broken down into the following phases (enrichment; note: the roadmap is to move this to a completely user-defined UIMA chain):
    1. Structured Analysis Handler, phase 1: fill in unstructured document-level fields (title, description, full text) from metadata, if needed.
    2. Unstructured Analysis Handler, phase 1: use regexes and javascript to pull out new metadata fields from the unstructured document-level fields.
    3. Unstructured Analysis Handler, phase 2: use regex replaces to transform the source text, if needed.
    4. Unstructured Analysis Handler, phase 3: use regexes and javascript to pull out new metadata fields from the cleansed unstructured document-level fields.
    5. Standard extraction, phase 1 (text extraction): use a "text extractor" to create the text that is submitted to the entity extraction service in the next phase (if needed, often the entity extraction service will combine the 2 phases).
    6. Standard extraction, phase 2 (entity extraction): use an "entity extractor" (eg AlchemyAPI) to pull out entities and associations from the submitted text/URL.
    7. Structured Analysis Handler, phase 2: fill in the remaining document-level fields (URL, publication date, document geo ... plus the title and description if these returned null earlier, i.e. in case the UAH has already filled in the required fields).
    8. Structured Analysis Handler, phase 3: create new entities from the metadata, combine entities from all phases into associations.
  3. Update entity counts/aggregates (generic processing - statistics)
  4. Store the finished documents in Infinit.e's MongoDB data store and Elasticsearch index (generic processing - aggregation)
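
The enrichment phase ordering above can be sketched as a simple function pipeline. This is a hedged sketch only: the phase names follow the list above, but the function bodies are toy stand-ins (e.g. a title-case heuristic in place of a real entity extractor such as AlchemyAPI), not the platform's actual handlers.

```python
# Toy pipeline mirroring the enrichment phase ordering described above.
import re

def sah_phase1(doc):
    # SAH phase 1: fill unstructured document-level fields from metadata if missing.
    doc.setdefault("fullText", doc.get("metadata", {}).get("raw", ""))
    return doc

def uah_extract(doc):
    # UAH phase 1: pull a new metadata field out of the unstructured text via regex.
    m = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", doc["fullText"])
    if m:
        doc.setdefault("metadata", {})["date"] = m.group(1)
    return doc

def uah_transform(doc):
    # UAH phase 2: regex-replace cleanup of the source text.
    doc["fullText"] = re.sub(r"\s+", " ", doc["fullText"]).strip()
    return doc

def entity_extract(doc):
    # Standard extraction: stand-in for an entity extractor (keeps Title-case words).
    doc["entities"] = [w for w in doc["fullText"].split() if w.istitle()]
    return doc

PIPELINE = [sah_phase1, uah_extract, uah_transform, entity_extract]

def enrich(doc):
    for phase in PIPELINE:
        doc = phase(doc)
    return doc

doc = enrich({"metadata": {"raw": "  Boston  report   2014-01-15 "}})
# doc["entities"] -> ["Boston"]; doc["metadata"]["date"] -> "2014-01-15"
```

The key design point the page describes is ordering: the structured handler seeds the unstructured fields, the unstructured handler extracts and cleanses, and entity extraction runs on the cleansed text.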

Creating a Source

The following wiki pages describe in detail the steps involved in creating sources:

  1. Specifying a data source
    How to specify the mechanics required to extract data from a source system:
    1. Using the Feed Harvester
    2. Using the Database Harvester
    3. Using the File Harvester
  2. Structured Analysis - Overview
    An introduction to the Structured Analysis Harvester and how to specify the methods for enriching structured data sources with geographic information, entities, and events.
    1. Specifying Document Level Geographical Location
    2. Specifying Entities
    3. Specifying Associations
    4. Transforming Data with JavaScript
  3. Unstructured Analysis - Overview
    An introduction to the Unstructured Analysis Harvester.
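
The entity-specification idea above can be sketched as follows: the Structured Analysis Harvester turns structured metadata fields into entity records. This is an assumption-laden illustration; the field names used here (disambiguated_name, actual_name, type, dimension) are illustrative guesses at a typical entity record, not a definitive Infinit.e schema — see the reference pages linked below for the actual source document specification.

```python
# Illustrative only: building entity records from structured metadata,
# in the spirit of the Structured Analysis Harvester. Field names are
# assumptions, not the platform's actual entity schema.
def make_entity(name, ent_type, dimension):
    return {
        "disambiguated_name": name,
        "actual_name": name,
        "type": ent_type,
        "dimension": dimension,  # e.g. Who / What / Where
    }

def entities_from_metadata(metadata):
    """Map known structured fields onto entity records."""
    entities = []
    if "city" in metadata:
        entities.append(make_entity(metadata["city"], "Location", "Where"))
    if "author" in metadata:
        entities.append(make_entity(metadata["author"], "Person", "Who"))
    return entities

ents = entities_from_metadata({"city": "Boston", "author": "J. Smith"})
# Two entities: a Location for "Boston" and a Person for "J. Smith"
```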

A simple web-based GUI is available for working with the structures described in these pages.

Source Reference Documents

Source Document Specification

The following links provide detailed information regarding the objects that make up a Source document and the individual fields within each object to support the introductory materials above.

Sample Source Documents

The following sample source documents are provided as an aid to learning how to create your own sources:

Source APIs: