Overview of the Data Harvesting Process

The Community Edition (CE) platform features a robust set of data harvesters that give it powerful data extraction and transformation (enrichment) capabilities.

CE's harvesters are designed to consume data from a variety of sources and media types, including:

  • Web-based content accessible via URL, including:
    • Static HTML content;
    • RSS and ATOM-based news feeds;
    • RESTful web service interfaces.
  • Traditional relational database management systems (RDBMS) via Java Database Connectivity (JDBC) drivers;
  • Files located on local and network attached storage devices.

Source Pipeline

Harvesting and enrichment form a single logical process built around applying a pipeline of processing elements to the documents emanating from a source.

The following high-level steps are applied to the source data, although there is considerable flexibility in the order of pipeline elements (a sketch of the Unstructured Analysis Handler phases follows the list):

  1. Structured Analysis Handler, phase 1: fill in the unstructured document-level fields (title, description, full text) from metadata, if needed.
  2. Unstructured Analysis Handler, phase 1: use regexes and javascript to pull new metadata fields out of the unstructured document-level fields.
    1. (Special case: if Tika is specified as the text extraction engine, this step is performed before any Unstructured Analysis Handler phase.)
  3. Unstructured Analysis Handler, phase 2: use regex replaces to transform the source text, if needed.
  4. Unstructured Analysis Handler, phase 3: use regexes and javascript to pull new metadata fields out of the cleansed unstructured document-level fields.
  5. Standard extraction, phase 1 (text extraction): use a "text extractor" to create the text that is submitted to the entity extraction service in the next phase (if needed; often the entity extraction service combines the two phases).
  6. Standard extraction, phase 2 (entity extraction): use an "entity extractor" (e.g. AlchemyAPI) to pull entities and associations out of the submitted text/URL.
  7. Structured Analysis Handler, phase 2: fill in the remaining document-level fields (URL, published date, document geo ... plus the title and description if these returned null before, i.e. in case the UAH has filled in these required fields).
  8. Structured Analysis Handler, phase 3: create new entities from the metadata, and combine entities from all phases into associations.
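
The Unstructured Analysis Handler phases above amount to applying regexes and small javascript functions to the document text. The following is a minimal, self-contained sketch of that kind of logic; the function names, regexes, and sample text are assumptions made up for this example, and this is not the CE scripting interface itself.

    // UAH phase 1 style: pull a new metadata field out of the unstructured text with a regex.
    function extractReportId(fullText) {
        var match = /Report\s+ID:\s*([A-Z0-9-]+)/.exec(fullText);
        return match ? match[1] : null; // would become a metadata field such as "reportId"
    }

    // UAH phase 2 style: cleanse the source text with regex replaces.
    function cleanseText(fullText) {
        return fullText
            .replace(/<[^>]+>/g, " ")   // strip leftover markup
            .replace(/\s{2,}/g, " ");   // collapse runs of whitespace
    }

    // UAH phase 3 style: run the same kind of extraction against the cleansed text.
    var rawText = "<p>Report ID: AB-1234</p>  Flooding reported near the river delta.";
    var cleansed = cleanseText(rawText);
    var metadata = { reportId: extractReportId(cleansed) };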

The pipeline elements can be approximately grouped into the following categories (see the sketch after this list):

  • Extractors: Generate mostly empty CE documents from external data sources.
  • Globals: Generate javascript artifacts that can be used by subsequent pipeline elements.
  • Secondary extractors: Enable new documents to be spawned from the existing metadata.
  • Text extraction: Manipulate the raw document content.
  • Metadata: Generate document metadata such as title, description, and date, plus arbitrary content metadata, using xpath, regex, and javascript.
  • Entities and associations: Create entities and associations out of the text.
  • Storage and indexing: Decide which documents to keep, which fields to keep, and what to "full text index" (for searching via the GUI/API).
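
As a rough sketch of how elements from these categories might sit together in a single source definition, consider the following. The element and field names are assumptions made for illustration only; the authoritative objects and fields are defined in the Source Document Specification and Pipeline Documentation pages referenced later on this page.

    // Illustrative JavaScript object literal only; the element and field names below
    // are assumptions, not the CE source schema.
    var exampleSource = {
        title: "Example RSS news source",
        pipeline: [
            { feed:            { url: "http://example.com/rss" } },                    // extractor
            { globals:         { script: "function clean(s) { return s.trim(); }" } }, // globals
            { textExtraction:  { engine: "default" } },                                 // text extraction
            { docMetadata:     { titleFrom: "rss.title", dateFrom: "rss.pubDate" } },   // metadata
            { entities:        [ { value: "rss.author", type: "Person" } ] },           // entities and associations
            { storageSettings: { fullTextIndex: true } }                                // storage and indexing
        ]
    };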

Creating a Source

The following wiki pages describe the source creation steps:

  1. Extractors
    How to specify the mechanics required to extract data from a source system:
      1. File extractor
      2. Feed extractor
      3. Web extractor
      4. Database extractor
  2. Entities and associations
    An introduction to the Structured Analysis Harvester and how to specify the methods for enriching structured data sources with geographic information, entities, and events (a sketch of these structures appears below):
      1. Specifying Document Level Geographical Location
      2. Manual entities
      3. Manual association of entities
      4. Javascript globals
      5. Transforming data with JavaScript
  3. Metadata
      1. Document metadata
      2. Content metadata

A simple web-based GUI is available in conjunction with the structures described in these pages.
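
To give a feel for the manual entity and association structures referred to above, here is a minimal, hypothetical sketch. The field names are assumptions made for illustration; the authoritative definitions are in the Source Document Specification section below.

    // Hypothetical field names, for illustration only; see the Source Document
    // Specification below for the actual entity and association objects.
    var manualEntity = {
        disambiguated_name: "London",
        type: "Location",
        dimension: "Where",
        geotag: { lat: 51.5074, lon: -0.1278 }
    };

    var manualAssociation = {
        entity1: "example person",
        verb: "visited",
        entity2: "london",
        assoc_type: "Event"
    };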

Source Reference Documents

Source Document Specification

The following links provide detailed information regarding the objects that make up a Source document and the individual fields within each object.

    ...

    ...

Pipeline Documentation

Sample Source Documents

The following sample source documents are provided as an aid to learning how to create your own sources:

Source APIs:
