Source Configuration
- Craig Vitter
- andrew johnston
- AlexI
- Caleb Burch
Overview of the Data Harvesting Process
The Community Edition (CE) platform features a robust set of data harvesters that give it powerful data extraction and transformation (enrichment) capabilities. CE's harvesters are designed to consume data from a variety of sources and media types including:
- Web-based content accessible via URL, including:
  - Static HTML content;
  - RSS and Atom news feeds;
  - RESTful web service interfaces.
- Traditional relational database management systems (RDBMS) via Java Database Connectivity (JDBC) drivers;
- Files located on local and network attached storage devices.
Source Pipeline
Harvesting and enrichment form a logical process built around applying a pipeline of processing elements to the documents emitted by a source.
The high-level steps below are applied to the source data, although there is considerable flexibility in the order of pipeline elements.
The pipeline elements can be roughly grouped into the following categories:
- Extractors: Generate mostly empty CE documents from external data sources.
- Globals: Generate JavaScript artifacts that can be used by subsequent pipeline elements.
- Secondary extractors: Enable new documents to be spawned from the existing metadata.
- Text extraction: Manipulate the raw document content.
- Metadata: Generate document metadata such as title, description, and date, as well as arbitrary content metadata using XPath, regex, and JavaScript.
- Entities and associations: Create entities and associations out of the text.
- Storage and indexing: Decide which documents to keep, what fields to keep, and what to "full text index" (for searching using the GUI/API).
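The pipeline idea above can be illustrated with a small sketch. This is not CE code and does not use the CE source format: the `Document` class, the stage functions, and `run_pipeline` are all hypothetical names chosen for illustration, and the regex-based "entity" detection is a deliberately naive stand-in for real enrichment. It only shows how an extractor can emit mostly empty documents that later stages fill in and filter.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a CE document (not the real schema)."""
    url: str
    full_text: str = ""
    title: str = ""
    metadata: Dict[str, list] = field(default_factory=dict)
    entities: List[str] = field(default_factory=list)


Stage = Callable[[Iterator[Document]], Iterator[Document]]


def extractor(urls: Iterable[str]) -> Iterator[Document]:
    """Extractor: emit one mostly empty document per source item."""
    for url in urls:
        yield Document(url=url)


def text_extraction(raw_by_url: Dict[str, str]) -> Stage:
    """Text extraction: attach (and tidy) the raw content of each document."""
    def stage(docs: Iterator[Document]) -> Iterator[Document]:
        for doc in docs:
            doc.full_text = raw_by_url.get(doc.url, "").strip()
            yield doc
    return stage


def metadata(docs: Iterator[Document]) -> Iterator[Document]:
    """Metadata: derive a title and pull four-digit years out with a regex."""
    for doc in docs:
        lines = doc.full_text.splitlines()
        doc.title = lines[0] if lines else ""
        doc.metadata["years"] = re.findall(r"\b(?:19|20)\d{2}\b", doc.full_text)
        yield doc


def entities(docs: Iterator[Document]) -> Iterator[Document]:
    """Entities: naive guess that two capitalized words in a row are a name."""
    for doc in docs:
        found = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", doc.full_text)
        doc.entities = sorted(set(found))
        yield doc


def storage(docs: Iterator[Document]) -> Iterator[Document]:
    """Storage: decide which documents to keep (here: drop empty ones)."""
    for doc in docs:
        if doc.full_text:
            yield doc


def run_pipeline(source: Iterator[Document], stages: List[Stage]) -> List[Document]:
    """Apply each stage, in order, to the stream of documents from the source."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


if __name__ == "__main__":
    raw = {
        "http://example.com/a": "Quarterly Report\nJane Smith joined in 2014.",
        "http://example.com/b": "",
    }
    kept = run_pipeline(
        extractor(raw.keys()),
        [text_extraction(raw), metadata, entities, storage],
    )
    for doc in kept:
        print(doc.url, doc.title, doc.metadata, doc.entities)
```

Because each stage is just a function from a document stream to a document stream, reordering the list passed to `run_pipeline` reorders the processing, which mirrors the flexibility in element order described above.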
Creating a Source
The following wiki pages describe the source creation steps:
- Extractors
How to specify the mechanics required to extract data from a source system.
- Entities and associations
An introduction to the Structured Analysis Harvester and how to specify the methods for enriching structured data sources with geographic information, entities, and events.
- Metadata
A simple web-based GUI is available in conjunction with the structures described in these pages.
Source Reference Documents
Source Document Specification
The following links provide detailed information regarding the objects that make up a Source document and the individual fields within each object.
Sample Source Documents
The following sample source documents are provided as an aid to learning how to create your own sources:
- RSS Feed Source
- MySQL Database Source
- XML File Source
- Unstructured File Source
- Log File Source
- JSON File Source
Source APIs: