Source Pipeline Elements

Source Pipeline Elements Toolkit

The following toolkit elements are available from the Source Builder.

Extractors

DB Extractor

Extract documents from relational database management system (RDBMS) records.

You can connect to the following database types:

mysql, db2, oracle, oracle:thin:sid, mssqlserver, sybase.  (Additional types can be added via configuration)

You can connect to the database of your choice and then run queries against it based on specific criteria.  New documents are then generated from the query results, and these can be used to create metadata, entities, and associations.
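
As a rough sketch only (the element and field names shown here are assumptions rather than the authoritative schema, which is covered in section Database extractor), a database extraction step might look something like this:

    // Illustrative sketch only: element and field names are assumptions,
    // not the authoritative schema (see section Database extractor).
    var databaseElement = {
        "database": {
            "databaseType": "mysql",                       // one of the supported types listed above
            "hostname": "db.example.com",                  // hypothetical host
            "port": "3306",
            "databaseName": "customers",                   // hypothetical database
            "query": "SELECT id, name, notes FROM orders"  // each returned row becomes a new document
        }
    };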

For more detailed information, see section Database extractor.

File Extractor

Extract documents from a variety of local or networked file systems, covering line-separated, XML, JSON, or "Office" file types.

The File Extractor is capable of ingesting files from the following locations:

  • Windows/Samba shares
  • The harvester's local filesystem
  • Amazon S3

The File Extractor supports the following file types:

  • Office documents (Word, PowerPoint, etc.)
  • Text-based documents (emails)
  • CSV
  • XML and JSON
  • Shares
  • The results of Plugins

You can configure the file harvester to specify the ingestion behaviour using the following key fields (a configuration sketch follows the list):

  • XmlRootLevelValues
  • XmlIgnoreValues
  • XmlSourceName
  • XmlPrimaryKey

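For example, ingesting XML files from a share might use an element along the following lines.  This is a sketch only: the Xml* keys are the fields listed above, but the surrounding element and field names are assumptions, so refer to section File extractor for the exact schema.

    // Illustrative sketch only: apart from the Xml* keys listed above, the
    // element and field names are assumptions (see section File extractor).
    var fileElement = {
        "file": {
            "url": "smb://fileserver.example.com/reports/",               // hypothetical Samba share
            "type": "xml",
            "XmlRootLevelValues": ["record"],                             // each <record> element becomes a document
            "XmlIgnoreValues": ["comment"],                               // skip these elements
            "XmlSourceName": "http://intranet.example.com/reports?id=",   // used to build each document's URL
            "XmlPrimaryKey": "id"                                         // appended to XmlSourceName to identify documents
        }
    };
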
For more detailed information, see section File extractor.

Feed Extractor

You can use the Feed Extractor to extract documents from RSS feeds.  You will usually also need to pass the extracted data to a text processing stage in order to get usable results.
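
As a sketch only (the element and field names are assumptions; see section Feed extractor and section Automated text extraction for the real schema), a minimal feed configuration followed by a text processing stage might look like this:

    // Illustrative sketch only: element and field names are assumptions
    // (see sections Feed extractor and Automated text extraction).
    var feedPipeline = [
        { "feed": { "extraUrls": [ { "url": "http://news.example.com/rss" } ] } },  // hypothetical RSS feed
        { "textEngine": { "engineName": "boilerpipe" } }   // text processing stage to produce usable full text
    ];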

For more detailed information, see section Feed extractor.

Web Extractor

Extracts documents from the XML/HTML content pointed to by the specified URL.

You can combine RSS feeds and web pages in the same source through configuration.  You can also specify whether the returned web pages should be converted into documents or simply crawled for additional URL links to follow.

For more detailed information, see section Web extractor.

Federated Query

Register external API calls that are converted into documents when the appropriate IKANOW queries are invoked.

Logstash Extractor

Import lightweight records into IKANOW using the Elasticsearch Logstash import engine.

This extractor allows you to create records as opposed to documents.

When Logstash is specified as the source type, the Source Builder is not available; instead, a separate Logstash (LS) editor is provided.  This extractor type cannot be used in conjunction with any other elements - all other pipeline elements are ignored when it is specified.

Globals

Harvest control settings

Specify control over harvest frequency, duration, etc.  For example, you can limit the number of documents that can be harvested for a given source, or distribute a single source across multiple threads.
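
As a sketch only (the field names are assumptions; see the harvest control settings documentation for the exact options), such an element might look like:

    // Illustrative sketch only: field names are assumptions, not the exact schema.
    var harvestElement = {
        "harvest": {
            "searchCycle_secs": 3600,     // how often the source is harvested
            "maxDocs": 5000,              // limit on the number of documents harvested for this source
            "distributionFactor": 4       // distribute this single source across 4 harvester threads
        }
    };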

Add Global Javascript

Specify javascript globals that can be used by scripts in any toolkit elements that follow.

You can use this element to declare javascript variables and functions that can be re-used by any individual "scriptlets" elsewhere.
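
For example, a helper function declared here can be called later from any scriptlet in a metadata or entity element.  The "globals"/"scripts" wrapper below is an assumption about the element layout; the javascript itself is standard:

    // Illustrative sketch only: the "globals"/"scripts" wrapper names are assumptions
    // (see section Javascript globals); the declared function can then be used by later scriptlets.
    var globalsElement = {
        "globals": {
            "scripts": [
                "function normalizeName(name) { return name.replace(/\\s+/g, ' ').trim().toLowerCase(); }"
            ]
        }
    };
    // A later scriptlet could then call, for example:  normalizeName(' John   SMITH ')  ->  'john smith'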

For more detailed information, see section Javascript globals.

Add Lookup Tables

When using Javascript, you can use lookup tables to access a set of global variables loaded at harvest time from JSON shares, custom tables, or document collections.

For more detailed information, see section Lookup Tables.

Secondary Extractors

Specify whether web pages/RSS pages should be used to generate documents, or simply crawled for additional URLs to follow.  The behaviour can be configured to accommodate both RSS feeds and web pages within the same Source.

For more detailed information, see section Follow Web links.

Split Documents

Works similarly to Follow Web Links, except that the "splitter" can only be used on file/database sources.  For example, using the splitter you can ingest an e-book into the platform, generate new individual documents from its pages, and delete the original document.

For more detailed information, see section Follow Web links.

Text Processing

Automated Text Extraction

This toolkit element passes the document text (or URL) to an external extraction engine to return the text that will be used for subsequent text transformation, metadata extraction, or entity extraction.

IKANOW automated text extraction supports the following engines:

  • Alchemy API* ("alchemyapi" or "alchemyapi-metadata")
  • boilerpipe ("boilerpipe")
  • tika ("tika")
  • Together with the built-in:
    • "raw" - extracts the raw content from the URL, no processing occurs
    • "none" - removes existing text blocks from the document

*Alchemy API can perform both text extraction using the Alchemy API, and feature extraction using the Alchemy metadata API.  The Alchemy API configuration parameters are covered on the Feature extraction page.
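
Selecting one of the engines listed above is done with a single pipeline element.  As a sketch only (the element wrapper is an assumption; the engine name strings are those shown above), it might look like:

    // Illustrative sketch only: the "textEngine"/"engineName" wrapper is an assumption;
    // the engine name strings ("boilerpipe", "tika", "alchemyapi", "raw", "none") are listed above.
    var textElement = {
        "textEngine": { "engineName": "boilerpipe" }   // strip boilerplate and keep the main article text
    };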

For more detailed information, see section Automated text extraction.

Manual Text Transformation

Use one or more of these elements to transform the text fields (particularly fullText) using regex, javascript, or XPath.

Using manual text transformation, you can specify the data source for your script to work on.  The script enriches the data from that source so it can be output as metadata for the creation of advanced entities and associations.
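
As an illustration only (the element layout and field names are assumptions; see section Manual text transformation), a javascript transformation of fullText might be sketched as:

    // Illustrative sketch only: element layout and field names are assumptions
    // (see section Manual text transformation); the script body is plain javascript.
    var manualTextElement = {
        "text": [
            {
                "fieldName": "fullText",
                "scriptlang": "javascript",
                // hypothetical scriptlet: strip HTML tags and collapse whitespace before later stages run
                "script": "text.replace(/<[^>]+>/g, ' ').replace(/\\s+/g, ' ')"
            }
        ]
    };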

For more detailed information, see section Manual text transformation.

Metadata

Document Metadata

This toolkit element allows you to use regex or javascript to set the document metadata fields.

When document metadata is extracted from a source (via the File, Database, or other extractor), each extracted field is captured in the Feed.metadata object.  Using document metadata, data stored in the metadata object can be accessed using the $ operator, which signifies that a value is being retrieved from a field in the document.
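
For example (a sketch only: the element wrapper is an assumption, while the $ prefix is the operator described above; see section Document metadata), document fields might be set from extracted metadata like this:

    // Illustrative sketch only: the "docMetadata" wrapper is an assumption;
    // the $ prefix retrieves a value from the document's metadata object, as described above.
    var docMetadataElement = {
        "docMetadata": {
            "title": "$metadata.headline",       // copy the extracted 'headline' field into the document title
            "description": "$metadata.summary"   // copy the extracted 'summary' field into the description
        }
    };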

For more detailed information, see section Document metadata.

Content Metadata

This toolkit element allows you to use regex, javascript, or xpath to create metadata objects that can be used to generate entities or associations using other Toolkit elements.
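
As a sketch only (the field names and flag value are assumptions; see section Content metadata for the supported script languages and flags), a content metadata element might look like:

    // Illustrative sketch only: field names and the flag value are assumptions
    // (see section Content metadata for the real options).
    var contentMetadataElement = {
        "contentMetadata": [
            {
                "fieldName": "ticker",
                "scriptlang": "regex",
                "script": "\\b[A-Z]{2,5}\\b",   // capture stock-ticker-like tokens from the text
                "flags": "o"                    // hypothetical flag controlling how the matches are stored
            }
        ]
    };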

"Flags" can be used to determine how the metadata fields will be stored, making them available for other operations and scripts later in the pipeline for the generation of entities and associations.

For more detailed information, see section Content metadata.

Entities and Associations

Automated Entities

This toolkit element passes the document text to an external extraction engine to return entities and associations and occasionally metadata.

IKANOW supports the following feature extraction engines:

  • Textrank* ("textrank")
  • OpenCalais* ("opencalais")
  • AlchemyAPI** ("alchemyapi")
  • AlchemyAPI-metadata** ("alchemyapi-metadata")
  • salience* ("salience")
  • regex* - a mechanism for converting regexes into entities from text or metadata ("regex")

*requires a text extractor beforehand.

**includes its own built-in text extractor, though it can also run behind an alternative text extractor.
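
For example (a sketch only: the element wrappers are assumptions; the engine name strings are those listed above), pairing a text extractor with a feature extraction engine might look like:

    // Illustrative sketch only: the element wrappers are assumptions; the engine name
    // strings are those listed above ("opencalais" requires a text extractor beforehand).
    var entityExtractionStages = [
        { "textEngine": { "engineName": "boilerpipe" } },     // text extraction runs first
        { "featureEngine": { "engineName": "opencalais" } }   // then entities and associations are extracted
    ];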

For more detailed information, see section Feature extraction.

Manual Entities

This toolkit element enables the generation of one or more types of entities based on the document or content metadata.  The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using javascript.

For example, you can specify the entity fields from the imported metadata using $metadata.  You can also iterate over metadata arrays to populate entity fields.
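
As a sketch only (the entity field names are assumptions; the $metadata and $SCRIPT(...) forms are those described above; see section Manual entities), a manual entity specification might be:

    // Illustrative sketch only: entity field names are assumptions
    // (see section Manual entities); $metadata.x and $SCRIPT(...) are described above.
    var manualEntitiesElement = {
        "entities": [
            {
                "disambiguated_name": "$metadata.author",   // one entity per value of the 'author' metadata field
                "type": "Person",
                "dimension": "Who"
            },
            {
                "disambiguated_name": "$SCRIPT( return _value.toLowerCase(); )",   // hypothetical scriptlet form
                "type": "Keyword",
                "dimension": "What"
            }
        ]
    };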

For more detailed information, see section Manual entities.

Manual Association of Entities

This toolkit element enables the generation of one or more types of associations between existing entities based on the document or content metadata.  The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using javascript.

For example, you can specify the associations fields from the imported metadata using $metadata.  You can also iterate over metadata arrays to create associations.
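
As a sketch only (the association field names are assumptions; see the manual association documentation), an association between two previously created entities might be specified as:

    // Illustrative sketch only: association field names are assumptions.
    var manualAssociationsElement = {
        "associations": [
            {
                "entity1": "$metadata.author",   // subject entity (created earlier in the pipeline)
                "verb": "wrote",
                "entity2": "$metadata.title"     // object entity
            }
        ]
    };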

For more detailed information, see section Manual association of entities.

Storage and Indexing Settings

Search Index Settings

This toolkit element provides top-level control over the search indexing of metadata, entities, and associations.

For more detailed information, see section Search index settings.

Document Storage Settings

This toolkit element provides control over whether documents are stored, and over which metadata fields (including special persistent fields) are retained across document updates.

For more detailed information, see section Document storage settings.

Related documentation:

Source Pipeline Documentation