Source Pipeline Elements Toolkit

The following toolkit elements are available from the Source Builder.

Extractors

DB Extractor

Extract documents from relational database management system (RDBMS) records.

...

For more in-depth information, see section Database extractor.

File Extractor

Extract documents from a variety of local or networked file systems, covering line-separated, XML, JSON, or "Office" file types.

...

For more detailed information, see section File extractor.

Feed Extractor

Use the Feed extractor to extract documents from RSS feeds.  You usually also need to pass the extracted data to a Toolkit text processing stage to get usable results.

For more detailed information, see section Feed extractor.

Web Extractor

Extract documents from the XML/HTML pointed to by the specified URL.

...

For more detailed information, see section Web extractor.

Federated Query

Register external API calls that are converted into documents when the appropriate IKANOW queries are invoked.

TODO add to source pipeline docs.

Logstash Extractor

Import lightweight records into IKANOW using the Elasticsearch Logstash import engine.

...

Info

When Logstash is specified as the source, no Source Builder is available; a separate LS editor becomes available instead.  This extractor type cannot be used in conjunction with any other elements: all other pipeline elements are ignored when this one is specified.

 

Globals

Harvest control settings

Specify control over harvest frequency, duration, etc.  For example, you can limit the number of documents that can be harvested for a given source, or distribute a single source across multiple threads.

Add Global Javascript

Specify JavaScript globals that can be used by scripts in any toolkit elements that follow.

...

For more detailed information, see section Javascript globals.

Add Lookup Tables

When using JavaScript with Infinit.e, you can use Lookup Tables to access a set of global variables loaded at harvest time from JSON shares, custom tables, or document collections.

For more detailed information, see section Lookup Tables.

Secondary Extractors

Specify whether web pages/RSS pages should be used to generate documents, or simply crawled for additional URLs to follow.  The behavior can be configured to accommodate both RSS feeds and web pages within the same Source.

For more detailed information, see section Follow Web links.

Split Documents

Works similarly to Follow Web Links, except that the "splitter" can only be used on file/database sources.  For example, using the splitter, you can ingest pages from an e-book into Infinit.e and then generate new individual documents, deleting the original.

For more detailed information, see section Follow Web links.
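
As a rough illustration of the splitting idea (not the Infinit.e splitter itself), the following Python sketch breaks one e-book document into per-page documents and discards the original. The `split_document` helper, the dict field names (`title`, `fullText`), and the form-feed page delimiter are all assumptions made for this example.

```python
def split_document(doc, delimiter="\f"):
    """Return one new document per non-empty page of doc["fullText"].

    Hypothetical helper: the field names and the form-feed delimiter are
    assumptions for illustration, not part of the Infinit.e API.
    """
    pages = [p.strip() for p in doc["fullText"].split(delimiter)]
    return [
        {"title": f'{doc["title"]} (page {n})', "fullText": page}
        for n, page in enumerate(pages, start=1)
        if page  # skip empty pages, e.g. after a trailing delimiter
    ]


book = {"title": "Example Book", "fullText": "First page.\fSecond page.\f"}
# Only the derived documents are kept; the original is dropped.
documents = split_document(book)
```

Here `documents` holds two entries, one per page, and `book` itself is no longer part of the result set.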

Text Processing

Automated Text Extraction

This toolkit element passes the document text (or URL) to an external extraction engine to return the text that will be used for subsequent text transformation, metadata extraction, or entity extraction.

...

For more detailed information, see section Automated text extraction.

Manual Text Transformation

Use one or more of these to transform the text fields (particularly fullText) using regex, JavaScript, or XPath.

...

For more detailed information, see section Manual text transformation.
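
To make the regex case concrete, here is a minimal Python sketch of chaining two regex transformations over a document's fullText field. It does not use the toolkit's own configuration syntax; the `transform_full_text` helper and the sample document are hypothetical and exist only to show the idea of applying several transformations in sequence.

```python
import re


def transform_full_text(doc, pattern, replacement):
    """Return a copy of doc with a regex substitution applied to fullText.

    Hypothetical helper for illustration; not part of the Infinit.e API.
    """
    updated = dict(doc)
    updated["fullText"] = re.sub(pattern, replacement, doc["fullText"])
    return updated


doc = {"fullText": "Breaking news.   Read more at our site.  Story text."}
# First transformation: drop a boilerplate sentence.
doc = transform_full_text(doc, r"Read more at our site\.", "")
# Second transformation: collapse runs of whitespace into single spaces.
doc = transform_full_text(doc, r"\s+", " ")
```

Each call plays the role of one transformation element, and the output of one becomes the input of the next.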

Metadata

Document Metadata

This toolkit element allows you to use regex or JavaScript to set the document metadata fields.

...

For more detailed information, see section Document metadata.

Content Metadata

This toolkit element allows you to use regex, JavaScript, or XPath to create metadata objects that can be used to generate entities or associations using other Toolkit elements.

...

For more detailed information, see section Content metadata.
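
The sketch below illustrates, in plain Python, what creating a metadata object from text with a regex might look like. The `extract_content_metadata` helper, the `emails` field name, and the simplified email pattern are assumptions for this example and are not taken from the toolkit.

```python
import re


def extract_content_metadata(full_text, field_name, pattern):
    """Collect every regex match into a list stored under field_name.

    Hypothetical helper for illustration; not part of the Infinit.e API.
    """
    return {field_name: re.findall(pattern, full_text)}


text = "Contact alice@example.com or bob@example.org for details."
# Simplified email pattern, adequate for this sample text only.
metadata = extract_content_metadata(text, "emails", r"[\w.+-]+@[\w-]+\.[\w.]+")
```

A later element could then turn each value in `metadata["emails"]` into an entity.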

Entities and Associations

Automated Entities

This toolkit element passes the document text to an external extraction engine to return entities, associations, and occasionally metadata.

...

For more detailed information, see section Feature extraction.

Manual Entities

This toolkit element enables the generation of one or more types of entities based on the document or content metadata.  The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using JavaScript.

...

For more detailed information, see section Manual entities.

Manual Association of Entities

This toolkit element enables the generation of one or more types of associations between existing entities based on the document or content metadata.  The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using JavaScript.

...

For more detailed information, see section Manual entities.

Storage and Indexing Settings

Search Index Settings

This toolkit element provides top-level control over the search indexing of metadata, entities, and associations.

For more detailed information, see section Search index settings.

Document Storage Settings

This toolkit element provides control over whether documents are stored, and which metadata fields (including special persistent fields) are retained across document updates.

...