
The following toolkit elements are available from the Source Builder.


Extractors

DB Extractor

Extract documents from relational database management system (RDBMS) records.

You can connect to the following database types:

mysql, db2, oracle, oracle:thin:sid, mssqlserver, and sybase. Additional types can be added via configuration.

You can connect to the database of your choice and then run queries against it based on specific criteria. New documents are then generated from the query results; these can be used to create metadata, entities, and associations.

For more detailed information, see section Database extractor.
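As an illustration, a database extractor element in a source pipeline might look like the following sketch. The field names here are assumptions based on typical Infinit.e source configurations, not the authoritative schema; see the Database extractor section for the exact fields.

```json
{
    "database": {
        "databaseType": "mysql",
        "hostname": "db.example.com",
        "port": "3306",
        "databaseName": "incidents",
        "query": "SELECT id, title, description, created FROM reports"
    }
}
```

Each row returned by the query would then become a candidate document whose columns are available as metadata for later pipeline stages.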

File Extractor

Extract documents from a variety of local or networked file systems, covering line-separated, XML, JSON, or "Office" file types.

The File Extractor is capable of ingesting files from the following locations:

  • Windows/Samba shares
  • The harvester's local filesystem
  • Amazon S3

The File Extractor supports the following file types:

  • Office documents (Word, PowerPoint, etc.)
  • text-based documents (emails)
  • CSV
  • XML and JSON
  • Infinit.e shares
  • The results of Infinit.e plugins

 

You can configure the file harvester to specify the ingestion behavior using the following key fields:

  • XmlRootLevelValues
  • XmlIgnoreValues
  • XmlSourceName
  • XmlPrimaryKey

For more detailed information, see section File extractor.
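For example, a file extractor element for an XML source might be sketched as follows. The Xml* key fields are the ones listed above; the url and type fields, and all the values shown, are illustrative assumptions - see the File extractor section for the full schema.

```json
{
    "file": {
        "url": "smb://fileserver/share/data/",
        "type": "xml",
        "XmlRootLevelValues": ["Incident"],
        "XmlIgnoreValues": ["IncidentList"],
        "XmlSourceName": "https://tracker.example.com/incident?id=",
        "XmlPrimaryKey": "IncidentID"
    }
}
```

In this sketch, each Incident node in the XML would be split into its own document, keyed by IncidentID.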

Feed Extractor

You can use the Feed Extractor to extract documents from RSS feeds. You will also usually need to pass the extracted data to a Toolkit stage in order to get usable results.

For more detailed information, see section Feed extractor.
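A minimal feed extractor element might look like the sketch below. The extraUrls structure is an assumption based on typical Infinit.e source configurations; consult the Feed extractor section for the authoritative fields.

```json
{
    "feed": {
        "extraUrls": [
            { "url": "http://example.com/news/rss" }
        ]
    }
}
```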

Web Extractor

Extracts documents from XML/HTML pages pointed to by the specified URL.

You can combine RSS feeds and web pages in the same source via configuration. You can also specify whether the returned web pages should be crawled, or simply used to follow additional URL links.

For more detailed information, see section Web extractor.
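A web extractor element might be sketched as follows. The extraUrls and waitTimeOverride_ms fields are assumptions based on typical Infinit.e source configurations; the wait setting is shown only to illustrate polite crawling of the target site.

```json
{
    "web": {
        "extraUrls": [
            { "url": "http://example.com/reports/index.html" }
        ],
        "waitTimeOverride_ms": 10000
    }
}
```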

Federated Query

Register external API calls that are converted into documents when the appropriate IKANOW queries are invoked.

TODO add to source pipeline docs.

Logstash Extractor

Import lightweight records into IKANOW using the Elasticsearch Logstash import engine.

This extractor allows you to create records as opposed to documents.

Info

When logstash is specified as the source, the Source Builder is not available; a separate Logstash editor is provided instead. This extractor type cannot be used in conjunction with any other elements: all other pipeline elements are ignored when this one is specified.

 

Globals

Harvest control settings

Specify control over harvest frequency, duration, etc. For example, you can limit the number of documents that can be harvested for a given source, or distribute a single source across multiple threads.
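A harvest control element might be sketched as below. All field names here are assumptions chosen to match the behaviors described above (harvest frequency, document limits, multi-threading); see the harvest control documentation for the real schema.

```json
{
    "harvest": {
        "searchCycle_secs": 3600,
        "maxDocs": 2500,
        "distributionFactor": 4
    }
}
```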

Add Global Javascript

Specify JavaScript globals that can be used by scripts in any toolkit elements that follow.

You can use this element to declare JavaScript variables and functions that can be re-used by any individual "scriptlets" elsewhere.

For more detailed information, see section Javascript globals.
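For example, a globals element declaring a helper function for later scriptlets might be sketched as follows. The scriptlang and scripts field names are assumptions based on typical Infinit.e source configurations.

```json
{
    "globals": {
        "scriptlang": "javascript",
        "scripts": [
            "function normalizeDate(s) { return new Date(s).toISOString(); }"
        ]
    }
}
```

Any scriptlet in a later pipeline element could then call normalizeDate() directly.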

Add Lookup Tables

When using JavaScript with Infinit.e, you can use lookup tables to access a set of global variables loaded at harvest time from JSON shares, custom tables, or document collections.

For more detailed information, see section Lookup tables.

Secondary Extractors

Specify if web pages/RSS pages should be used to generate documents, or simply crawled for additional URLs to follow.  The behavior can be configured to accommodate both RSS feeds and web pages, within the same Source.

For more detailed information, see section Follow Web links.

Split Documents

Works similarly to Follow Web Links, except that the "splitter" can only be used on file/database sources. For example, using the splitter you can ingest pages from an e-book into Infinit.e and then generate new individual documents, deleting the original.

For more detailed information, see section Follow Web links.

Text Processing

Automated Text Extraction

This toolkit element passes the document text (or URL) to an external extraction engine to return the text that will be used for subsequent text transformation, metadata extraction, or entity extraction.

IKANOW's automated text extraction supports the following engines:

  • Alchemy API* ("alchemyapi" or "alchemyapi-metadata")
  • boilerpipe ("boilerpipe")
  • tika ("tika")
  • Together with the built-in:
    • "raw" - extracts the raw content from the URL, no processing occurs
    • "none" - removes existing text blocks from the document

*Alchemy API can perform both text extraction using the Alchemy API, and feature extraction using the Alchemy metadata API.  The Alchemy API configuration parameters are covered on the Feature extraction page.

For more detailed information, see section Automated text extraction.
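The engine names above ("boilerpipe", "tika", "raw", "none", "alchemyapi") are taken from the list in this section; the element and field names in the sketch below (textEngine, engineName) are assumptions about how they are wired into the pipeline.

```json
{
    "textEngine": {
        "engineName": "boilerpipe"
    }
}
```

With this element in place, subsequent transformation and extraction stages would operate on the boilerpipe-extracted text rather than the raw page.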

Manual Text Transformation

Use one or more of these elements to transform the text fields (particularly fullText) using regex, JavaScript, or XPath.

Using manual text transformation, you can specify the data source for your script to work on. The script enriches the data from the data sources so it can be output as metadata for the creation of advanced entities and associations.

For more detailed information, see section Manual text transformation.
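As an illustration, a regex-based transformation that strips HTML tags from fullText might be sketched as below. The element shape and field names (scriptlang, script, replacement, fieldName) are assumptions; only the regex/JavaScript/XPath options are stated in this section.

```json
{
    "text": [
        {
            "scriptlang": "regex",
            "fieldName": "fullText",
            "script": "<[^>]+>",
            "replacement": " "
        }
    ]
}
```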

Metadata

Document Metadata

This toolkit element allows you to use regex or JavaScript to set the document metadata fields.

When document metadata is extracted from a source (via the File Extractor, Database Extractor, or another technique), each extracted field is captured in the Feed.metadata object. Using document metadata, data stored in the metadata object can be accessed using the $ operator, which signifies that we are attempting to retrieve data from a field in our document.

For more detailed information, see section Document metadata.
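For example, a document metadata element mapping extracted fields onto document fields via the $ operator might be sketched as follows. The $ operator usage follows the description above; the specific document fields and metadata field names shown are hypothetical.

```json
{
    "docMetadata": {
        "title": "$metadata.headline",
        "description": "$metadata.summary",
        "publishedDate": "$metadata.pubdate"
    }
}
```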

Content Metadata

This toolkit element allows you to use regex, JavaScript, or XPath to create metadata objects that can be used to generate entities or associations using other Toolkit elements.

"Flags" can be used to determine how the metadata fields will be stored, making them available for other operations and scripts later in the pipeline for the generation of entities and associations.

For more detailed information, see section Content metadata.
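A content metadata element that pulls a value out of the text with a regex might be sketched as below. The "flags" concept comes from this section; the field names (fieldName, scriptlang, script, flags) and the flag value shown are assumptions - see the Content metadata section for the supported flags and their meanings.

```json
{
    "contentMetadata": [
        {
            "fieldName": "phoneNumber",
            "scriptlang": "regex",
            "script": "\\b\\d{3}-\\d{3}-\\d{4}\\b",
            "flags": "o"
        }
    ]
}
```

The captured values would then be available to later entity and association elements under the named metadata field.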

Entities and Associations

Automated Entities

Manual Entities

Manual Association of Entities

Storage and Indexing Settings

Search Index Settings

Document Storage Settings