Source Pipeline Elements Toolkit
The following toolkit elements are available from the Source Builder.
Extractors
DB Extractor
Extract documents from relational database management system (RDBMS) records.
...
For more detailed information, see section Database extractor.
File Extractor
Extract documents from a variety of local or networked file systems, covering line-separated, XML, JSON, or "Office" file types.
...
For more detailed information, see section File extractor.
Feed Extractor
Extract documents from RSS feeds. You usually also need to pass the extracted data to a Toolkit text processing stage to get usable results.
For more detailed information, see section Feed extractor.
Web Extractor
Extract documents from the XML/HTML at the specified URL.
...
For more detailed information, see section Web extractor.
Federated Query
Register external API calls that are converted into documents when the appropriate IKANOW queries are invoked.
TODO add to source pipeline docs.
Logstash Extractor
Import lightweight records into IKANOW using the Elasticsearch Logstash import engine.
...
Note: When Logstash is specified as the source, the Source Builder is not available, and a separate Logstash editor is provided instead. This extractor type cannot be combined with any other elements: all other pipeline elements are ignored when it is specified.
Globals
Harvest control settings
Control harvest frequency, duration, and related settings. For example, you can limit the number of documents that can be harvested for a given source, or distribute a single source across multiple threads.
Add Global Javascript
Specify JavaScript globals that can be used by scripts in any toolkit elements that follow.
...
For more detailed information, see section Javascript globals.
Add Lookup Tables
When using JavaScript with Infinit.e, you can use lookup tables to access a set of global variables loaded at harvest time from JSON shares, custom tables, or document collections.
For more detailed information, see section Lookup tables.
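As a rough illustration of the idea, the sketch below treats a lookup table as a plain JavaScript object keyed by a term. The table name `countryLookup`, its contents, and the helper `lookupCountry` are all hypothetical; how Infinit.e actually exposes a loaded table to a script is covered in the Lookup tables section.

```javascript
// Hypothetical lookup table, standing in for data loaded at harvest time
// (for example from a JSON share). Name and contents are assumptions.
var countryLookup = {
  "paris": "France",
  "berlin": "Germany",
  "tokyo": "Japan"
};

// Return the mapped value for a term, or null when the term is not in the table.
// Keys are normalized (trimmed, lower-cased) before the lookup.
function lookupCountry(term) {
  var key = String(term).trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(countryLookup, key)
    ? countryLookup[key]
    : null;
}
```

A script in a later toolkit element could then call `lookupCountry(someCityName)` to enrich a document with a value from the table.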
Secondary Extractors
Follow Web Links
Specify whether web pages or RSS pages should be used to generate documents, or simply crawled for additional URLs to follow. The behavior can be configured to accommodate both RSS feeds and web pages within the same source.
For more detailed information, see section Follow Web links.
Split Documents
Works similarly to Follow Web Links, except that the splitter can only be used on file and database sources. For example, using the splitter, you can ingest pages from an e-book into Infinit.e and then generate new individual documents, deleting the original.
For more detailed information, see section Follow Web links.
Automated Text Extraction
This toolkit element passes the document text (or URL) to an external extraction engine to return the text that will be used for subsequent text transformation, metadata extraction, or entity extraction.
...
For more detailed information, see section Automated text extraction.
Manual Text Transformation
Use one or more of these elements to transform the text fields (particularly fullText) using regex, JavaScript, or XPath.
...
For more detailed information, see section Manual text transformation.
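To give a feel for the kind of regex-based cleanup a manual text transformation might perform, here is a minimal JavaScript sketch. The helper name `cleanFullText` is hypothetical and is not part of the toolkit; it only shows the general technique of stripping markup-like fragments and normalizing whitespace in a text field such as fullText.

```javascript
// Hypothetical helper, not a toolkit API: a plain regex-based text cleanup.
function cleanFullText(fullText) {
  return String(fullText)
    .replace(/<[^>]*>/g, " ") // replace anything that looks like a tag with a space
    .replace(/\s+/g, " ")     // collapse runs of whitespace into a single space
    .trim();                  // drop leading and trailing whitespace
}
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being joined together.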
Metadata
Document Metadata
This toolkit element allows you to use regex or JavaScript to set the document metadata fields.
...
For more detailed information, see section Document metadata.
Content Metadata
This toolkit element allows you to use regex, JavaScript, or XPath to create metadata objects that can be used to generate entities or associations using other Toolkit elements.
...
For more detailed information, see section Content metadata.
Entities and Associations
Automated Entities
This toolkit element passes the document text to an external extraction engine, which returns entities, associations, and occasionally metadata.
...
For more detailed information, see section Feature extraction.
Manual Entities
This toolkit element enables the generation of one or more types of entities based on the document or content metadata. The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using JavaScript.
...
For more detailed information, see section Manual entities.
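The difference between the two expression styles can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's implementation: the `$path.to.field` placeholder syntax, the `_doc` variable name, and the function `resolveExpression` are all hypothetical. The actual syntax is described in the Manual entities section.

```javascript
// Illustrative only: the "$path.to.field" placeholder syntax and the `_doc`
// variable name are assumptions, not documented toolkit behaviour.
function resolveExpression(expression, doc) {
  var scriptMatch = /^\$SCRIPT\(([\s\S]*)\)$/.exec(expression.trim());
  if (scriptMatch) {
    // Treat the text inside $SCRIPT(...) as a function body that returns a string.
    var fn = new Function("_doc", scriptMatch[1]);
    return String(fn(doc));
  }
  // Otherwise treat the expression as a replacement string and substitute
  // each $path.to.field placeholder with the matching value from the document.
  return expression.replace(/\$([A-Za-z_][\w.]*)/g, function (whole, path) {
    var value = path.split(".").reduce(function (obj, key) {
      return obj == null ? undefined : obj[key];
    }, doc);
    // Leave the placeholder untouched when the field is missing.
    return value == null ? whole : String(value);
  });
}

// Example usage with a made-up document.
var exampleDoc = { title: "Notes", metadata: { author: "Ada" } };
var fromReplacement = resolveExpression("$metadata.author", exampleDoc);
var fromScript = resolveExpression("$SCRIPT( return _doc.title.toUpperCase(); )", exampleDoc);
```

The replacement-string form suits simple field copies, while the script form allows arbitrary string manipulation before the value is returned.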
Manual Association of Entities
This toolkit element enables the generation of one or more types of associations between existing entities based on the document or content metadata. The expressions default to replacement strings, or $SCRIPT(...) can be used to return a string using JavaScript.
...
For more detailed information, see section Manual entities.
Storage and Indexing Settings
Search Index Settings
This toolkit element provides top-level control over the search indexing of metadata, entities, and associations.
For more detailed information, see section Search index settings.
Document Storage Settings
This toolkit element provides control over whether documents are stored, and which metadata fields, including special persistent fields, are retained across document updates.
...