Harvest control settings
Overview
Controls a set of generic source management parameters.
Format
{ "display": string, "harvest": { "searchCycle_secs":integer, // How often to run the harvester (copied to SourcePojo when published) "duplicateExistingUrls":Boolean,// If false (defaults to true) then documents matching the URL of any existing document in the community is ignored (copied to SourcePojo when published) "maxDocs_global":Integer, // If specified, limits the total number of documents that can be harvested for a given source - when new documents are harvested exceeding this limit, older documents are deleted to maintain the size "throttleDocs_perCycle":Integer, // If specified, limits the number of documents that can be harvested for a given source (state moves to SUCCESS_ITERATION ie the next harvest cycle, the harvester will pick up again, as above) "maxDocs_perCycle":Integer, // If specified, limits the number of documents that can be harvested for a given source (state moves to SUCCESS - ie this+searchCycle_secs limits document ingest rate) "distributionFactor":Integer, // (EXPERIMENTAL) If specified, attempts to distribute the source across many threads } }
Description
The following table describes the parameters of the harvest control settings configuration.
Field | Description |
---|---|
searchCycle_secs | Optional; if set, the source will only be harvested every "searchCycle_secs" seconds (eg set to 86400 to recheck the source daily; set to -1, or to minus the current value, to disable the source temporarily). |
duplicateExistingUrls | Optional; if false, this source will never duplicate documents that already exist within the community, matching solely on URL (ie even if the processing performed is different). Note that by default, deduplication on URL occurs within a source but not across a community. |
maxDocs_global | If specified, limits the total number of documents that can be harvested for a given source; when newly harvested documents exceed this limit, older documents are deleted to maintain the size. |
throttleDocs_perCycle | If specified, limits the number of documents that can be harvested per cycle (the state moves to SUCCESS_ITERATION, ie the harvester will pick up where it left off on the next harvest cycle). |
maxDocs_perCycle | If specified, limits the number of documents that can be harvested per cycle (the state moves to SUCCESS; ie this together with searchCycle_secs limits the document ingest rate, and the source will not be harvested again until searchCycle_secs has elapsed). |
distributionFactor | (EXPERIMENTAL) If specified, attempts to distribute the source across the number of threads specified. |
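As noted in the table above, searchCycle_secs doubles as a temporary disable switch. A minimal sketch (the pipeline structure follows the examples below; the -1 value comes from the description above):

```
"processingPipeline": [
    //...
    {
        "harvest": {
            "searchCycle_secs": -1 // disables the source until the value is changed back
        }
    },
    //...
]
```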
Examples
Basic
The following code sample demonstrates the basic functionality of the harvest control settings:
"processingPipeline": [ //... { "harvest": { "searchCycle_secs": 3600, "duplicateExistingUrls": false, "maxDocs_global": 5000 } }, //... ]
This control sets the following behavior:
- The source will only be run every hour
- If a document's URL already exists from any other source in that community, the new document will be discarded
- Only the most recent 5000 documents will be retained
Throttling Per Cycle
The following examples compare the different use cases for configuring throttleDocs_perCycle and maxDocs_perCycle.
Compare:
"processingPipeline": [ //... { "harvest": { "searchCycle_secs": 3600, "throttleDocs_perCycle": 100 } }, //... ]
vs
"processingPipeline": [ //... { "harvest": { "searchCycle_secs": 3600, "maxDocs_perCycle": 100 } }, //... ]
In the first case, the control block simply restricts the number of documents that can be processed in a single cycle. The harvester will keep running cycles for this source until there are no more documents to ingest. There are a few reasons why you might want to do this, eg:
- To reduce memory usage, eg if the documents are large
- To reduce the latency with which documents are available to the application
In the second case, you are restricting the number of documents ingested to 100 per hour. This might be desirable, eg if you are accessing a rate-limited service.
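For instance, combining maxDocs_perCycle with a longer search cycle bounds the ingest rate directly; the following illustrative configuration would cap ingest at roughly 100 documents per day:

```
"processingPipeline": [
    //...
    {
        "harvest": {
            "searchCycle_secs": 86400, // one harvest cycle per day
            "maxDocs_perCycle": 100 // at most 100 documents per cycle, ie ~100/day
        }
    },
    //...
]
```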
Distributing a Single Source Across Multiple Threads
"processingPipeline": [ //... { "harvest": { "distributionFactor": 20 } }, //... ]
The distributionFactor parameter will split the source into that number of "sub-sources" that are treated (mostly) like individual sources.
In the example, distributionFactor has been set to 20. Each thread will typically grab 2 sub-sources and process them sequentially. As a result, to distribute a source perfectly across N nodes running T threads, you would need a distribution factor of 2*N*T.
The configuration in the example above would therefore distribute perfectly on a default 2 API node system (2*2*5 = 20, given the default 5 threads per file type per node).
A new split/cycle will not occur until all sub-sources from the previous cycle have completed. As a result, you will see sub-linear performance where the sub-sources take different amounts of time to finish.
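As a worked example of the 2*N*T formula: for a hypothetical cluster of 4 API nodes, each running the default 5 threads per file type, a perfect distribution would need a distributionFactor of 2*4*5 = 40:

```
"processingPipeline": [
    //...
    {
        "harvest": {
            "distributionFactor": 40 // 2 * 4 nodes * 5 threads
        }
    },
    //...
]
```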
Background
The original idea behind Infinit.e was that there would be lots of relatively small sources continuously feeding the platform. As a result, we fixed each source to run in a single thread. (Administrators can control the number of threads; see the configuration template documentation. By default there are 5 threads per file type per node.)
Now we often have a smaller number of larger sources, so we have been experimenting with ways to load balance fragments of a single source across the different threads and nodes.