Harvest control settings

Overview

Controls a set of generic source management parameters.

Format

{
	"display": string,
	"harvest": {
		"searchCycle_secs": integer, // How often to run the harvester (copied to SourcePojo when published)
		"duplicateExistingUrls": boolean, // If false (defaults to true), documents matching the URL of any existing document in the community are ignored (copied to SourcePojo when published)

		"maxDocs_global": integer, // If specified, limits the total number of documents that can be stored for a given source - when newly harvested documents exceed this limit, older documents are deleted to maintain the size
		"throttleDocs_perCycle": integer, // If specified, limits the number of documents that can be harvested in a single cycle (state moves to SUCCESS_ITERATION, ie the harvester picks up where it left off on the next cycle)
		"maxDocs_perCycle": integer, // If specified, limits the number of documents that can be harvested in a single cycle (state moves to SUCCESS - ie this together with searchCycle_secs limits the document ingest rate)

		"distributionFactor": integer // (EXPERIMENTAL) If specified, attempts to distribute the source across the specified number of threads
	}
}

 

Description

The following table describes the parameters of the harvest control settings configuration.

Field / Description

searchCycle_secs

Optional. If set, the source will only be harvested every "searchCycle_secs" seconds (eg set to 86400 to re-check the source daily). Set it to -1, or to minus the current value, to disable the source temporarily.

duplicateExistingUrls

Optional. If false, this source will never duplicate documents that already exist within the community, matching solely on URL, ie even if the processing performed is different.

(Note: by default, deduplication on URL occurs within a source but not within a community.)

maxDocs_global

If specified, limits the total number of documents that can be stored for a given source. When newly harvested documents exceed this limit, older documents are deleted to maintain the size.

throttleDocs_perCycle

If specified, limits the number of documents that can be harvested in a single cycle (the state moves to SUCCESS_ITERATION, ie the harvester picks up where it left off on the next harvest cycle).

maxDocs_perCycle

If specified, limits the number of documents that can be harvested in a single cycle (the state moves to SUCCESS, ie this together with searchCycle_secs limits the document ingest rate; the source will not be harvested again until the next search cycle).

distributionFactor

(EXPERIMENTAL) If specified, attempts to distribute the source across the specified number of threads.

Examples

Basic

The following code sample demonstrates the basic functionality of the harvest control settings:

"processingPipeline": [
	//...
	{
		"harvest": {
			"searchCycle_secs": 3600,
			"duplicateExistingUrls": false,
			"maxDocs_global": 5000
		}
	},
	//...
]

This control sets the following behavior:

  • The source will only be run every hour
  • If a document's URL already exists from any other source in that community, the new document will be discarded
  • Only the most recent 5000 documents will be retained
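
As noted in the description above, a source can also be disabled temporarily by setting searchCycle_secs to a negative value. A minimal sketch (the value shown is illustrative):

"processingPipeline": [
	//...
	{
		"harvest": {
			"searchCycle_secs": -86400
		}
	},
	//...
]

Re-enabling the source is then just a matter of restoring searchCycle_secs to a positive value.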


Throttling Per Cycle

The following examples compare the different use cases for configuring throttleDocs_perCycle and maxDocs_perCycle.

Compare:

"processingPipeline": [
	//...
	{
		"harvest": {
			"searchCycle_secs": 3600,
			"throttleDocs_perCycle": 100
		}
	},
	//...
]

vs

"processingPipeline": [
	//...
	{
		"harvest": {
			"searchCycle_secs": 3600,
			"maxDocs_perCycle": 100
		}
	},
	//...
]

In the first case, the control block is simply restricting the number of documents that can be processed in a single cycle. The harvest will keep running cycles for this source until there are no more documents to ingest. There are a few reasons why you might want to do this, eg:

  • To reduce memory usage, eg if the documents are large
  • To reduce the latency with which documents are available to the application

In the second case, you are restricting the number of documents ingested to 100 per hour. This might be desirable, eg if you are accessing a rate-limited service.
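
For example, to cap ingest from a rate-limited service at roughly 1000 documents per day, maxDocs_perCycle can be combined with a daily search cycle. A sketch (the specific values are illustrative, not recommendations):

"processingPipeline": [
	//...
	{
		"harvest": {
			"searchCycle_secs": 86400,
			"maxDocs_perCycle": 1000
		}
	},
	//...
]

Because the state moves to SUCCESS after each cycle, the source will not be revisited until the next daily cycle, keeping the ingest rate at or below 1000 documents per day.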

 


Distributing a Single Source Across Multiple Threads

The following code sample demonstrates distributing a single source across multiple threads:

"processingPipeline": [
	//...
	{
		"harvest": {
			"distributionFactor": 20
		}
	},
	//...
]

 

The distributionFactor parameter splits the source into the specified number of "sub-sources", which are treated (mostly) like individual sources.

In the example, distributionFactor has been set to 20. Each thread will typically grab 2 sub-sources and process them sequentially. As a result, to distribute a source perfectly across N nodes running T threads, you would need a distribution factor of 2*N*T.

The configuration in the example above would therefore distribute perfectly across a default 2-API-node system (2 nodes running the default 5 threads each gives 2*2*5 = 20).
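
As a further illustration (the cluster size here is hypothetical), to spread a source perfectly across 4 API nodes running the default 5 threads each, you would set a distribution factor of 2*4*5 = 40:

"processingPipeline": [
	//...
	{
		"harvest": {
			"distributionFactor": 40
		}
	},
	//...
]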

A new split/cycle will not occur until all sub-sources from the previous cycle have completed. Therefore you will see sub-linear scaling when the sub-sources take different amounts of time to finish.

 

Currently this has only been tested for the file harvester, where files are split according to a hash of their filenames. There is an implementation for the database harvester, but it is not currently tested or supported, and it requires an extra parameter; users interested in experimenting with it should consult the source code.

Background

The original idea behind Infinit.e was that there would be lots of relatively small sources continuously feeding the platform. As a result, each source was fixed to run in a single thread. (Administrators can control the number of threads; see the configuration template documentation. By default there are 5 threads per file type per node.)

Now we often have a smaller number of larger sources, so we have been experimenting with ways to load-balance fragments of a source across the different threads and nodes.

 
