...
Code Block |
---|
language | javascript |
---|
title | Feed Harvester configuration |
---|
|
"rss": {
    "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)
    "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms)
        // eg if a site is timing out, it may be limiting the number of accesses from a given IP - set the number higher
        // for large sites you can increase the performance of the harvester by setting this number lower
    "updateCycle_secs": integer, // Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered again in the RSS feed or in "extraUrls"
    "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
    "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested
    "extraUrls": [ // This array allows for manually specified URLs to be harvested once
        {
            "url": string, // The URL
            "title": string, // The title that the document will be given (ie the equivalent to the RSS title)
            "description": string, // (Optional) The description that the document will be given (ie the equivalent to the RSS description)
            "fullText": string // (Optional) If present and "useTextExtractor" is "none", then uses the specified string instead of the URL contents (mainly for debugging)
        },
        //etc
    ],
    "userAgent": string, // (Optional) If present, overrides the system default user agent string
    "searchConfig": { ... } // (Optional) A complex configuration object that allows the contents of URLs to be used to generate more URLs/docs to harvest
} |
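As an illustration, the fields above might be filled in as follows. All values (URLs, regexes, timings) are hypothetical, and the `shouldHarvest` helper is not part of the harvester: it only sketches how "regexInclude" and "regexExclude" could interact, assuming a partial (find-style) match rather than a full-string match, which the field descriptions do not specify.

```javascript
// Hypothetical example values; none of these come from a real source.
const rss = {
  waitTimeOverride_ms: 20000, // slower than the 10000ms default, for a site that throttles by IP
  updateCycle_secs: 86400,    // URLs older than one day may be replaced when seen again
  regexInclude: "/news/",
  regexExclude: "\\.(pdf|jpg)$",
  extraUrls: [
    { url: "http://www.example.com/news/special-report.html", title: "Special report" }
  ]
};

// Illustrative only: assumes a URL must match regexInclude (if set)
// and must not match regexExclude (if set).
function shouldHarvest(url, config) {
  if (config.regexInclude && !new RegExp(config.regexInclude).test(url)) return false;
  if (config.regexExclude && new RegExp(config.regexExclude).test(url)) return false;
  return true;
}

console.log(shouldHarvest("http://www.example.com/news/story1.html", rss));  // true
console.log(shouldHarvest("http://www.example.com/news/chart.pdf", rss));    // false
console.log(shouldHarvest("http://www.example.com/sport/story2.html", rss)); // false
```

If the harvester in fact requires the regex to match the whole URL, the patterns would need leading and trailing `.*`.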
The "searchConfig" field is described in more detail TODO. Here is its format for reference:
Code Block |
---|
|
"searchConfig": {
    "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
    "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
        // [ { "url": string, "title": string /* optional-ish */,
        //     "description": string /* optional */, "publishedDate": string /* optional */,
        //     "spiderOut": string /* optional */ }, ... ]
    "scriptlang": string, // (Mandatory) Currently only "javascript" is supported
    "pageChangeRegex": string, // (Optional) If non-null, this regex is used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
        // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected
    "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage
    "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go
    "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page
    "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set
    "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go
} |
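To make the pagination and script fields more concrete, here is a minimal sketch. It reflects only one reading of the comments above, not the harvester's actual code: the `buildPageUrls` helper, the example URLs and the regex values are all hypothetical, and `eval` merely stands in for whatever script engine the harvester uses.

```javascript
// Hypothetical values for illustration.
const searchConfig = {
  pageChangeRegex: "start=(\\d+)", // group 1 captures the start offset already in the URL
  pageChangeReplace: "start=$1",   // $1 is filled in with the computed offset
  numPages: 3,
  numResultsPerPage: 10
};

// Sketch: assumes the offset for page N is (offset from the URL) + N * numResultsPerPage.
function buildPageUrls(url, config) {
  const regex = new RegExp(config.pageChangeRegex);
  const match = regex.exec(url);
  if (!match) return [url]; // no pagination parameter found, so only the original URL is used
  const start = parseInt(match[1], 10) || 0;
  const urls = [];
  for (let page = 0; page < config.numPages; page++) {
    const offset = start + page * config.numResultsPerPage;
    const replacement = config.pageChangeReplace.replace("$1", String(offset));
    // A function replacer avoids any special handling of "$" in the replacement text.
    urls.push(url.replace(regex, () => replacement));
  }
  return urls;
}

console.log(buildPageUrls("http://www.example.com/search?q=test&start=5", searchConfig));
// offsets used: start=5, start=15, start=25

// A hypothetical "script" value: its last evaluated statement is the result array.
const exampleScript = [
  'var results = [];',
  'results.push({ "url": "http://www.example.com/doc1.html", "title": "Doc 1", "spiderOut": "true" });',
  'results;'
].join("\n");

// eval returns the value of the last statement, mirroring the "return" convention described above.
const scriptOutput = eval(exampleScript);
console.log(scriptOutput[0].url);
```

Note that "spiderOut" is given as the string "true" rather than a boolean, matching the field type listed in the format above.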