Feed object
JSON format
Note that there is a separate overview of how to use the Feed Harvester. This page is reference information.
Feed Harvester configuration
"rss": { "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported) "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms): // ie if a site is timing out it may limit the number of accesses from a given IP - set the number higher // for large sites you can increase the performance of the harvester by setting this number lower "updateCycle_secs": integer, // Optional - if present harvested URLs may be replaced if they are older than this time and are encountered from the RSS or in the "extraUrls" "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested "extraUrls": [ // This array allows for manually specified URLs to be harvested once { "url": string, // The URL "title": string, // The title that the document will be given (ie the equivalent to the RSS title) "description": string, // (Optional) The description that the document will be given (ie the equivalent to the RSS description) "publishedData": string, // (Optional) The date that will be assigned to the document (default: now) - this can be overridden from "structuredAnalysis" "fullText": string // (Optional) If present and "useTextExtractor" is "none", then uses the specified string instead of the URL contents (mainly for debugging) }, //etc ], "userAgent": string, // (Optional) If present overrides the system default user agent string "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" "cookies": string, // (Optional) appends this string to the "Cookies" field (can included multiple semi-colon separated cookie values) "searchConfig": { ... 
} // (Optional) A complex configuration object that allows the contents of URLs to be used generate more URLs/docs to harvest }
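As a minimal illustration, a feed configuration might combine a rate-limit override, an exclusion regex, and one manually specified URL. Note this is a hypothetical sketch using only a subset of the fields above; the site URL and user agent string are invented:

```json
"rss": {
    "waitTimeOverride_ms": 5000,
    "regexExclude": ".*\\.pdf$",
    "extraUrls": [
        {
            "url": "http://www.example.com/news/article1.html",
            "title": "Example article",
            "description": "A manually specified URL, harvested once"
        }
    ],
    "userAgent": "Mozilla/5.0 (compatible; ExampleHarvester/1.0)"
}
```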
The "searchConfig" field is described in more detail here. Here is its format for reference:
"searchConfig": { "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide) "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" "cookies": string, // (Optional) appends this string to the "Cookies" field (can included multiple semi-colon separated cookie values). If not specified then uses rss.cookies (if specified). "globals": string, // Optional Javascript that is evaluated before script or extraMeta (ie to define global functions) "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format: // [ { "url": string, "title": string /* optional-ish */, // "description": string /* optional */, publishedDate: string /* optional */, // "spiderOut": string /*optional */ } "scriptlang": string, // (Mandatory) Only "javascript" is supported, use extraMeta for different script types "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see "Using The Feed Harvester" overview "pageChangeRegex": string, // (Optional) If non-null, this regex should be used to match the pagination URL parameter (which will be replaced by pageChangeReplace) // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false) then will stop harvesting as soon as an already harvested link is encountered // (for 
APIs that return docs in time order, ensures that no time is wasted harvesting and then discarding duplicate links) "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go }
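For illustration, a hypothetical "searchConfig" that pages through a JSON search API might look like the sketch below. The API URL parameter ("start="), the response field names (results, link, title, snippet), and the assumption that the fetched page content is exposed to the script as a variable named "text" are all invented for this example - consult the usage guide for the actual script environment:

```json
"searchConfig": {
    "script": "var json = eval('(' + text + ')'); var retVal = []; for (var i in json.results) { var r = json.results[i]; retVal.push({ 'url': r.link, 'title': r.title, 'description': r.snippet }); } retVal;",
    "scriptlang": "javascript",
    "pageChangeRegex": "start=(\\d+)",
    "pageChangeReplace": "start=$1",
    "numPages": 5,
    "numResultsPerPage": 10,
    "waitTimeBetweenPages_ms": 2000
}
```

Here the last statement evaluated ("retVal;") returns the array of URL objects, and the harvester rewrites the "start=" parameter on each pass, fetching 5 pages of 10 results with a 2-second pause between pages.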