...

Field | Description
feedType

Currently unused - reserved for selecting RSS vs Atom in future releases (only RSS is supported at present)

waitTimeOverride_ms

Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms). If a site is timing out, it may be limiting the number of accesses from a given IP, so set the number higher. For large sites, setting this number lower can increase the performance of the harvester.

updateCycle_secs

Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered again in the RSS feed or in "extraUrls"

regexInclude

Optional - if specified, only URLs matching the regex will be harvested

regexExclude

Optional - if specified, any URLs matching the regex will not be harvested
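
The interaction of "regexInclude" and "regexExclude" can be illustrated with a minimal Python sketch. It is not the harvester's own code: the function name should_harvest is hypothetical, and two behaviours are assumptions that the descriptions above do not confirm, namely that a pattern may match anywhere in the URL and that an exclude match takes precedence over an include match.

```python
import re
from typing import Optional


def should_harvest(url: str,
                   regex_include: Optional[str] = None,
                   regex_exclude: Optional[str] = None) -> bool:
    """Return True if the URL passes the include/exclude filters.

    Assumptions (not confirmed by the field descriptions): patterns may
    match anywhere in the URL, and an exclude match wins over an include match.
    """
    if regex_include is not None and not re.search(regex_include, url):
        return False  # an include filter is set and the URL does not match it
    if regex_exclude is not None and re.search(regex_exclude, url):
        return False  # the URL matches the exclude filter
    return True


# Hypothetical URLs, used only to exercise the filter.
urls = [
    "http://example.com/news/2012/story.html",
    "http://example.com/news/2012/video.html",
    "http://example.com/about.html",
]
kept = [u for u in urls
        if should_harvest(u, regex_include=r"/news/", regex_exclude=r"video")]
print(kept)
```

With neither pattern set, every URL passes, which mirrors both fields being optional.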

extraUrls

Complex Type

"url": string, // The URL

"title": string, // The title that the document will be given (i.e. the equivalent of the RSS title). See the warning below.

Warning

url must always be set.

title must be set unless the "Follow web links" element is also used, in which case the page is only crawled for links and is not harvested.

"description": string, // (Optional) The description that the document will be given (i.e. the equivalent of the RSS description)

"publishedData": string, // (Optional) The date that will be assigned to the document (default: now); this can be overridden from "structuredAnalysis"

"fullText": string // (Optional) Pre-populates the document content, which is mostly useful for debugging. If present and "useTextExtractor" is "none", the specified string is used instead of the URL contents.

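The "extraUrls" rules above (url always required, title required unless the page is only crawled for links) could be checked before a source is submitted. The following Python sketch makes several assumptions: make_extra_url and its links_only flag are hypothetical helpers rather than part of the product, links_only merely stands in for "the Follow web links element is also used", and "extraUrls" is assumed to hold a list of such entries.

```python
import json
from typing import Optional


def make_extra_url(url: str,
                   title: Optional[str] = None,
                   description: Optional[str] = None,
                   full_text: Optional[str] = None,
                   links_only: bool = False) -> dict:
    """Build one "extraUrls" entry, enforcing the documented warning.

    links_only is a hypothetical flag meaning the page is only crawled for
    links; it is not written into the entry itself.
    """
    if not url:
        raise ValueError("url must always be set")
    if title is None and not links_only:
        raise ValueError("title must be set unless the page is only crawled for links")
    entry = {"url": url}
    optional = {"title": title, "description": description, "fullText": full_text}
    # Only emit the optional fields that were actually supplied.
    entry.update({k: v for k, v in optional.items() if v is not None})
    return entry


# Hypothetical entries; assumes "extraUrls" is a list of entries.
extra_urls = [
    make_extra_url("http://example.com/report.html", title="Example report"),
    make_extra_url("http://example.com/index.html", links_only=True),
]
print(json.dumps({"extraUrls": extra_urls}, indent=2))
```
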
userAgent

(Optional) If present, overrides the system default user agent string

proxyOverride

(Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
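
As an illustration of the "proxyOverride" format, the Python sketch below parses a value into either a direct connection or a (scheme, host, port) triple. The function name parse_proxy_override is hypothetical, and the strictness of the pattern (no path, numeric port required) is an assumption based only on the "(http|socks)://host:port" form given above.

```python
import re
from typing import Optional, Tuple

# Mirrors the documented form "(http|socks)://host:port"; stricter or looser
# parsing in the real harvester is possible and not confirmed here.
_PROXY_SPEC = re.compile(r"(http|socks)://([^:/]+):(\d+)")


def parse_proxy_override(value: Optional[str]) -> Optional[Tuple[str, str, int]]:
    """Return None for a direct connection, else (scheme, host, port)."""
    if value is None or value == "direct":
        return None  # "direct" is documented as the default
    match = _PROXY_SPEC.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid proxy specification: {value!r}")
    scheme, host, port = match.groups()
    return scheme, host, int(port)


# Hypothetical proxy host, used only for illustration.
print(parse_proxy_override("direct"))
print(parse_proxy_override("socks://proxy.example.com:1080"))
```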

httpFields

(Optional) Additional HTTP fields to be applied to the request headers.

Can contain the special field "Content", whose value is sent as the body of a POST request.
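
To make the role of the special "Content" field concrete, the sketch below shows one way "httpFields" could be turned into a request using Python's standard urllib. It is an illustration only, not the harvester's implementation: build_request is a hypothetical helper, and it assumes that "Content" is removed from the headers and sent as a UTF-8 encoded body. No network access takes place; the request object is only constructed.

```python
import urllib.request


def build_request(url: str, http_fields: dict) -> urllib.request.Request:
    """Build a request from an "httpFields"-style mapping.

    Assumption: "Content" is not a real header; its value becomes the body,
    which makes urllib issue a POST instead of a GET.
    """
    headers = {k: v for k, v in http_fields.items() if k != "Content"}
    body = http_fields.get("Content")
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers)


# Hypothetical URL and field values.
post_req = build_request(
    "http://example.com/search",
    {"Accept": "application/xml", "Content": "q=rss"},
)
get_req = build_request("http://example.com/feed", {"Accept": "application/xml"})
print(post_req.get_method(), post_req.data)
print(get_req.get_method(), get_req.data)
```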

...