{
    "display": string,
    "feed": {
        "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)

        "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms):
                                        // e.g. if a site is timing out, it may be limiting the number of accesses from a given IP - set the number higher;
                                        // for large sites you can increase the performance of the harvester by setting this number lower
        "updateCycle_secs": integer, // Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered from the RSS or in the "extraUrls"
        "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
        "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested

        "extraUrls": [ // This array allows for manually specified URLs to be harvested once
            {
                "url": string // The URL
            }
        ],
        "userAgent": string, // (Optional) If present, overrides the system default user agent string
        "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
        "httpFields": // (Optional) Additional HTTP fields to be applied to the request headers
        {
            "field": "value" // e.g. "cookie": "sessionkey=346547657687"
        }
    }
}

Description

The Feed Harvester Extractor will connect to and extract data from an RSS feed.

The feedType field is intended to identify the feed format, but it is currently not used: only RSS is supported. The harvester connects to the specified URLs, and regexInclude and regexExclude can be used to include or exclude URLs by regex.
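
The include/exclude behaviour described above can be sketched as follows. This is only an illustration, not the harvester's own code: the function name shouldHarvest is hypothetical, and the sketch assumes a pattern matches when it is found anywhere in the URL (the page does not say whether the harvester requires a full match).

```javascript
// Hypothetical helper, not part of the harvester: decides whether a URL
// would be harvested given the optional regexInclude / regexExclude fields.
// Assumption: a pattern "matches" if it is found anywhere in the URL.
function shouldHarvest(url, feed) {
  // If regexInclude is set, only matching URLs are harvested.
  if (feed.regexInclude && !new RegExp(feed.regexInclude).test(url)) {
    return false;
  }
  // If regexExclude is set, matching URLs are never harvested.
  if (feed.regexExclude && new RegExp(feed.regexExclude).test(url)) {
    return false;
  }
  // With neither field set, every URL is harvested.
  return true;
}
```

For instance, with regexInclude set to "example\\.com/news/" and regexExclude set to "\\.pdf$", a news article page would pass the filter while a PDF under the same path would be skipped.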

Field	Description
feedType	Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)
waitTimeOverride_ms	Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms). If a site is timing out, it may be limiting the number of accesses from a given IP, so set the number higher. For large sites, you can increase the performance of the harvester by setting this number lower.
updateCycle_secs	Optional - if present, harvested URLs may be replaced if they are older than this time and are encountered from the RSS or in the "extraUrls"
regexInclude	Optional - if specified, only URLs matching the regex will be harvested
regexExclude	Optional - if specified, any URLs matching the regex will not be harvested
extraUrls	This array allows for manually specified URLs to be harvested once. Each entry is an object with a "url" string field (the URL).
userAgent	(Optional) If present, overrides the system default user agent string
proxyOverride	(Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
httpFields	(Optional) Additional HTTP fields to be applied to the request headers
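
To make the proxyOverride format concrete, the sketch below checks a value against the two forms listed in the table: "direct", or "(http|socks)://host:port". The function parseProxyOverride and the shape of its return value are hypothetical and are not part of the harvester; treating a missing value as "direct" follows the default stated above.

```javascript
// Hypothetical validator for the proxyOverride field (illustration only).
function parseProxyOverride(value) {
  // "direct" is the documented default, so a missing value is treated the same.
  if (value === undefined || value === null || value === "direct") {
    return { type: "direct" };
  }
  // Otherwise expect "(http|socks)://host:port".
  const match = /^(http|socks):\/\/([^\s:\/]+):(\d+)$/.exec(value);
  if (!match) {
    throw new Error('Invalid proxyOverride: "' + value + '"');
  }
  return { type: match[1], host: match[2], port: Number(match[3]) };
}
```

Under these assumptions, "socks://proxy.example.com:1080" parses into a SOCKS proxy on port 1080, while a value such as "ftp://host:21" is rejected.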

Examples

ExtraUrls

In the following example, the extraUrls parameter is used to harvest manually specified web content.

...

In the example, a manually specified URL is harvested once. The title, description, publisheddate, and fullText parameters can be used to manually specify strings for the specified URLs, rather than using the harvested RSS data.
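
As a rough illustration of that shape (this is not the page's original example), a source with one manually specified URL might look like the object below. All values, including the example.com address and the display string, are hypothetical. publisheddate is left out because its expected date format is not stated here.

```javascript
// Illustrative source object; every value below is made up for this sketch.
const source = {
  display: "Example feed source",
  feed: {
    extraUrls: [
      {
        url: "http://www.example.com/articles/1.html",
        // Optional overrides used instead of harvested RSS data:
        title: "Manually specified title",
        description: "Manually specified description",
        fullText: "Manually specified full text of the article"
      }
    ]
  }
};

// Serialize to the JSON form used in the source configuration.
console.log(JSON.stringify(source, null, 2));
```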

Refreshing URLs

In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs.

...
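
The refresh rule itself can be sketched as a simple age check. The helper below is hypothetical (the harvester's actual logic is not shown on this page) and assumes timestamps in milliseconds; it only captures the "older than updateCycle_secs" condition, which applies when the URL is encountered again from the RSS or in extraUrls.

```javascript
// Hypothetical helper: is a previously harvested URL old enough to be replaced?
// harvestedAtMs and nowMs are timestamps in milliseconds (an assumption).
function isDueForRefresh(harvestedAtMs, updateCycleSecs, nowMs) {
  // Without updateCycle_secs, harvested URLs are not replaced.
  if (updateCycleSecs === undefined || updateCycleSecs === null) {
    return false;
  }
  const ageMs = nowMs - harvestedAtMs;
  // Replace only if the document is older than the configured cycle.
  return ageMs > updateCycleSecs * 1000;
}
```

With updateCycle_secs set to 86400, for example, a URL harvested two days ago would qualify for replacement the next time it appears, while one harvested an hour ago would not.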

