Feed extractor
Overview
Extracts documents from RSS and Atom feeds. Note that the resulting documents normally need to be passed through a text extraction stage before further processing. One exception is when the Follow Web Links processing element is used to extract further documents, e.g. from hyperlinks in the original web pages.
Format
    {
        "display": string,
        "feed": {
            "feedType": string, // Currently not used - will allow for RSS vs Atom in future releases (currently only RSS is supported)
            "waitTimeOverride_ms": integer, // Optional - if specified, controls the amount of time between successive reads to a site (default: 10000ms):
                // i.e. if a site is timing out it may limit the number of accesses from a given IP - set the number higher
                // for large sites you can increase the performance of the harvester by setting this number lower
            "updateCycle_secs": integer, // Optional - if present harvested URLs may be replaced if they are older than this time and are encountered from the RSS or in the "extraUrls"
            "regexInclude": string, // Optional - if specified, only URLs matching the regex will be harvested
            "regexExclude": string, // Optional - if specified, any URLs matching the regex will not be harvested
            "extraUrls": [ // This array allows for manually specified URLs to be harvested once
                {
                    "url": string // The URL
                }
            ],
            "userAgent": string, // (Optional) If present overrides the system default user agent string
            "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
            "httpFields": // (Optional) Additional HTTP fields to be applied to the request headers
            {
                "field": "value" // e.g. "cookie": "sessionkey=346547657687"
            }
        }
    }
Description
The Feed Extractor connects to an RSS feed and extracts data from it. The feedType field is intended to identify the data source as RSS, although it is currently not used because only RSS is supported. The extractor connects to the specified URLs and can include or exclude URLs by regex using regexInclude or regexExclude.
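As an illustration of the include and exclude filters described above, the fragment below sketches a feed pipeline element that harvests only URLs under a blog path and skips links to PDF files. The feed URL and both patterns are hypothetical placeholders, and the sketch assumes the regexes are applied to the full document URL; verify how your deployment matches them before relying on it.

```json
{
    "feed": {
        "extraUrls": [
            { "url": "http://www.example.com/rss/blog.xml" }
        ],
        "regexInclude": ".*/blog/.*",
        "regexExclude": ".*\\.pdf$"
    }
}
```

Note that the backslash in the exclude pattern is doubled because it sits inside a JSON string.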
The following table describes the parameters of the feed extractor configuration.
Field | Description |
---|---|
feedType | Currently not used. Will allow for RSS vs Atom in future releases (currently only RSS is supported). |
waitTimeOverride_ms | Optional. If specified, controls the amount of time between successive reads to a site (default: 10000 ms). If a site is timing out, it may be limiting the number of accesses from a given IP, so set the number higher. For large sites, you can increase the performance of the harvester by setting this number lower. |
updateCycle_secs | Optional. If present, harvested URLs older than this time (in seconds) may be replaced when they are encountered again in the RSS feed or in extraUrls. |
regexInclude | Optional. If specified, only URLs matching the regex will be harvested. |
regexExclude | Optional. If specified, any URLs matching the regex will not be harvested. |
extraUrls | Array of manually specified URLs to be harvested once. Each entry is an object with a url field (string) containing the URL. |
userAgent | Optional. If present, overrides the system default user agent string. |
proxyOverride | Optional. "direct" to bypass the proxy (the default), or a proxy specification of the form "http://host:port" or "socks://host:port". |
httpFields | Optional. Additional HTTP fields to be applied to the request headers, e.g. "cookie": "sessionkey=346547657687". |
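To show how the request-related fields in the table might be combined, here is a minimal sketch of a feed element for a site that throttles repeated accesses. Every value is an illustrative assumption rather than a recommended setting: the host names and the user agent string are made up, the 20000 ms wait is simply double the documented 10000 ms default, and the cookie value reuses the example from the format description.

```json
{
    "feed": {
        "waitTimeOverride_ms": 20000,
        "userAgent": "ExampleHarvester/1.0",
        "proxyOverride": "http://proxy.example.com:8080",
        "httpFields": {
            "cookie": "sessionkey=346547657687"
        },
        "extraUrls": [
            { "url": "http://www.example.com/rss/news.xml" }
        ]
    }
}
```

Setting proxyOverride to "direct" instead would bypass the proxy, which the table lists as the default behavior.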
Examples
ExtraUrls
In the following example, the feed extractor is configured with the extraUrls parameter. extraUrls is a complex type that enables URLs to be specified manually, overriding settings that would be provided by the RSS feed. Additionally, in this example, text extraction is performed using textEngine and featureEngine.
    {
        "description": "Article on Medical Issues",
        "harvestBadSource": false,
        "isApproved": true,
        "isPublic": true,
        "key": "http.www.mayoclinic.com.rss.blog.xml",
        "mediaType": "News",
        "modified": "Oct 19, 2010 11:31:59 AM",
        "tags": [
            "topic:healthcare",
            "industry:healthcare",
            "mayo clinic",
            "health"
        ],
        "title": "MayoClinic: General Topics",
        "processingPipeline": [
            {
                "feed": {
                    "extraUrls": [
                        {
                            "url": "http://www.mayoclinic.com/rss/blog.xml"
                        }
                    ]
                }
            },
            {
                "textEngine": {
                    "engineName": "AlchemyAPI"
                }
            },
            {
                "featureEngine": {
                    "engineName": "OpenCalais"
                }
            }
        ]
    }
Refreshing URLs
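In this example, the updateCycle_secs parameter is also used to specify the refresh rate of the harvested URLs: with a value of 86400 seconds (one day), URLs older than that may be replaced when they are encountered again.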
    {
        "description": "wiy",
        "isPublic": true,
        "mediaType": "News",
        "tags": [
            "tag1"
        ],
        "title": "aaa xml test",
        "processingPipeline": [
            {
                "feed": {
                    "extraUrls": [
                        {
                            "url": "http://www.w3schools.com/xml/simple.xml"
                        }
                    ],
                    "updateCycle_secs": 86400
                }
            },
Footnotes:
Legacy documentation: Feed object