There is a separate reference for the Feed Harvester configuration object.

Infinit.e supports harvesting data from web feeds in a number of common formats (Atom, RSS 1.0, RSS 2.0, etc.). The Feed Harvester can also collect specified URLs and scrape links.

The Sample Feed Harvester Specification below demonstrates how to connect to and extract data from a feed using the harvester:

Sample Feed Harvester Specification
source : {
   ... 
   "extractType" : "Feed",
   "authentication" : {
       "username" : "username", 
       "password" : "password"},
   "url" : "http://www.mayoclinic.com/rss/blog.xml",
   "waitTimeOverride_ms": 10000, // (a standard "politeness" sleep for consecutive accesses to the same web-site, system default is 10s)
   ...
   // "Advanced" control functionality
   "updateCycle_secs": 86400, // If specified (eg value shown is 1 day) then will re-extract the URL document with that periodicity
   "regexInclude": ".*", // (Optional) regular expression, anything not matching is discarded
   "regexExclude": ".*\\.pdf", // (Optional) eg this example will discard PDFs
   // "Advanced" extraction functionality
   "userAgent": "for emulating a specific browser, defaults to Firefox",
   "extraUrls": {...}, // See the reference - for collecting specified URLs
   "searchConfig": { ... } // See the reference and the description below - for link scraping
   ...
}

Note: A complete example of the above source including a sample feed document harvested from the source can be found here: Feed Source.
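The regexInclude, regexExclude, and updateCycle_secs fields in the sample above can be read as simple per-URL checks. The following Python sketch illustrates one plausible interpretation; it is not the harvester's actual implementation. The function names should_keep and is_due are hypothetical, and the sketch assumes that patterns must match the whole URL, that matching is case-sensitive, and that omitting updateCycle_secs means a document is never re-extracted.

```python
import re
from typing import Optional


def should_keep(url: str,
                regex_include: Optional[str] = None,
                regex_exclude: Optional[str] = None) -> bool:
    """Return True if the URL passes both (optional) filters.

    Assumption: a pattern must match the whole URL, as the leading ".*"
    in the sample's regexExclude suggests. The real harvester may differ.
    """
    if regex_include is not None and re.fullmatch(regex_include, url) is None:
        return False  # does not match regexInclude, so it is discarded
    if regex_exclude is not None and re.fullmatch(regex_exclude, url) is not None:
        return False  # matches regexExclude, so it is discarded
    return True


def is_due(last_extracted_secs: float, now_secs: float,
           update_cycle_secs: Optional[int] = None) -> bool:
    """Return True if a previously extracted URL should be re-extracted.

    Assumption: when updateCycle_secs is not specified, documents are
    never re-extracted.
    """
    if update_cycle_secs is None:
        return False
    return now_secs - last_extracted_secs >= update_cycle_secs


if __name__ == "__main__":
    # Same patterns as the sample specification (JSON "\\." is the regex "\.").
    print(should_keep("http://www.example.com/post.html", ".*", r".*\.pdf"))   # True
    print(should_keep("http://www.example.com/report.pdf", ".*", r".*\.pdf"))  # False
    # With a 1-day cycle, a document extracted 2 days ago is due again.
    print(is_due(last_extracted_secs=0, now_secs=2 * 86400, update_cycle_secs=86400))  # True
```

Under these assumptions, leaving both patterns unset keeps every URL, which matches the "(Optional)" annotation in the sample.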

  • extractType
    The extractType field tells the harvester the type of source to extract from, in this case Feed. Other valid values include Database, among others.
  • authentication (optional)
    The authentication object of the source document is a subset of the full Authentication object: it uses only the 'username' and 'password' fields. The Feed Harvester uses these values as feed credentials (if needed).
    • username
    • password
      Note: The password field in the Authentication object is currently stored as clear text; Infinit.e does not encrypt the value placed in it. Encryption of the password field is planned for a future release.
  • url
    The URL to retrieve the RSS feed from.
  • extraUrls
    Allows collection of specified URLs
  • searchConfig
    Described below

Web-crawling and similar activities

TODO

When JavaScript is used, the same security restrictions apply as elsewhere.

 
