Using the Feed Harvester

There is a separate reference for the Feed Harvester configuration object.

Infinit.e supports harvesting data from RSS feeds in a number of common formats (Atom, RSS 1.0, RSS 2.0, etc.). The Feed Harvester also allows for collection of specified URLs and link scraping.

The Sample Feed Harvester Specification below demonstrates how to connect to and extract data from a feed using the harvester:

Sample Feed Harvester Specification
source : {
   ... 
   "extractType" : "Feed",
   "authentication" : {
       "username" : "username", 
       "password" : "password"},
   "url" : "http://www.mayoclinic.com/rss/blog.xml",
   "rss": {
       "waitTimeOverride_ms": 10000, // (a standard "politeness" sleep for consecutive accesses to the same web-site, system default is 10s)
   ...
       // "Advanced" control functionality
       "updateCycle_secs": 86400, // If specified (eg value shown is 1 day) then will re-extract the URL document with that periodicity
       "regexInclude": ".*" // (Optional) regular expression, anything not matching if discarded
       "regexExclude": ".*\\.pdf", // (Optional) eg this example will discard PDFs
       // "Advanced" extraction functionality
       "userAgent": "for emulating a specific browser, defaults to FireFox",
       "extraUrls": {...}, // See the reference - for collecting specified URLs
       "searchConfig": { ... } // See the reference and the description below - for link scraping
    }
   ...
}

Note: A complete example of the above source including a sample feed document harvested from the source can be found here: Feed Source.

  • extractType
    The extractType field is used to tell the harvester the type of source to extract from, in this case Feed. Other valid values include Database, etc.
  • authentication (optional)
    The Authentication object of the Source document is a subset of the full Authentication object in that it only uses the 'username' and 'password' fields. The Feed Harvester uses the username and password from the Authentication object as feed credentials (if needed).
    • username
    • password
      Note: The password field in the Authentication object is currently stored as clear text; the string value placed in password is not encrypted by Infinit.e. Encryption of the password field is planned for a future release.
  • url
    The URL to retrieve the RSS feed from.
  • extraUrls
    Allows collection of specified URLs.
  • searchConfig
    Described below.

API parsing, link following, web crawling, and similar activities

It will often be the case that a base URL will not contain useful content, but will contain links to useful content. Examples include:

  • Internet and Intranet searches, lists of PDFs of academic publications, etc.
  • Many site APIs (eg LinkedIn, TechCrunch, Flickr)
  • Directory or disambiguation pages (eg Wiki)

In addition, pages with useful content will often themselves contain links to more pages with useful content, ie the standard web crawling scenario.

Similarly, both XML and JSON APIs will often return a single URL containing many "documents", ie independent content snippets.

In some cases pagination from APIs is achieved by passing a field containing the URL of the next page; in other cases the URL follows a standard pattern.

The "searchConfig" field of the Feed Harvester provides a nice interface for all aspects of handling JSON APIs (splitting replies into many documents, fetching links, and following "next page" links). It can also be used for more standard HTML link-following and web-crawling, though is harder to use in that context.

The remainder of this section describes the basic usage of the "searchConfig" object (ie suitable for API parsing), and then provides a brief description of the more complex activity of HTML parsing.

Basic use of searchConfig

The "script" field of the "searchConfig" object needs to contain javascript (so the "scriptlang" field should always be set to "javascript"). The javascript is passed a single variable "text", containing the response to the specified URL (or URLs is "extraUrls" is specified), and needs to return an array of the following objects:

{
	"url": string, // Mandatory - this URL is copied into the "URL" field of the generated document, 
					// and is used to fetch the content unless "fullText" is set.
	"title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
	"description": string, // Optional, if set then used to generate the document's description.
	"publishedDate": string, // Optional, if not set then the current date is used instead.
 
	"fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.
 
	"spiderOut": integer //Optional, if set to true then the searchConfig.script is applied to the resulting document,
							// for a depth of up to "searchConfig.maxDepth" times
							// Note spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url)
}

So the basic use of searchConfig on JSON-based APIs should be quite straightforward, ie along the following lines:

Outline API parsing using searchConfig
var json = eval('(' + text + ')');
var retval = [];
// For each "result" in the array
// Extract URL, title, description, eg for the flickr blogs API 
// (http://www.flickr.com/services/api/response.json.html)
for (var x in json.blogs.blog) {
	var blog = json.blogs.blog[x];
	var retobj = { url: blog.url, title: blog.name };
	retval.push(retobj);
}
// Alternatively set retobj.fullText to specify the content from the API response
// In addition set retobj.spiderOut = true to run this script on the corresponding URL, eg:
if (null != json.nextPageUrl) 
	retval.push({url: json.nextPageUrl, spiderOut: true});
retval; // quirk of our javascript engine: instead of returning, you just evaluate the variable to be returned

For XML APIs the basic principle is the same, but the XML object needs to be parsed using embedded Java calls (since the Rhino javascript engine currently in use does not support e4x - it is on our roadmap to upgrade to a version that does).
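For example, here is a minimal sketch of XML parsing via Rhino's Java interop, assuming the API returns RSS-style <item> elements with <link> and <title> children (the element names will vary from API to API):

Outline XML parsing using embedded Java calls
// Parse the XML response with the JDK's DOM parser (accessed via Rhino's Java interop)
var factory = javax.xml.parsers.DocumentBuilderFactory.newInstance();
var builder = factory.newDocumentBuilder();
var doc = builder.parse(new org.xml.sax.InputSource(new java.io.StringReader(text)));
var retval = [];
// For each <item> element, extract the link and title (element names are illustrative)
var items = doc.getElementsByTagName("item");
for (var i = 0; i < items.getLength(); i++) {
	var item = items.item(i);
	var url = item.getElementsByTagName("link").item(0).getTextContent();
	var title = item.getElementsByTagName("title").item(0).getTextContent();
	retval.push({ url: String(url), title: String(title) }); // convert Java strings to javascript strings
}
retval; // as above, evaluate the variable to return it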

Advanced use of searchConfig

There are two main differences in using "searchConfig" to parse HTML:
  • The HTML needs to be parsed - this is discussed below ("Using XPath to parse HTML and XML")
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL that controls the page number, with $1 representing the page number.
    • (slightly misnamed) numResultsPerPage: if the "page number" in the URL is actually a result offset rather than a page offset, then this field should be set to the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.

For example, consider a URL of the form:

  • http://www.blahblah.com/search?q=search_terms&page=1

Then the following parameters would be used: "pageChangeRegex": "(page=\\d+)", "pageChangeReplace": "page=$1", "numResultsPerPage": 1

And for a URL of the form:

  • http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=0

Then the following parameters would be used: "pageChangeRegex": "(start_result=\\d+)", "pageChangeReplace": "start_result=$1", "numResultsPerPage": 20
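
Putting the second example together, a minimal "rss" configuration might be sketched as follows (the URL and "numPages" value are illustrative, and "extraUrls" is assumed to take an array of {"url": ...} objects - see the reference for the exact format):

Outline pagination configuration using searchConfig
"rss": {
    "extraUrls": [ { "url": "http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=0" } ],
    "searchConfig": {
        "numPages": 10, // check the first 10 pages each search cycle
        "pageChangeRegex": "(start_result=\\d+)",
        "pageChangeReplace": "start_result=$1",
        "numResultsPerPage": 20, // start_result is a result offset, so the page number is multiplied by 20
        "scriptlang": "javascript",
        "script": "..." // as described above, returns an array of {url, title, ...} objects
    }
}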

Finally, it is likely that standard web-crawling measures such as custom user-agents and per-page wait times will be needed. Because these might well differ between the search engine and the pages themselves, "searchConfig" has its own "waitTimeBetweenPages_ms" and "userAgent" fields (if not specified, these are inherited from the parent "rss" object).

Note that "fullText" can be set to a JSON object, and it is then converted into a string containing the JSON (ie ready to be converted back into JSON with eval) in the derived document. This is handy because Rhino does not support "JSON.stringify".

Using XPath to parse HTML and XML

The "searchConfig" object has a field "extraMeta" that enables other script types to be used. The main use case for this is using the "xpath" scripting language (with "groupNum": -1 to generate objects) to extract the URLs required, and then use the existing "script" field (with "scriptflags": "m") to tidy up those objects into the required "url"/"title"/"description"/"publishedData" format.

The "extraMeta" array works identically to the "meta" array in the unstructured analysis harvester, except that the metadata is not appended to any documents, ie after it has been passed to "script" to generate URL links it is discarded.

The "rss.searchConfig.script" javascript (ie the last element in the processing chain) can access the fields created from "extraMeta" (ie extraMeta[*].fieldName) from the "_metadata" variable that is automatically passed in if no flags are specified (otherwise make sure the "m" flag is specified - or equivalently "d" to use "_doc,metadata").

The "extraMeta" field can also be used for 2 debugging/error handling cases:

  • If a field called "_ONERROR_" is generated and no links are returned from the first page (ie likely due to a formatting error), then the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
  • Only when running from the "Config - Source - Test" API call (including from the Source Editor GUI): for every page, all of the _ONDEBUG_ field values (string or object) are dumped to the harvest message.


Example uses of ONERROR and ONDEBUG
 "rss": {
       "searchConfig": {
           "extraMeta": [
               {
                   "context":"First",
                   "fieldName":"_ONERROR_",
                   "scriptlang":"javascript",
                   "script":"var page = text; page;"
               },
               {
                   "context":"First",
                   "fieldName":"title", // (eg)
                   "scriptlang":"javascript",
					//... (can return string or object)
               },
			   {
                   "context":"First",
                   "fieldName":"_ONDEBUG_",
                   "scriptlang":"javascript",
                   "flags":"m",
                   "script":"var ret = _metadata.title; ret;"
               },
//...