...

Code Block
title: Sample Feed Harvester Specification
source : {
   ... 
   "extractType" : "Feed",
   "authentication" : {
       "username" : "username", 
       "password" : "password"},
   "url" : "http://www.mayoclinic.com/rss/blog.xml",
   "rss": {
       "waitTimeOverride_ms": 10000, // (a standard "politeness" sleep for consecutive accesses to the same web-site, system default is 10s)
   ...
       // "Advanced" control functionality
       "updateCycle_secs": 86400, // If specified (eg value shown is 1 day) then will re-extract the URL document with that periodicity
       "regexInclude": ".*" // (Optional) regular expression, anything not matching if discarded
       "regexExclude": ".*\\.pdf", // (Optional) eg this example will discard PDFs
       // "Advanced" extraction functionality
       "userAgent": "for emulating a specific browser, defaults to FireFox",
       "extraUrls": {...}, // See the reference - for collecting specified URLs
       "searchConfig": { ... } // See the reference and the description below - for link scraping
    }
   ...
}

Note: A complete example of the above source, including a sample feed document harvested from it, can be found here: Feed Source.

...

Code Block
language: javascript
title: Outline API parsing using searchConfig
var json = eval('(' + text + ')'); // "text" contains the raw response returned by the API call
var retval = [];
// For each "result" in the array
// Extract URL, title, description, eg for the flickr blogs API 
// (http://www.flickr.com/services/api/response.json.html)
for (var x in json.blogs.blog) {
	var blog = json.blogs.blog[x];
	var retobj = { url: blog.url, title: blog.name };
	retval.push(retobj);
}
// Alternatively, set retobj.fullText to specify the content from the API response.
// In addition, set retobj.spiderOut to true to run this script on the corresponding URL, eg:
if (null != json.nextPageUrl) 
	retval.push({url: json.nextPageUrl, spiderOut: true});
retval; // quirk of our javascript engine: there is no "return" - the script's last evaluated expression is its return value

For XML APIs the basic principle is the same, but the XML needs to be parsed using embedded Java calls, since the Rhino javascript engine currently in use does not support E4X (upgrading to a version that does is on our roadmap).
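
For illustration only, the following is a minimal sketch of that approach. It assumes, as in the example above, that the raw response is available in the "text" variable, that Rhino's standard Java bridge ("Packages"/"java") is exposed to harvester scripts, and that the feed uses RSS-style <item>/<link>/<title> elements; adjust the element names to the actual API.

Code Block
language: javascript
title: Sketch - XML API parsing via embedded Java calls
// Illustrative sketch only: parse the raw XML held in "text" with the JDK DOM parser
// via Rhino's Java bridge; the element names below are assumptions, not part of the reference
var builder = Packages.javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
var doc = builder.parse(new Packages.org.xml.sax.InputSource(new java.io.StringReader(text)));
var retval = [];
var items = doc.getElementsByTagName("item");
for (var i = 0; i < items.getLength(); i++) {
	var item = items.item(i);
	retval.push({
		url: item.getElementsByTagName("link").item(0).getTextContent(),
		title: item.getElementsByTagName("title").item(0).getTextContent()
	});
}
retval; // as above, the last evaluated expression is the script's return value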

Info

When Javascript is used, the same security restrictions as elsewhere apply.

Advanced use of searchConfig

There are two main differences when using "searchConfig" to parse HTML:

  • The javascript has to parse the HTML itself, eg using regular expressions. This is much more work, but there is currently no way to use friendlier technologies such as XPath (or the DOM); a sketch of the regex approach is shown after this list.
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL that controls the page number, with $1 representing the page number.
    • numResultsPerPage: (slightly misnamed) if the "page number" in the URL is actually a result offset rather than a page offset, this field should be set to the number of results per page (it is then multiplied by the page number to generate the "$1" value mentioned above). See example below.
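
The following is an illustrative sketch (not part of the reference) of the regex-based HTML parsing mentioned in the first bullet above. It assumes the fetched page body is available in "text" and that the result links are simple <a href="...">...</a> anchors; real pages will usually need more careful patterns:

Code Block
language: javascript
title: Sketch - regex-based link scraping in a searchConfig script
// Illustrative sketch only: extract url/title pairs from anchor tags with a regular expression
var retval = [];
var linkRegex = /<a\s+[^>]*href\s*=\s*"([^"]+)"[^>]*>([^<]+)<\/a>/gi;
var match;
while (null != (match = linkRegex.exec(text))) {
	retval.push({ url: match[1], title: match[2] });
}
retval; // last evaluated expression is the return value, as above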

For example, consider a URL of the form:

  • http://www.blahblahblah.com/search?q=search_terms&page=1
Then the following parameters would be used: "pageChangeRegex": "(page=\d+)", "pageChangeReplace": "page=$1", "numResultsPerPage": 1.

And for a URL of the form:

  • http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=0

Then the following parameters would be used: "pageChangeRegex": "(start_result=\d+)", "pageChangeReplace": "start_result=$1", "numResultsPerPage": 20.
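
Putting that together, a hedged sketch of how those pagination fields might look inside "searchConfig" (the "numPages" value is an illustrative assumption, and the backslash is doubled to match the JSON escaping used in the specification above):

Code Block
title: Sketch - searchConfig pagination fields for the example above
"searchConfig": {
    "numPages": 5, // illustrative assumption: traverse 5 pages each search cycle
    "pageChangeRegex": "(start_result=\\d+)",
    "pageChangeReplace": "start_result=$1",
    "numResultsPerPage": 20 // the "page number" here is a result offset, with 20 results per page
    // ... plus the link-scraping script and any other "searchConfig" fields from the reference
}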

Finally, it is likely that standard web-crawling measures such as custom user agents and per-page wait times will also be needed. Because these may well differ between the search engine and the pages themselves, "searchConfig" has its own "waitTimeBetweenPages_ms" and "userAgent" fields (if not specified, these are inherited from the parent "rss" object).
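
As a final illustration, a minimal sketch of how those overrides might sit alongside the parent "rss" settings (all values shown are assumptions, not recommendations):

Code Block
title: Sketch - searchConfig crawl settings overriding the parent "rss" fields
"rss": {
    "waitTimeOverride_ms": 10000, // politeness delay used when fetching the result pages themselves
    "userAgent": "Mozilla/5.0 (illustrative value)",
    "searchConfig": {
        "waitTimeBetweenPages_ms": 30000, // illustrative: a slower cycle against the search engine
        "userAgent": "some-other-agent/1.0" // illustrative: a different agent for the search engine
        // ... pagination fields and script as above; if these two fields are not specified
        // they are inherited from the parent "rss" object
    }
}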