...

  • Follow web links ("links") - uses the userAgent/proxyOverride/httpFields settings to control any HTTP requests that are made, either from this element or from the subsequent "split document" elements
  • Split documents ("split") - takes the current document and uses its metadata to create new documents (by outputting an object in the format immediately below). The current document can then be retained or discarded. Newly spawned documents appear at the current point in the pipeline, i.e. they are not processed by the earlier elements.

This page has been broken into the following sections for ease of localization.

Table of Contents

 

TODO

Format

Code Block
{
	"display": string,
	"links": {
	    "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
	    "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" 
	    "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
		                // [ { "url": string, "title": string /* optional-ish */, 
		                //     "description": string /* optional */, publishedDate: string /* optional */,
		                //     "spiderOut": string /*optional */ }
	    "scriptlang": string, // (Mandatory) Only "javascript" is supported, use extraMeta for different script types
	    "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details
	    "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see "Using The Feed Harvester" overview
	    "pageChangeRegex": string, // (Optional) If non-null, this regex should be used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
				                    // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected
	    "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage
	    "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go
	    "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false) then will stop harvesting as soon as an already harvested link is encountered
	                                            // (for APIs that return docs in time order, ensures that no time is wasted harvesting and then discarding duplicate links)
	    "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page
	    "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set
 
	    "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go
	}
}
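
For reference, a minimal configuration following the format above might look like the fragment below. It is only a sketch: "json.results", and the field names inside it, are placeholders for whatever the target API actually returns:

Code Block
{
    "display": "Simple JSON API search",
    "links": {
        "scriptlang": "javascript",
        "script": "var json = eval('(' + text + ')'); var retval = []; for (var x in json.results) retval.push({ url: json.results[x].url, title: json.results[x].title }); retval;"
    }
}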

...

When spiderOut is enabled, the harvester will continue to follow additional URLs (if present) until the maxDepth setting has been reached.

Gliffy diagram: Max Depth
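
Spidering depth is controlled from the same "links" object as the script itself. For example (a minimal fragment based on the format above, with the script body elided):

Code Block
"links": {
    "script": "...", // returns objects that set "spiderOut", as in the example below
    "scriptlang": "javascript",
    "maxDepth": 3 // follow "spiderOut" links at most 3 levels beyond the original page
}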

...

 

Code Block
var json = eval('(' + text + ')');
var retval = [];
// For each "result" in the array
// Extract URL, title, description, eg for the flickr blogs API 
// (http://www.flickr.com/services/api/response.json.html)
for (var x in json.blogs.blog) {
    var blog = json.blogs.blog[x];
    var retobj = { url: blog.url, title: blog.name };
    retval.push(retobj);
}
// Alternatively set retobj.fullText to specify the content from the API response
// In addition, set retobj.spiderOut = true to run this script on the corresponding URL, eg:
if (null != json.nextPageUrl) 
    retval.push({url: json.nextPageUrl, spiderOut: true});
retval; // our javascript engine has no "return" statement here - the last expression evaluated is the value returned

Note

For XML APIs the basic principle is the same, but the XML object needs to be parsed using embedded Java calls (since the Rhino javascript engine currently in use does not support e4x - it is on our roadmap to upgrade to a version that does).
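
For example, the response body (available in the text variable, as above) could be parsed with the standard JDK DOM classes, which Rhino exposes through its Packages object. This is only an illustrative sketch - the "item", "link" and "title" element names are placeholders for whatever the XML API actually returns:

Code Block
// Parse the XML response using the JDK DOM parser (available through Rhino's "Packages")
var factory = Packages.javax.xml.parsers.DocumentBuilderFactory.newInstance();
var builder = factory.newDocumentBuilder();
var source = new Packages.org.xml.sax.InputSource(new Packages.java.io.StringReader(text));
var doc = builder.parse(source);

var retval = [];
var items = doc.getElementsByTagName("item"); // placeholder element name
for (var i = 0; i < items.getLength(); i++) {
    var item = items.item(i);
    // getTextContent() returns a java.lang.String - wrap it in String() to get a javascript string
    var url = String(item.getElementsByTagName("link").item(0).getTextContent());
    var title = String(item.getElementsByTagName("title").item(0).getTextContent());
    retval.push({ url: url, title: title });
}
retval; // as above, the last expression evaluated is returned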

 

When Follow Web Links is used to ingest HTML pages, some additional considerations apply.

The extraMeta field allows "Follow Web Links" to use script types other than javascript.

For example, the xpath script language can be used to extract the required URLs from the HTML pages, before the main script converts them into the required array format (see the sketch below).

After the extraMeta output has been used by the main script, it is discarded.
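
The fragment below is a minimal, illustrative sketch of that pattern: an xpath extraMeta entry pulls the result URLs out of the page into a metadata field, and the main script then turns them into the required array. The field name "searchResults" and the xpath expression are hypothetical and would need to match the actual page structure:

Code Block
"links": {
    "extraMeta": [
        {
            "context": "First",
            "fieldName": "searchResults", // hypothetical field name
            "scriptlang": "xpath",
            "script": "//div[@class='result']/a/@href" // hypothetical xpath for the target page
        }
    ],
    "scriptlang": "javascript",
    "script": "var retval = []; for (var x in _metadata.searchResults) retval.push({ url: _metadata.searchResults[x] }); retval;"
}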

The extraMeta field can also be used for 2 debugging/error handling cases:

  • If a field called "_ONERROR_" is generated and no links are returned from the first page (likely due to a formatting error), then the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
  • When running from the "Config - Source - Test" API call (including from the Source Editor GUI) - and only then - the values of all "_ONDEBUG_" fields (which can be strings or objects) are dumped to the harvest message for every page.

Example:

Code Block
"rss": {
       "searchConfig": {
           "extraMeta": [
               {
                   "context":"First",
                   "fieldName":"_ONERROR_",
                   "scriptlang":"javascript",
                   "script":"var page = text; page;"
               },
               {
                   "context":"First",
                   "fieldName":"title", // (eg)
                   "scriptlang":"javascript",
                    //... (can return string or object)
               },
               {
                   "context":"First",
                   "fieldName":"_ONDEBUG_",
                   "scriptlang":"javascript",
                   "flags":"m",
                   "script":"var ret = _metadata.title; ret;"
               },
//...

 

Follow Web Links can be used to follow links when multiple pages must be traversed (e.g. the results pages of an intranet search engine). Follow Web Links then generates new documents based on these URLs.

The following parameters are part of the configuration:

  • numPages: the total number of pages that will be checked each search cycle.
  • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
  • pageChangeReplace: the replacement for the part of the URL that controls the page number, with $1 representing the computed page offset (the page number multiplied by numResultsPerPage, as described below).
  • (slightly misnamed) numResultsPerPage: If the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" value mentioned above). See the examples below.

 

Examples:

For example, consider a URL of the form:

  • http://www.blahblah.com/search?q=search_terms&page=1

Then the following parameters would be used: "pageChangeRegex": "page=(\d+)", "pageChangeReplace": "page=$1", "numResultsPerPage": 1

And for a URL of the form:

  • http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=0

Then the following parameters would be used: "pageChangeRegex": "start_result=(\d+)", "pageChangeReplace": "start_result=$1", "numResultsPerPage": 20
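
To illustrate the second case: with "numPages": 3, the harvester would request URLs along the following lines (this assumes the offset found in the original URL, 0 here, is used as the starting point and that page numbering is zero-based; the exact indexing is implementation-specific):

Code Block
http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=0
http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=20
http://www.blahblahblah.com/search?q=search_terms&pagesize=20&start_result=40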

It is likely that standard web-crawling measures are needed, such as custom user agents and per-page wait times. Because these might well differ between the search engine and the pages themselves, "Follow Web Links" has its own waitTimeBetweenPages_ms and userAgent fields (if not specified, these are inherited from the parent object).
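
For example, a minimal fragment showing just these two fields (the values are purely illustrative):

Code Block
"links": {
    //...
    "userAgent": "Mozilla/5.0 (compatible; example-harvester)", // illustrative value
    "waitTimeBetweenPages_ms": 5000 // wait 5 seconds between successive search/result pages
}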

 

IN PROGRESS

 

Legacy documentation:

...