...
```
{
    "display": string,
    "links": {
        "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
        "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port"
        "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
                          // [ { "url": string, "title": string /* optional-ish */,
                          //     "description": string /* optional */, "publishedDate": string /* optional */,
                          //     "spiderOut": string /* optional */ } ]
        "scriptlang": string, // (Mandatory) Only "javascript" is supported; use extraMeta for different script types
        "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details
        "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see the "Using The Feed Harvester" overview
        "pageChangeRegex": string, // (Optional) If non-null, this regex is used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
                                   // Group 1 should be the start offset, to allow any offsets specified in the URL to be respected
        "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null; must be a replace string where $1 is the page*numResultsPerPage
        "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null; controls the number of pages deep the search will go
        "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false), harvesting stops as soon as an already-harvested link is encountered
                                              // (for APIs that return docs in time order, this ensures no time is wasted harvesting and then discarding duplicate links)
        "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null; controls the number of results per page
        "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null; adds a wait between successive pages if set
        "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script), the maximum depth to go
    }
}
```
Legacy documentation:
TODO
Description
Follow Web Links can be used for API parsing (JSON, XML) or for more advanced HTML parsing.
...
API-Style Parsing
scriptlang is set to "javascript" to enable "Follow Web Links" to parse the additional URLs. The script field is passed a variable called "text", which contains the contents of the specified URL (the original document).
...
When spiderOut is enabled, the harvester will continue to follow additional URLs (if present) until the maxDepth parameter setting has been reached.
Example
...
```
var json = eval('(' + text + ')');
var retval = [];
// For each "result" in the array,
// extract URL, title, description - e.g. for the flickr blogs API
// (http://www.flickr.com/services/api/response.json.html)
for (x in json.blogs.blog) {
    var blog = json.blogs.blog[x];
    var retobj = { url: blog.url, title: blog.name };
    retval.push(retobj);
}
// Alternatively, set retobj.fullText to specify the content from the API response.
// In addition, set retobj.spiderOut: true to run this script on the corresponding URL, e.g.:
if (null != json.nextPageUrl)
    retval.push({url: json.nextPageUrl, spiderOut: true});
retval; // the javascript engine has no top-level "return" - the last statement evaluated is the return value
```
...
The links object is configured using script to spider out to the additional URLs in the input document. In the example below, the script parameter determines the behavior of Follow Web Links when an additional URL is found in the input data.
```
"links": {
    "script": "var retVals = [];\nvar n = -1;\nvar url = _doc.url.replace(/[?].*/,\"\");\nvar start = 0;\nwhile (start < text.length) {\n  var end = text.indexOf('\\n', start);\n  if (end == -1) end = text.length;\n  var line = text.substring(start, end);\n  start = end + 1;\n  n++;\n  if (0 == n) continue;\n  var title = 'line #' + n.toString();\n  var url2 = url + '#' + n.toString();\n  var retVal = { 'title':title, 'url':url2, 'fullText':line };\n  retVals.push(retVal);\n}\nretVals;"
}
```
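Unescaped for readability, the embedded script above splits the document into lines, skips the header line, and emits one link per line (anchored to the original URL), with each line's content as the fullText:
```
var retVals = [];
var n = -1;
var url = _doc.url.replace(/[?].*/,"");  // strip any query string from the document URL
var start = 0;
while (start < text.length) {
    var end = text.indexOf('\n', start);
    if (end == -1) end = text.length;
    var line = text.substring(start, end);
    start = end + 1;
    n++;
    if (0 == n) continue;  // skip the header line
    var title = 'line #' + n.toString();
    var url2 = url + '#' + n.toString();
    var retVal = { 'title':title, 'url':url2, 'fullText':line };
    retVals.push(retVal);
}
retVals;
```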
Info: For XML APIs the basic principle is the same, but the XML object needs to be parsed using embedded Java calls (since the Rhino javascript engine currently in use does not support e4x - it is on our roadmap to upgrade to a version that does).
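As an illustrative sketch only (the response structure <results><item><link/><name/></item></results> and its tag names are hypothetical, not from a real API), a links script parsing XML via embedded Java calls might look like:
```
// Parse the XML response using the JDK's DOM parser via Rhino's Java interop
var factory = javax.xml.parsers.DocumentBuilderFactory.newInstance();
var builder = factory.newDocumentBuilder();
var source = new org.xml.sax.InputSource(new java.io.StringReader(text));
var doc = builder.parse(source);

var retval = [];
var items = doc.getElementsByTagName("item"); // hypothetical element name
for (var i = 0; i < items.getLength(); i++) {
    var item = items.item(i);
    var url = item.getElementsByTagName("link").item(0).getTextContent();
    var title = item.getElementsByTagName("name").item(0).getTextContent();
    // Convert Java strings to javascript strings before returning
    retval.push({ url: String(url), title: String(title) });
}
retval; // last statement evaluated is the "return" value
```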
...
The extraMeta field can also be used for two debugging/error-handling cases:
- If a field called "_ONERROR_" is generated and no links are returned from the first page (i.e. likely due to a formatting error), then the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
- Only when running from the "Config - Source - Test" API call (including from the Source Editor GUI), for every page, all of the _ONDEBUG_ field values (which can be strings or objects) are dumped to the harvest message.
Example:
```
"rss": {
    "searchConfig": {
        "extraMeta": [
            {
                "context": "First",
                "fieldName": "_ONERROR_",
                "scriptlang": "javascript",
                "script": "var page = text; page;"
            },
            {
                "context": "First",
                "fieldName": "title", // (eg)
                "scriptlang": "javascript",
                //... (can return string or object)
            },
            {
                "context": "First",
                "fieldName": "_ONDEBUG_",
                "scriptlang": "javascript",
                "flags": "m",
                "script": "var ret = _metadata.title; ret;"
            },
            //...
```
Example: Parsing HTML
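A minimal sketch of an HTML-parsing links script, assuming the links to harvest appear as simple single-line <a href="...">...</a> tags; the regex below is illustrative only, and a real source would typically need a pattern tuned to the target site's markup:
```
var retval = [];
// Simplistic anchor matcher: group 1 captures the href, group 2 the link text
var regex = /<a\s+[^>]*href="([^"]+)"[^>]*>([^<]+)<\/a>/g;
var match;
while (null != (match = regex.exec(text))) {
    var url = match[1];
    // Skip in-page fragment links; relative URLs may need resolving against _doc.url
    if (0 == url.indexOf('#')) continue;
    retval.push({ url: url, title: match[2] });
}
retval; // last statement evaluated is the "return" value
```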
Working with Multiple Pages
...