Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This page has been broken into the following sections for ease of localization.

Table of Contents

 

TODO

Format

Code Block
{
	"display": string,
	"links": {
	    "userAgent": string, // (Optional) Overrides the "parent" (rss) setting for "search" operations (see usage guide)
	    "proxyOverride": string, // (Optional) "direct" to bypass proxy (the default), or a proxy specification "(http|socks)://host:port" 
	    "script": string, // (Mandatory) Script, must "return" (last statement evaluated) an array of the following format:
		                // [ { "url": string, "title": string /* optional-ish */, 
		                //     "description": string /* optional */, publishedDate: string /* optional */,
		                //     "spiderOut": string /*optional */ }
	    "scriptlang": string, // (Mandatory) Only "javascript" is supported, use extraMeta for different script types
	    "scriptflags": string, // (Optional) The flags to apply to the above script, see "unstructuredAnalysis.meta" for more details
	    "extraMeta": [ {...} ], // (Optional) A pipeline of metadata extraction operations that are applied prior to "script", see "Using The Feed Harvester" overview
	    "pageChangeRegex": string, // (Optional) If non-null, this regex should be used to match the pagination URL parameter (which will be replaced by pageChangeReplace)
				                    // Also, group 1 should be the start, to allow any offsets specified in the URL to be respected
	    "pageChangeReplace": string, // (Optional) Mandatory if pageChangeRegex is non-null, must be a replace string where $1 is the page*numResultsPerPage
	    "numPages": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of pages deep the search will go
	    "stopPaginatingOnDuplicate": boolean, // (Ignored unless pageChangeRegex is non-null) If true (default: false) then will stop harvesting as soon as an already harvested link is encountered
	                                            // (for APIs that return docs in time order, ensures that no time is wasted harvesting and then discarding duplicate links)
	    "numResultsPerPage": integer, // (Optional) Mandatory if pageChangeRegex is non-null - controls the number of results per page
	    "waitTimeBetweenPages_ms": integer, // (Optional) Only used if pageChangeRegex is non-null - controls a wait between successive pages if set
 
	    "maxDepth": integer // (Optional, defaults to 2) If spidering out (returning "spiderOut": "true" from the script) the maximum depth to go
	}
}

...

it is likely that standard web-crawling measures are needed such as custom user-agents, and per-page wait times. Because these might well be different from the search engine to the pages themselves, "Follow Web Links" has its own waitTimeBetweenPages_ms, and userAgent fields (if not specified these are inherited from the parent object).

 

Panel

Footnotes:

 

 

IN PROGRESS