[Gliffy diagram: Follow Web Links 2]

The following table describes the parameters of the Follow Web Links configuration.

Parameter

Description
userAgent

(Optional) Overrides the userAgent setting of the "parent" (rss) object for "search" operations (see usage guide)

proxyOverride

(Optional) Either "direct" to bypass the proxy (the default), or a proxy specification of the form "(http|socks)://host:port"

script

(Mandatory) Script whose last evaluated statement (its "return" value) must be an array of objects in the following format: [ { "url": string, "title": string /* optional-ish */, "description": string /* optional */, "publishedDate": string /* optional */, "spiderOut": string /* optional */ } ]

scriptlang

(Mandatory) Only "javascript" is supported; use extraMeta for other script types

scriptflags

(Optional) The flags to apply to the above script; see "unstructuredAnalysis.meta" for more details

extraMeta

(Optional) A pipeline of metadata extraction operations that are applied before "script"; see the "Using The Feed Harvester" overview

pageChangeRegex

(Optional) If non-null, this regex is used to match the pagination parameter in the URL (the match is replaced by pageChangeReplace). Group 1 should capture the start value, so that any offset specified in the URL is respected

pageChangeReplace

(Optional; mandatory if pageChangeRegex is non-null) A replacement string in which $1 is page*numResultsPerPage

numPages

(Optional; mandatory if pageChangeRegex is non-null) Controls how many pages deep the search will go

stopPaginatingOnDuplicate

(Ignored unless pageChangeRegex is non-null) If true (default: false), harvesting stops as soon as an already harvested link is encountered. For APIs that return docs in time order, this ensures that no time is wasted harvesting and then discarding duplicate links

numResultsPerPage

(Optional; mandatory if pageChangeRegex is non-null) Controls the number of results per page

waitTimeBetweenPages_ms

(Optional; only used if pageChangeRegex is non-null) If set, the time in milliseconds to wait between successive pages

maxDepth

(Optional, defaults to 2) The maximum depth to follow when spidering out (that is, when the script returns "spiderOut": "true")
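
The pagination parameters are easier to follow with concrete values. The sketch below is one reading of the table above and is not the harvester's own code: the configuration values, the example.com URL, and the buildPageUrls helper are all hypothetical, and it assumes pages are counted from 0 and that the offset captured by group 1 is added to page*numResultsPerPage.

```javascript
// Hypothetical configuration fragment; all values are illustrative only.
var searchConfig = {
  script: "/* link-extraction script goes here */",
  scriptlang: "javascript",
  pageChangeRegex: "start=(\\d+)",  // group 1 captures the starting offset
  pageChangeReplace: "start=$1",    // $1 stands for the computed result offset
  numPages: 3,
  numResultsPerPage: 10,
  waitTimeBetweenPages_ms: 1000,
  stopPaginatingOnDuplicate: false,
  maxDepth: 2
};

// Hypothetical helper: lists the page URLs this configuration would be
// expected to visit, assuming pages are numbered from 0.
function buildPageUrls(url, cfg) {
  var regex = new RegExp(cfg.pageChangeRegex);
  var match = regex.exec(url);
  if (match === null) {
    return [url]; // no pagination parameter found, so only one page
  }
  var start = parseInt(match[1], 10) || 0;
  var urls = [];
  for (var page = 0; page < cfg.numPages; page++) {
    var offset = start + page * cfg.numResultsPerPage;
    // Substitute the "$1" token by hand, because here it denotes the
    // computed offset rather than the regex capture group.
    var replacement = cfg.pageChangeReplace.split("$1").join(String(offset));
    urls.push(url.replace(regex, function () { return replacement; }));
  }
  return urls;
}
```

Under these assumptions, calling buildPageUrls("http://example.com/search?q=test&start=5", searchConfig) gives three URLs whose start values are 5, 15 and 25, so the offset already present in the URL is preserved.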

 

Examples

In the examples, scriptlang is set to "javascript" so that "Follow Web Links" can parse the additional URLs. The script is passed a variable called "text", which contains the content returned from the specified URL (the original document).
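
As a minimal sketch of such a script, the following extracts absolute links from an HTML page held in "text" and produces the array format described in the parameter table. The sample HTML, the example.com URLs, and the extractLinks helper are assumptions made for illustration; in the harvester, "text" would be supplied automatically and would not be declared by the script.

```javascript
// Sample input standing in for the "text" variable the harvester supplies.
var text =
  '<html><body>' +
  '<a href="http://example.com/story1">First story</a>' +
  '<a href="http://example.com/story2">Second story</a>' +
  '<a href="#top">Back to top</a>' +
  '</body></html>';

// Hypothetical helper: collects absolute http(s) links and their anchor
// text into objects with "url" and "title" fields.
function extractLinks(html) {
  var found = [];
  var anchor = /<a\s+href="(https?:\/\/[^"]+)"[^>]*>([^<]*)<\/a>/gi;
  var m;
  while ((m = anchor.exec(html)) !== null) {
    found.push({ url: m[1], title: m[2] });
  }
  return found;
}

// The last statement evaluated is what the script "returns".
var links = extractLinks(text);
links;
```

The in-page "#top" anchor is skipped because the pattern only accepts http or https URLs. A regular expression is used here only to keep the sketch short; it will miss anchors whose attributes are ordered or quoted differently.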

...