...

  • Follow web links ("links") - uses the userAgent/proxyOverride/httpFields settings to control any HTTP requests made from the subsequent "split document" elements
  • Split documents ("split") - takes the current document and uses its metadata to create new documents (by outputting an object in the format immediately below). The current document can then be retained or discarded. Newly spawned documents appear at the current point in the pipeline, ie they are skipped by earlier elements.

...

TODO

Description

Follow Web Links can be used for API parsing (JSON, XML) or for more advanced HTML parsing.

It takes as its input documents that have been generated by an extractor, and then creates new documents based on the URL links it finds. The original documents can then be retained or discarded.

[Gliffy diagram: Follow Web Links 2]

 

scriptlang is set to "javascript" to enable "Follow Web Links" to parse the additional URLs. The script field is passed a variable called "text", which contains the content of the original document (ie the result of fetching the specified URL).

For each document, the script populates an array of objects in the following format.

Code Block

{
    "url": string, // Mandatory - this URL is copied into the "URL" field of the generated document,
                   // and is used to fetch the content unless "fullText" is set.
    "title": string, // Mandatory (unless "spiderOut" is set, see below) - used to generate the document's title.
    "description": string, // Optional - if set, used to generate the document's description.
    "publishedDate": string, // Optional - if not set, the current date is used instead.
    "fullText": string, // Optional - if set, this is the content for the generated document, ie "url" is not followed.
    "spiderOut": integer // Optional - if set to true then searchConfig.script is applied to the resulting document,
                         // for a depth of up to "searchConfig.maxDepth" times.
                         // Note: spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url).
}
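As a minimal sketch of such a script, the JavaScript below parses "text" as a JSON API response and builds the array of link objects described above. The response shape (an "items" array with "link", "name", "summary" and "date" fields) is purely hypothetical; a real script would use whatever structure the source API returns. In the harvester, "text" is supplied automatically, so here it is stubbed with sample data.

```javascript
// In the harvester, "text" is supplied automatically and holds the fetched
// content of the original document; here we stub it with sample JSON.
var text = JSON.stringify({
    items: [
        { link: "http://example.com/1", name: "First", summary: "A summary", date: "2013-01-01" }
    ]
});

// Parse the document content and emit one object per link entry.
var json = JSON.parse(text);
var links = [];
for (var i = 0; i < json.items.length; i++) {
    var item = json.items[i];
    links.push({
        url: item.link,            // mandatory
        title: item.name,          // mandatory (unless "spiderOut" is used)
        description: item.summary, // optional
        publishedDate: item.date   // optional
    });
}
links; // the script's final expression is the returned array
```

Each entry in the returned array becomes a new document at this point in the pipeline; the original document can then be retained or discarded as configured.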


IN PROGRESS

 

Legacy documentation:

...