...

  • Follow web links ("links") - uses the userAgent/proxyOverride/httpFields settings to control any HTTP requests made from the subsequent "split document" elements
  • Split documents ("split") - takes the current document and uses its metadata to create new documents (by outputting an object in the format immediately below). The current document can then be retained or discarded. Newly spawned documents appear at the current point in the pipeline, ie they are skipped by earlier elements.

...

TODO

Description

Follow Web Links can be used for API parsing (JSON, XML) or for more advanced HTML parsing.

It takes as its input documents that have been generated by an extractor, and then creates new documents based on the URL links it finds. The original documents can then be retained or discarded.

[Gliffy diagram: Follow Web Links 2]

 

scriptlang is set to "javascript" to enable "Follow Web Links" to parse the additional URLs. The script field is passed a variable called "text", which contains the content of the original document (ie the result of fetching the specified URL).

For each document, the script populates an array of objects in the following format.

Code Block

{
    "url": string, // Mandatory - this URL is copied into the "URL" field of the generated document,
                   // and is used to fetch the content unless "fullText" is set.
    "title": string, // Mandatory (unless "spiderOut" is set, see below) - used to generate the document's title.
    "description": string, // Optional - if set, used to generate the document's description.
    "publishedDate": string, // Optional - if not set, the current date is used instead.
    "fullText": string, // Optional - if set, this is the content for the generated document, ie "url" is not followed.
    "spiderOut": integer // Optional - if set to true then searchConfig.script is applied to the resulting document,
                         // for a depth of up to "searchConfig.maxDepth" times.
                         // Note: spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url).
}
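As a minimal sketch of such a script, the JavaScript below parses "text" as a JSON API response and builds the array of link objects described above. The response shape (an "items" array with "link", "name", "summary" and "date" fields) is purely hypothetical; a real script would use whatever structure the source API returns. In the harvester, "text" is supplied automatically, so here it is stubbed with sample data.

```javascript
// In the harvester, "text" is supplied automatically and holds the fetched
// content of the original document; here we stub it with sample JSON.
var text = JSON.stringify({
    items: [
        { link: "http://example.com/1", name: "First", summary: "A summary", date: "2013-01-01" }
    ]
});

// Parse the document content and emit one object per link entry.
var json = JSON.parse(text);
var links = [];
for (var i = 0; i < json.items.length; i++) {
    var item = json.items[i];
    links.push({
        url: item.link,            // mandatory
        title: item.name,          // mandatory (unless "spiderOut" is used)
        description: item.summary, // optional
        publishedDate: item.date   // optional
    });
}
links; // the script's final expression is the returned array
```

Each entry in the returned array becomes a new document at this point in the pipeline; the original document can then be retained or discarded as configured.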


IN PROGRESS

 

Legacy documentation:

...