...

When Follow Web Links is used for API-style parsing, scriptlang must be set to "javascript". You can then supply JavaScript in the script field; it is passed a variable, "text", containing the response from the specified URL. The script must output an array of objects in the following format:

 

Code Block
{
    "url": string, // Mandatory - this URL is copied into the "URL" field of the generated document,
                   // and is used to fetch the content unless "fullText" is set.
    "title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
    "description": string, // Optional, if set then used to generate the document's description.
    "publishedDate": string, // Optional, if not set then the current date is used instead.

    "fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.

    "spiderOut": integer // Optional, if set to true then the Follow Web Links script is applied to the resulting document,
                         // for a depth of up to "maxDepth" times
                         // Note spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url)
}
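
As an illustration of the format above, here is a minimal sketch of a script that turns a JSON API response into the required array. The response layout (an "items" list with "link", "headline", "summary" and "date" fields) and the function name buildLinks are hypothetical, invented for this example; only "text" and the output field names come from this page. How the platform collects the array from the script (for example, as the value of its last expression) is not covered here and should be checked against your installation.

```javascript
// Hypothetical API response, used only so the sketch can run stand-alone.
// In a real source the platform supplies the "text" variable itself.
var text = JSON.stringify({
    items: [
        { link: "http://example.com/a", headline: "First item", summary: "About A", date: "2013-01-15" },
        { link: "http://example.com/b", headline: "Second item" }
    ]
});

// Convert the raw response string into an array of objects in the documented format.
function buildLinks(text) {
    var response = JSON.parse(text);
    var items = response.items || [];
    var links = [];
    for (var i = 0; i < items.length; i++) {
        var item = items[i];
        // "url" and "title" are the mandatory fields.
        var link = { url: item.link, title: item.headline };
        // Optional fields are only added when the (hypothetical) API provides them.
        if (item.summary) { link.description = item.summary; }
        if (item.date) { link.publishedDate = item.date; }
        links.push(link);
    }
    return links;
}

var links = buildLinks(text);
```

Leaving out "description" and "publishedDate" when the API omits them relies on the defaults described above (no description, current date).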
Warning

It is important to understand the two control parameters in this object, "title" and "spiderOut":

  • If "title" is set, the document is forwarded to the rest of the pipeline (i.e. it will be harvested into the platform, as if the "follow web links" element were not present)
  • If "title" is not set, the document will be discarded after it has been scanned for links to crawl
  • If "spiderOut" is set to true, then the document will be scanned for links to crawl
    • (so, for example, setting neither "title" nor "spiderOut" is a degenerate case)
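
The rules above can be summarized as a small decision function. This is only a sketch of the documented behaviour, not the platform's implementation; the function name describeHandling and the shape of its return value are made up for illustration.

```javascript
// Summarize what happens to one object returned by the script, following the
// rules listed above. Illustrative only: this is not code from the platform.
function describeHandling(link) {
    return {
        // A document with a title is forwarded to the rest of the pipeline.
        harvested: Boolean(link.title),
        // A document with spiderOut set is scanned for further links to crawl.
        scannedForLinks: Boolean(link.spiderOut)
    };
}

var harvestOnly = describeHandling({ url: "http://example.com/a", title: "A" });
var crawlOnly = describeHandling({ url: "http://example.com/list", spiderOut: true });
var both = describeHandling({ url: "http://example.com/b", title: "B", spiderOut: true });
var degenerate = describeHandling({ url: "http://example.com/c" });
```

In this sketch the last case is the degenerate one: the document is neither harvested nor scanned for links.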

 

spiderOut can be used to apply the Follow Web Links script to the resulting document. If the newly generated document itself contains further URLs, the script runs again on those URLs and returns another array, from which additional documents are generated.
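
One situation where this is useful is a paginated API. The sketch below assumes a hypothetical response with a "results" list and a "nextPage" URL (neither comes from this page): each result becomes a normal document, and the next page is returned without a "title" but with "spiderOut" set to true, so that it is scanned for links and then discarded. As noted in the format above, spiderOut only works if rss.extraUrls is non-empty.

```javascript
// Hypothetical paginated response, used only so the sketch can run stand-alone.
var text = JSON.stringify({
    results: [
        { link: "http://example.com/doc1", headline: "Document 1" },
        { link: "http://example.com/doc2", headline: "Document 2" }
    ],
    nextPage: "http://example.com/api?page=2"
});

// Build documents for the current page and, if present, a spiderOut entry for the next page.
function buildLinksWithPaging(text) {
    var response = JSON.parse(text);
    var results = response.results || [];
    var links = [];
    for (var i = 0; i < results.length; i++) {
        // Entries with a title are harvested as documents.
        links.push({ url: results[i].link, title: results[i].headline });
    }
    if (response.nextPage) {
        // No title: the next page is not harvested itself, only scanned for more links.
        links.push({ url: response.nextPage, spiderOut: true });
    }
    return links;
}

var pagedLinks = buildLinksWithPaging(text);
```

The recursion stops either when the API no longer returns a next page or when the "maxDepth" limit is reached.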

...