There is a separate reference for the Feed Harvester configuration object.

...

{
	"url": string, // Mandatory - this URL is copied into the "URL" field of the generated document, 
					// and is used to fetch the content unless "fullText" is set.
	"title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
	"description": string, // Optional, if set then used to generate the document's description.
	"publishedDate": string, // Optional, if not set then the current date is used instead.
 
	"fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.
 
	"spiderOut": integer //Optional, if set to true then the searchConfig.script is applied to the resulting document,
							// for a depth of up to "searchConfig.maxDepth" times
							// Note spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url)
}
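
As a rough illustration of the mandatory/optional rules listed above, the following Python sketch checks a single result object before it is handed on. It is not part of the harvester: the function name `validate_result` and the wording of the messages are hypothetical, and it only encodes the rules stated in the comments above ("url" always mandatory, "title" mandatory unless "spiderOut" is set, the remaining string fields optional).

```python
def validate_result(result):
    """Return a list of problems found in one result object (empty if none).

    Hypothetical helper: it mirrors the field rules described in the
    reference above and is not a function provided by the harvester.
    """
    problems = []

    # "url" is always mandatory and must be a non-empty string.
    url = result.get("url")
    if not isinstance(url, str) or not url:
        problems.append('"url" is mandatory and must be a non-empty string')

    # "title" is mandatory unless "spiderOut" is set.
    if not result.get("spiderOut") and not isinstance(result.get("title"), str):
        problems.append('"title" is mandatory unless "spiderOut" is set')

    # The remaining documented fields are optional, but must be strings if present.
    for field in ("description", "publishedDate", "fullText"):
        if field in result and not isinstance(result[field], str):
            problems.append('"%s" must be a string if set' % field)

    return problems


if __name__ == "__main__":
    example = {"url": "http://example.com/item/1", "title": "First item"}
    print(validate_result(example))
```

A check of this kind is only a convenience for debugging a script's output; which fields the harvester itself rejects is not specified here.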

So the basic use of searchConfig on JSON-based APIs should be quite straightforward, ie along the following lines:

...

Advanced use of searchConfig

There are two main differences when using "searchConfig" to parse HTML rather than JSON:
  • The HTML needs to be parsed - this is discussed below ("using xpath to parse HTML")
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL matched by pageChangeRegex, with $1 used to represent the page number.
    • (slightly misnamed) numResultsPerPage: If the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
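
The way these four fields combine can be sketched in Python as follows. This is an illustration of the description above, not the harvester's own code: the function `page_urls`, the `first_page` parameter (the number of the first page to generate) and the example URLs are all assumptions, and the real harvester may number pages differently.

```python
import re


def page_urls(url, page_change_regex, page_change_replace, num_pages,
              num_results_per_page=1, first_page=1):
    """Generate one URL per page, following the field descriptions above.

    Hypothetical sketch: "first_page" and the default of 1 for
    "num_results_per_page" are assumptions, not documented behaviour.
    """
    pattern = re.compile(page_change_regex)
    if pattern.groups < 1:
        raise ValueError("pageChangeRegex must have at least one capturing group")
    if pattern.search(url) is None:
        raise ValueError("pageChangeRegex does not match the URL")

    urls = []
    for page in range(first_page, first_page + num_pages):
        # If the URL holds a result offset rather than a page number,
        # scale the page number by the number of results per page.
        value = page * num_results_per_page
        replacement = page_change_replace.replace("$1", str(value))
        # A function is used as the replacement so that any backslashes
        # in the replacement text are inserted literally.
        urls.append(pattern.sub(lambda _match: replacement, url, count=1))
    return urls


if __name__ == "__main__":
    # Page-number style URL.
    for u in page_urls("http://intranet.example/search?q=x&page=1",
                       r"page=(\d+)", "page=$1", 3):
        print(u)
    # Result-offset style URL, 10 results per page, starting at offset 0.
    for u in page_urls("http://intranet.example/search?q=x&start=0",
                       r"start=(\d+)", "start=$1", 3,
                       num_results_per_page=10, first_page=0):
        print(u)
```

In the second call the "page number" is really an offset, so numResultsPerPage is set to 10 and the generated values step through 0, 10 and 20.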

...