There is a separate reference for the Feed Harvester configuration object.

...

Code Block
{
	"url": string, // Mandatory - this URL is copied into the "URL" field of the generated document, 
					// and is used to fetch the content unless "fullText" is set.
	"title": string, // Mandatory (unless "spiderOut" set, see below) - this is used to generate the document's title.
	"description": string, // Optional, if set then used to generate the document's description.
	"publishedDate": string, // Optional, if not set then the current date is used instead.
 
	"fullText": string, // Optional, if set then this is the content for the generated document, ie "url" is not followed.
 
	"spiderOut": integer //Optional, if set to true then the searchConfig.script is applied to the resulting document,
							// for a depth of up to "searchConfig.maxDepth" times
							// Note spiderOut only works if rss.extraUrls is non-empty (ie use that instead of url)
}

So the basic use of searchConfig on JSON-based APIs should be quite straightforward, ie along the following lines:
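For instance, a minimal sketch of such a script might look like the following. This is illustrative only: it assumes the fetched response is exposed to the script as the "text" variable (as in the _ONERROR_ example further below), that the hypothetical API returns an "items" array with "link"/"headline"/"summary"/"date" fields, and that the script's final expression (an array of objects in the format described above) is used as its return value.

Code Block
// Parse the JSON response ("eval" is used because the embedded Rhino engine's JSON support is limited)
var response = eval('(' + text + ')');
var retVal = [];
for (var i = 0; i < response.items.length; i++) {
	var item = response.items[i];
	retVal.push({
		"url": item.link,
		"title": item.headline,
		"description": item.summary,
		"publishedDate": item.date
	});
}
retVal; // the last expression is the script's return value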

...

Advanced use of searchConfig

There are 2 main differences in using "searchConfig" to parse HTML:
  • The HTML needs to be parsed - this is discussed below ("Using Xpath to parse HTML and XML")
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL matched by pageChangeRegex, with $1 used to represent the page number.
    • numResultsPerPage (slightly misnamed): if the "page number" in the URL is actually a result offset rather than a page offset, then this field should be set to the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
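As an illustrative (hypothetical) example of these fields, consider a search engine whose result pages are addressed by a "start" offset parameter, eg "http://search.example.com/query?q=test&start=0", with 10 results per page. The URL and values below are invented for illustration only:

Code Block
"rss": {
	"searchConfig": {
		"numPages": 5, // check the first 5 pages of results on each search cycle
		"pageChangeRegex": "start=(\\d+)", // matches the entire part of the URL that controls the page
		"pageChangeReplace": "start=$1", // $1 is replaced by (page number x numResultsPerPage)
		"numResultsPerPage": 10 // the URL parameter is a result offset, not a page number
		//...
	}
}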

...

Finally, when crawling HTML it is more likely that standard web-crawling measures, such as custom user agents and per-page wait times, will be needed. Because these may well differ between the search engine and the pages themselves, "searchConfig" has its own "waitTimeBetweenPages_ms" and "userAgent" fields (if not specified, these are inherited from the parent "rss" object).
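For example (all values below are illustrative only):

Code Block
"rss": {
	"waitTimeBetweenPages_ms": 10000, // used when fetching the individual result documents
	"userAgent": "ExampleHarvester/1.0 (illustrative)",
	"searchConfig": {
		"waitTimeBetweenPages_ms": 2000, // overrides the above while querying the search engine
		"userAgent": "Mozilla/5.0 (illustrative)"
		//...
	}
}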

Info

Note that "fullText" can be set to a JSON object, and it is then converted into a string containing the JSON (ie ready to be converted back into JSON with eval) in the derived document. This is handy because Rhino does not support "JSON.stringify".

Using Xpath to parse HTML and XML

The "searchConfig" object has a field "extraMeta" that enables other script types to be used. The main use case for this is using the "xpath" scripting language (with "groupNum": -1 to generate objects) to extract the URLs required, and then use the existing "script" field (with "scriptflags": "m") to tidy up those objects into the required "url"/"title"/"description"/"publishedData" format.

The "extraMeta" array works identically to the "meta" array in the unstructured analysis harvester, except that the metadata is not appended to any documents, ie after it has been passed to "script" to generate URL links it is discarded.

The "rss.searchConfig.script" javascript (ie the last element in the processing chain) can access the fields created from "extraMeta" (ie extraMeta[*].fieldName) from the "_metadata" variable that is automatically passed in if no flags are specified (otherwise make sure the "m" flag is specified - or equivalently "d" to use "_doc,metadata").

The "extraMeta" field can also be used for 2 debugging/error handling cases:

  • If a field called "_ONERROR_" is generated and no links are returned from the first page (most likely due to a formatting error), then the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
  • When running from the "Config - Source - Test" API call (including from the Source Editor GUI), and only then, all of the _ONDEBUG_ field values (which can be strings or objects) are dumped to the harvest message for every page.

 

Code Block (javascript): Example uses of _ONERROR_ and _ONDEBUG_
 "rss": {
       "searchConfig": {
           "extraMeta": [
               {
                   "context":"First",
                   "fieldName":"_ONERROR_",
                   "scriptlang":"javascript",
                   "script":"var page = text; page;"
               },
               {
                   "context":"First",
                   "fieldName":"title", // (eg)
                   "scriptlang":"javascript",
					//... (can return string or object)
               },
			   {
                   "context":"First",
                   "fieldName":"_ONDEBUG_",
                   "scriptlang":"javascript",
                   "flags":"m",
                   "script":"var ret = _metadata.title; ret;"
               },
//...