There is a separate reference for the Feed Harvester configuration object.

...

Advanced use of searchConfig

There are 2 main differences in using "searchConfig" to parse HTML:
  • The HTML needs to be parsed - this is discussed below ("using xpath to parse HTML")
  • It will often be the case (e.g. for intranet search engines) that multiple pages must be traversed (e.g. at 10 results per page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL matched by pageChangeRegex, with $1 used to represent the page number.
    • (slightly misnamed) numResultsPerPage: If the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
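
The paging fields above can be illustrated with a minimal sketch. This is not the harvester's own code: the helper name buildPageUrls, the example URL, and the choice to start counting pages at 0 are all assumptions made for illustration. It only shows how "pageChangeRegex", "pageChangeReplace" and "numResultsPerPage" could combine to produce one URL per page.

```javascript
// Hypothetical helper (not part of the harvester): derives the URL for each
// page from the first search URL and the paging sub-fields of "searchConfig".
function buildPageUrls(firstUrl, searchConfig) {
  const regex = new RegExp(searchConfig.pageChangeRegex);
  // Assumption: if numResultsPerPage is absent, the page number is used as-is.
  const perPage = searchConfig.numResultsPerPage || 1;
  const urls = [];
  // Assumption: pages are counted from 0.
  for (let page = 0; page < searchConfig.numPages; page++) {
    // The value substituted for "$1": a page number or a result offset.
    const value = String(page * perPage);
    const replacement = searchConfig.pageChangeReplace.replace("$1", value);
    // A replacer function avoids any special handling of "$" in the replacement.
    urls.push(firstUrl.replace(regex, () => replacement));
  }
  return urls;
}

// Example: the URL uses a result offset ("start"), 10 results per page.
const exampleUrls = buildPageUrls(
  "http://intranet.example/search?q=test&start=0",
  {
    numPages: 3,
    pageChangeRegex: "start=(\\d+)",
    pageChangeReplace: "start=$1",
    numResultsPerPage: 10
  }
);
console.log(exampleUrls.join("\n"));
```

With these example values the "$1" string takes the values 0, 10 and 20, i.e. the page number multiplied by "numResultsPerPage".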

...

The "rss.searchConfig.script" JavaScript (i.e. the last element in the processing chain) can access the fields created by "extraMeta" (i.e. extraMeta[*].fieldName) through the "_metadata" variable. This variable is passed in automatically if no flags are specified; otherwise make sure the "m" flag is specified (or, equivalently, "d" to use "_doc.metadata").

The "extraMeta" field can also be used for 2 debugging/error handling cases:

  • If a field called "_ONERROR_" is generated and no links are returned from the first page (most likely because of a formatting error), the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
  • When running from the "Config - Source - Test" API call (including from the Source Editor GUI), and only then, all of the _ONDEBUG_ field values (which can be strings or objects) are dumped to the harvest message for every page.

 

Example uses of _ONERROR_ and _ONDEBUG_:
 "rss": {
       "searchConfig": {
           "extraMeta": [
               {
                   "context":"First",
                   "fieldName":"_ONERROR_",
                   "scriptlang":"javascript",
                   "script":"var page = text; page;"
               },
               {
                   "context":"First",
                   "fieldName":"title", // (eg)
                   "scriptlang":"javascript",
                   //... (can return string or object)
               },
               {
                   "context":"First",
                   "fieldName":"_ONDEBUG_",
                   "scriptlang":"javascript",
                   "flags":"m",
                   "script":"var ret = _metadata.title; ret;"
               },
//...