There is a separate reference for the Feed Harvester configuration object.

...

Advanced use of searchConfig

There are two main differences when using "searchConfig" to parse HTML:
  • The HTML needs to be parsed. This is discussed below ("using xpath to parse HTML").
  • Multiple pages must often be traversed (e.g. for Intranet search engines that return 10 results per page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL matched by pageChangeRegex, with $1 used to represent the page number.
    • numResultsPerPage (slightly misnamed): if the "page number" in the URL is actually a result offset rather than a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
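
As a rough illustration of how these four fields interact, the following JavaScript sketch generates the page URLs for one search cycle. It is not the harvester's actual implementation. The function name buildPageUrls and the example URL are hypothetical, and the sketch assumes that the first capturing group of "pageChangeRegex" holds the starting page number (or the starting offset when "numResultsPerPage" is set).

```javascript
// Illustrative sketch only; the real harvester logic may differ.
// ASSUMPTION: capturing group 1 of pageChangeRegex is the starting page
// number, or the starting result offset when numResultsPerPage is set.
function buildPageUrls(baseUrl, searchConfig) {
  const numPages = searchConfig.numPages || 1;
  const perPage = searchConfig.numResultsPerPage || 1; // 1 => plain page numbers
  const regex = new RegExp(searchConfig.pageChangeRegex);

  const match = regex.exec(baseUrl);
  if (!match) {
    return [baseUrl]; // nothing in the URL controls paging
  }

  // Convert the captured value back into a page number.
  const startPage = Math.floor(parseInt(match[1], 10) / perPage);

  const urls = [];
  for (let i = 0; i < numPages; i++) {
    // Page number multiplied by results-per-page gives the "$1" value.
    const value = (startPage + i) * perPage;
    const replacement = searchConfig.pageChangeReplace.split("$1").join(String(value));
    // A function replacer avoids any special handling of "$" sequences.
    urls.push(baseUrl.replace(regex, () => replacement));
  }
  return urls;
}

// Hypothetical usage: an offset-based search engine with 10 results per page.
const exampleUrls = buildPageUrls(
  "http://intranet.example.com/search?q=test&start=0",
  {
    numPages: 3,
    pageChangeRegex: "start=(\\d+)",
    pageChangeReplace: "start=$1",
    numResultsPerPage: 10
  }
);
// exampleUrls end in start=0, start=10 and start=20 respectively.
```

For a search engine that uses real page numbers (e.g. "page=1"), the same sketch applies with "numResultsPerPage" omitted, so the value substituted for $1 is simply the page number.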

...

The "extraMeta" array works identically to the "meta" array in the unstructured analysis harvester, except that the metadata is not appended to any documents: once it has been passed to "script" to generate URL links, it is discarded.

The "rss.searchConfig.script" javascript (i.e. the last element in the processing chain) can access the fields created from "extraMeta" (i.e. extraMeta[*].fieldName) via the "_metadata" variable, which is passed in automatically if no flags are specified. If flags are specified, make sure the "m" flag is included (or, equivalently, "d" to use "_doc.metadata").
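
The following JavaScript sketch shows the general idea of a script that turns an "extraMeta" field into URL links. It is only an illustration under stated assumptions: the field name searchResults, the function name linksFromMetadata, the shape of each extracted value (an object with "url" and "title"), and the returned link format are all hypothetical and are not taken from the harvester documentation.

```javascript
// Illustrative sketch only. In a real "script" the harvester supplies
// "_metadata" itself; here it is a parameter so the logic can be run standalone.
// ASSUMPTION: extraMeta defined a field named "searchResults" whose values
// are objects with "url" and (optionally) "title" properties.
function linksFromMetadata(_metadata) {
  const results = (_metadata && _metadata.searchResults) || [];
  const links = [];
  for (const result of results) {
    // Skip entries that did not yield a URL.
    if (result && result.url) {
      links.push({ url: result.url, title: result.title || result.url });
    }
  }
  return links;
}

// Hypothetical usage with hand-made metadata:
const exampleLinks = linksFromMetadata({
  searchResults: [
    { url: "http://intranet.example.com/doc/1", title: "First result" },
    { title: "Entry without a URL" }
  ]
});
// exampleLinks contains one link, for http://intranet.example.com/doc/1
```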