There is a separate reference for the Feed Harvester configuration object.

...

Advanced use of searchConfig

There are two main differences when using "searchConfig" to parse HTML:
  • The HTML needs to be parsed - this is discussed below ("Using Xpath to parse HTML and XML").
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases (see the sketch after this list):
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL that controls the page number, with $1 representing the page number.
    • (slightly misnamed) numResultsPerPage: if the "page number" in the URL is actually a result offset rather than a page offset, then this field should be set to the number of results per page; it is multiplied by the page number to generate the "$1" string mentioned above. See example below.
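As a concrete sketch (the URL and all values here are hypothetical, and the comments are purely illustrative): suppose the search URL is http://search.example.com/?q=TERM&start=0, where "start" is a result offset and the engine returns 10 results per page. Then the paging fields might look like:

```
"searchConfig": {
    "numPages": 5,                      // traverse the first 5 result pages each search cycle
    "pageChangeRegex": "start=(\\d+)",  // matches the entire part of the URL that controls paging
    "pageChangeReplace": "start=$1",    // $1 becomes (page number x numResultsPerPage)
    "numResultsPerPage": 10             // "start" is a result offset, so scale the page number by 10
}
```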

...

Finally, standard web-crawling measures such as custom user agents and per-page wait times are more likely to be needed. Because these might well differ between the search engine and the pages themselves, "searchConfig" has its own "waitTimeBetweenPages_ms" and "userAgent" fields (if not specified, these are inherited from the parent "rss" object).
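For example, a minimal sketch of how the two levels interact (all values are illustrative):

```
"rss": {
    "waitTimeBetweenPages_ms": 10000,       // applies when fetching the documents themselves
    "userAgent": "ExampleHarvester/1.0",    // hypothetical user agent string
    "searchConfig": {
        "waitTimeBetweenPages_ms": 2000     // overrides the parent value for the search engine only
        // "userAgent" is omitted here, so it is inherited from the parent "rss" object
    }
}
```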

Info

Note that "fullText" can be set to a JSON object, and it is then converted into a string containing the JSON (ie ready to be converted back into JSON with eval) in the derived document. This is handy because Rhino does not support "JSON.stringify".

Using Xpath to parse HTML and XML

The "searchConfig" object has a field "extraMeta" that enables other script types to be used. The main use case for this is using the "xpath" scripting language (with "groupNum": -1 to generate objects) to extract the URLs required, and then use the existing "script" field (with "scriptflags": "m") to tidy up those objects into the required "url"/"title"/"description"/"publishedData" format.

...