There is a separate reference for the Feed Harvester configuration object.

...

Advanced use of searchConfig

There are two main differences when using "searchConfig" to parse HTML:
  • The javascript has HTML that needs to be parsed - this is discussed below ("Using Xpath to parse HTML").
  • It will often be the case (eg for Intranet search engines) that multiple pages must be traversed (eg 10 results/page). The following sub-fields of "searchConfig" are intended to handle these cases:
    • numPages: the total number of pages that will be checked each search cycle.
    • pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
    • pageChangeReplace: the replacement for the part of the URL that controls the page number, with $1 used to represent the page number.
    • numResultsPerPage (slightly misnamed): If the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
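
As a rough sketch (the values below are illustrative, not taken from the reference), assume a hypothetical intranet search engine whose result URL contains an offset parameter such as "&start=0" and which returns 10 results per page:

    "searchConfig": {
        "numPages": 5,
        "numResultsPerPage": 10,
        "pageChangeRegex": "start=(\\d+)",
        "pageChangeReplace": "start=$1"
    }

Here "pageChangeRegex" matches the part of the URL that controls paging ("start=0"), and "pageChangeReplace" rebuilds it with $1 set to the page number multiplied by "numResultsPerPage", so each search cycle requests successive offsets (0, 10, 20, 30, 40).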

...

Finally, it is likely that standard web-crawling measures will also be needed, such as custom user agents and per-page wait times. Because these may well differ between the search engine and the pages themselves, "searchConfig" has its own "waitTimeBetweenPages_ms" and "userAgent" fields (if not specified, these are inherited from the parent "rss" object).
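
For example, a sketch (field values illustrative; "searchConfig" sits inside the parent "rss" object as described above) in which the search engine is crawled more slowly, and with a different user agent, than the result pages themselves:

    "rss": {
        "waitTimeBetweenPages_ms": 1000,
        "userAgent": "ExampleBot/1.0 (result pages)",
        "searchConfig": {
            "waitTimeBetweenPages_ms": 10000,
            "userAgent": "ExampleBot/1.0 (search engine)"
        }
    }

If the two fields inside "searchConfig" were omitted, the search engine would simply be crawled with the same 1000ms wait and user agent as the pages themselves.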

Using Xpath to parse HTML

The "searchConfig" object has a field "extraMeta" that enables other script types to be used. The main use case for this is using the "xpath" scripting language (with "groupNum": -1 to generate objects) to extract the URLs required, and then use the existing "script" field (with "scriptflags": "m") to tidy up those objects into the required "url"/"title"/"description"/"publishedData" format.

The "extraMeta" array works identically to the "meta" array in the unstructured analysis harvester, except that the metadata is not appended to any documents, ie after it has been passed to "script" to generate URL links it is discarded.