There is a separate reference for the Feed Harvester configuration object.
...
Advanced use of searchConfig
There are 2 main differences when using "searchConfig" to parse HTML:
- The HTML needs to be parsed - this is discussed below ("using xpath to parse HTML")
- Multiple pages must often be traversed (eg 10 results/page), as is typical for Intranet search engines. The following sub-fields of "searchConfig" are intended to handle these cases:
  - numPages: the total number of pages that will be checked each search cycle.
  - pageChangeRegex: a regex that must have at least one capturing group and must match the entire part of the URL that controls the page number. See example below.
  - pageChangeReplace: the replacement for the part of the URL matched by pageChangeRegex, with $1 used to represent the page number.
  - (slightly misnamed) numResultsPerPage: if the "page number" in the URL is actually a result offset and not a page offset, then this field should be the number of results per page (which is then multiplied by the page number to generate the "$1" string mentioned above). See example.
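The way these paging fields interact can be sketched in a few lines of JavaScript. This is only an illustration of the semantics described above, not the harvester's own code: the helper name buildPageUrls, the example URL, the zero-based page index, and the treatment of a missing numResultsPerPage as 1 are all assumptions, and the harvester's regex engine may differ from JavaScript's in places.

```javascript
// Hypothetical helper: mimics the documented paging semantics only.
// Assumptions: page indexes start at 0, and an unset numResultsPerPage
// means "$1" is a plain page number rather than a result offset.
function buildPageUrls(url, cfg) {
  var regex = new RegExp(cfg.pageChangeRegex);
  var perPage = cfg.numResultsPerPage || 1;
  var urls = [];
  for (var page = 0; page < cfg.numPages; page++) {
    // Page number, or result offset when numResultsPerPage is set
    var value = String(page * perPage);
    // Substitute the literal "$1" token in pageChangeReplace
    var replacement = cfg.pageChangeReplace.split("$1").join(value);
    // Replace the part of the URL matched by pageChangeRegex;
    // the callback form avoids any special handling of "$" in the replacement
    urls.push(url.replace(regex, function () { return replacement; }));
  }
  return urls;
}

// Example: a search engine that pages by result offset, 10 results/page
var pageUrls = buildPageUrls(
  "http://intranet.example.com/search?q=test&start=0",
  {
    numPages: 3,
    pageChangeRegex: "(start=\\d+)",
    pageChangeReplace: "start=$1",
    numResultsPerPage: 10
  }
);
console.log(pageUrls.join("\n"));
// http://intranet.example.com/search?q=test&start=0
// http://intranet.example.com/search?q=test&start=10
// http://intranet.example.com/search?q=test&start=20
```

Under the same assumptions, a URL that carries a true page number would omit numResultsPerPage, so "$1" is simply the page index.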
...
The "rss.searchConfig.script" JavaScript (ie the last element in the processing chain) can access the fields created by "extraMeta" (ie extraMeta[*].fieldName) through the "_metadata" variable. This variable is passed in automatically if no flags are specified; otherwise make sure the "m" flag is specified (or, equivalently, the "d" flag, and then use "_doc.metadata").
The "extraMeta" field can also be used for 2 debugging/error handling cases:
- If a field called "_ONERROR_" is generated and no links are returned from the first page (likely due to a formatting error), the contents of _ONERROR_ (assumed to be a string) are dumped to the harvest message.
- When running from the "Config - Source - Test" API call (including from the Source Editor GUI), and only then, all of the _ONDEBUG_ field values (string or object) are dumped to the harvest message for every page.
    "rss": {
        "searchConfig": {
            "extraMeta": [
                {
                    "context": "First",
                    "fieldName": "_ONERROR_",
                    "scriptlang": "javascript",
                    "script": "var page = text; page;"
                },
                {
                    "context": "First",
                    "fieldName": "title", // (eg)
                    "scriptlang": "javascript",
                    //... (can return string or object)
                },
                {
                    "context": "First",
                    "fieldName": "_ONDEBUG_",
                    "scriptlang": "javascript",
                    "flags": "m",
                    "script": "var ret = _metadata.title; ret;"
                },
                //...