Harvest post processor

Overview

The "harvest post processor" custom plugin enables users with appropriate permissions to modify existing documents and overwrite them seamlessly with the new versions.

It can be used for a number of purposes, eg:

  • To correct source errors (eg wrong title, missing date)
  • To update existing documents based on new information (eg if a lookup table is being used to generate entities)
  • To rebuild indexes if new functionality has resulted in a mapping change in the index (eg as occurred to support the "record view" of documents, Sep 2014 OSS build)

Basic operation

The idea is to start with a template version from the custom GUI - create one if it doesn't already exist; all that is needed is MapperClass: "com.ikanow.infinit.e.hadoop.processing.InfiniteProcessingEngine$InfiniteMapper", CombinerClass/ReducerClass: "none", and key/value classes Text and BSONWritable respectively. Then:

  • The "query" is either a MongoDB or Infinit.e query that selects the desired documents (full query specification: here) ... often this will involve just "{}" and a list of communities, or just the "$srctags" field
  • The "user arguments" field is a JSON object with the following format:
{
	"rebuildAllCommunities": boolean, // Optional, defaults to false
	"debugMode": boolean, // Optional, defaults to false
	"processingPipeline": [
		{ /* standard source pipeline objects */ }
	]
}

where:

  • If "rebuildAllCommunities" is set to true, then any community containing selected documents is deleted and rebuilt (which fixes any index corruption, old mappings etc)
    • WARNING: USE WITH CAUTION, normally only during designated version upgrades
  • If "debugMode" is set to true, then running the job (whether in full mode - not recommended - or in "save and debug mode" will save the modified docs to the output collection (see below under "Output") instead of making any changes
    • WARNING: If "debugMode" is set to false (ie the default) then "save and debug" will modify the main documents collection (just on a smaller number of records)
  • The "processingPipeline" just takes a standard set of operations documented here, which start with the existing document (entities/associations and all)
    • Note any extractors at the start are just ignored
    • See the "advanced section" for more details

Advanced

This section describes any harvest-post-processor-specific issues with using pipeline elements:

  • (All extractors at the start of the pipeline are ignored - this means you can copy a working source from the source editor or source builder and paste its "processingPipeline" into the custom user arguments)
  • General:
    • If the processingPipeline is blank, then the documents will simply be re-indexed (this is slightly faster because they are not re-saved in the database, only the index)
    • Unless you are a system administrator, you can only change documents that:
      • belong to a source which you own, or
      • belong to a community which you own or for which you are a moderator
      • (Other documents are silently discarded)
  • Harvest control
    • "harvest,maxDocs_perCycle" is used to determine how many docs to batch into a single processing job. The more docs the better from a performance point of view, but setting too high can result in the Hadoop job failing either because it runs out of memory or because a single processing cycle takes more than 10 minutes. The default number is 500
      • INFO: If Hadoop errors occur when running it, try reducing this number
    • Other settings are ignored
  • Splitter / Follow Web Links
    • These are currently unsupported and will result in an error
  • Document metadata / manual text / feature engine / text engine
    • Using any of the above elements (document metadata - only if the fullText field is set) will reduce performance: because the document's fullText field may have been modified, additional database operations must take place (the decrease in performance does not apply to JSON/XML/CSV files or SQL docs)
    • If using dynamically uploaded text or feature engines (ie identified by share ids), then currently you have to add a line to the query: "$caches": "<id1>,<id2>,etc"
  • Manual entities / Manual associations / Feature Engine
    • If any of these are specified, it is assumed that the entities and associations are changing, which will slow down the overall harvest post processing performance significantly (and proportionally to how many new entities/associations are created)
    • Entities can be deleted: set the dimension to "delete", or the frequency to <= 0 (see the sketch after this list)
    • Associations can be deleted: set the "assoc_type" to "delete"
    • WARNING: currently entity/association frequencies are not re-calculated on entity/association deletion, this will introduce (likely minor) errors in the statistical scoring
  • Search index settings:
    • If a "searchIndex" element is specified with a criteria field, then criteria field is ignored and the element is always applied
    • If multiple searchIndex elements are specified, only the last one is used
    • If the "searchIndex.indexOnIngest" is set to "false" then the modified document will not be stored to the index, only the data store - note this deletes the document from the index if it is currently indexed, eg it will not return from any queries henceforth
  • Storage settings:
    • By default, any documents that match the "storageSettings.rejectDocCriteria" are simply not modified
    • If, in addition, "storageSettings.deleteExistingDocOnRejection" is set, then the matching documents are actually deleted
      • WARNING: this cannot be undone, except by re-harvesting the specified docs, if that's even possible (eg if the doc is still present in the filesystem/RSS feed/etc)
      • Note that this doesn't mean they won't be re-harvested
        • One option if you want to ensure a doc isn't re-harvested is just to set searchIndex.indexOnIngest to false as described above - because it's not indexed, it will never be returned from a query (although it can be returned by a MongoDB query for a custom plugin)
      • (Note that docs that are rejected because they error are never deleted)
      • WARNING: currently entity/association frequencies are not re-calculated on doc deletion, this will introduce (likely minor) errors in the statistical scoring
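
To illustrate several of the elements above in combination, here is a hedged sketch of "user arguments" that lowers the batch size, deletes an unwanted entity, removes the docs from the search index, and deletes any docs matching a rejection criteria. The "harvest" and "manualEntities" element names and the entity fields shown are assumptions based on the standard pipeline documentation linked above, and "<criteria>" is a placeholder for a real rejection criteria:

{
	"debugMode": true,
	"processingPipeline": [
		{ "harvest": { "maxDocs_perCycle": 250 } },
		{
			"manualEntities": [
				/* dimension "delete" removes this entity from each matching doc */
				{ "disambiguated_name": "Unwanted Entity", "type": "Keyword", "dimension": "delete" }
			]
		},
		{
			/* false removes the modified docs from the search index (they remain in the data store) */
			"searchIndex": { "indexOnIngest": false }
		},
		{
			"storageSettings": {
				"rejectDocCriteria": "<criteria>",
				/* matching docs are deleted outright - see the warnings above */
				"deleteExistingDocOnRejection": true
			}
		}
	]
}

(If the pipeline also used dynamically uploaded text or feature engines, remember that the query would additionally need the "$caches": "<id1>,<id2>,etc" line described above.)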

Output

If "debugMode" is set to true, then all documents are output in their entirety to the standard custom output collection (and nothing is modified). The documents are all output with the same key, "modifiedDocument". (eg this can be used as the "query" term when browsing the output eg using the custom / get API call, or in the Record Analyzer if "$output.indexMode" is set to custom in the query. Deleted documents are output as "deletedDocument".

Regardless of the debugMode setting, the following output record types (set in the "key" field, as above) can be generated:

  • "runProcessingLoop": This occurs every cycle (eg 500) docs, and reports some harvest statistics and error messages against the Hadoop taskID, object with a single string field "message"
  • "completeMapper": the aggregated statistics for a single Hadoop mapper (identified by taskID), object with a single string field "message"
  • "WARNING": warnings based on likely misconfigurations, object with a single string field "error"