Document storage settings

Overview

Once data is ingested into Infinit.e by the various extractors, it is stored in JSON format, including its metadata fields and content.  Each stored document also contains sub-objects such as entities and associations.

Infinit.e provides several mechanisms by which documents can be updated over time.  For example, the updateCycle_secs field can be set on RSS sources to periodically update documents based on RSS feeds.  Document storage settings configure how Infinit.e stores new documents and how it updates existing ones.  Fields can also be designated as persistent, so that they remain intact across document updates.

Format

{
	"display": string,
	"storageSettings": {
		"rejectDocCriteria":string,//OPTIONAL: If populated, runs a user script function and if return value is non-null doesn't create the object and logs the output.  *Not* wrapped in $SCRIPT().
		"onUpdateScript":string,//OPTIONAL: Used to preserve existing metadata when documents are updated, and also to generate new metadata based on the differences between old and new documents. *Not* wrapped in $SCRIPT().
		"metadataFieldStorage":string,//OPTIONAL: A comma-separated list of top-level metadata fields to either exclude (if the value starts with '-'), or only include (starts with '+', default) - the fields are deleted at that point in the pipeline.
	}
} 

 

Description

The following table describes the parameters of the document storage settings configuration.

Field | Description
rejectDocCriteria | OPTIONAL: If populated, runs a user script function; if the return value is non-null, the object is not created and the output is logged. *Not* wrapped in $SCRIPT().
onUpdateScript | OPTIONAL: Used to preserve existing metadata when documents are updated, and also to generate new metadata based on the differences between old and new documents. *Not* wrapped in $SCRIPT().
metadataFieldStorage | OPTIONAL: A comma-separated list of metadata fields to either exclude (if the value starts with '-'), or only include (starts with '+', default) - the fields are deleted at that point in the pipeline. If the negative filter (i.e. the value starts with '-') is used, metadata fields can be nested, using dot notation. For the positive filter (default), the fields must be top-level.

Use Cases

The fields of the Document storage settings configuration can be used to support the following use cases:

  • Determine which metadata fields will be stored and used for creation of entities/associations

See examples below.

  • Determine how documents will be updated
    • Retain existing metadata/entities/associations
    • Build new metadata/entities/associations

See examples below.

Examples

Document Storage Settings

Metadata Field Storage

Consider this document:

{
	//...
	"metadata": {
		"field1": {
			//...
		},
		"field2": {
			"field2.1": "test",
			"field2.2": "object"
		}
	}
}

Here are some example metadataFieldStorage values, and the resulting documents after the pipeline element has run.

"metadataFieldStorage": "+" 
{
	//...
	"metadata": {
	}
}
 
"metadataFieldStorage": "field1" 
{
	//...
	"metadata": {
		"field1": {
			//...
		}
	}
}
 
"metadataFieldStorage": "-field2.2" 
{
	//...
	"metadata": {
		"field1": {
			//...
		},
		"field2": {
			"field2.1": "test"
		}
	}
}
 
"metadataFieldStorage": "field2.2" 
// NOT ALLOWED: the positive filter only accepts top-level fields
 
"metadataFieldStorage": "-field1,field2.2" 
{
	//...
	"metadata": {
		"field2": {
			"field2.1": "test"
		}
	}
}
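
The filter semantics shown in these examples can be sketched in JavaScript. This is only an illustration under stated assumptions, not the platform's implementation: the helper name applyMetadataFieldStorage and the sample field names (title, author.name, author.email) are hypothetical, and throwing an error for a nested name in the positive filter is an assumption about how the "NOT ALLOWED" case might surface.

```javascript
// Hypothetical helper (not an Infinit.e API): emulates the documented
// "+"/"-" filter on a plain metadata object and returns a filtered copy.
function applyMetadataFieldStorage(metadata, spec) {
  var exclude = spec.charAt(0) === '-';
  var hasSign = exclude || spec.charAt(0) === '+';
  // Drop the leading sign, then split the comma-separated field list.
  var fields = (hasSign ? spec.substring(1) : spec)
    .split(',')
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });
  var copy = JSON.parse(JSON.stringify(metadata)); // deep copy of the input

  if (exclude) {
    // Negative filter: nested fields are addressed with dot notation.
    fields.forEach(function (path) {
      var parts = path.split('.');
      var node = copy;
      for (var i = 0; i < parts.length - 1 && node != null; i++) {
        node = node[parts[i]];
      }
      if (node != null && typeof node === 'object') {
        delete node[parts[parts.length - 1]];
      }
    });
    return copy;
  }

  // Positive filter (default): only top-level fields may be listed.
  var kept = {};
  fields.forEach(function (name) {
    if (name.indexOf('.') >= 0) {
      throw new Error('Positive filter only accepts top-level fields: ' + name);
    }
    if (Object.prototype.hasOwnProperty.call(copy, name)) {
      kept[name] = copy[name];
    }
  });
  return kept;
}

var sampleMetadata = { title: ['A'], author: { name: 'x', email: 'y' } };
console.log(JSON.stringify(applyMetadataFieldStorage(sampleMetadata, '-title,author.email')));
// prints {"author":{"name":"x"}}
```

Note that a single leading '-' applies to the whole list, which matches the "-field1,field2.2" example above.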

Filtering Creation of Entities and Associations

rejectDocCriteria evaluates a document against a specific set of criteria.  If the script returns a non-null value (i.e. the criteria matched on some of the data), the document is discarded.

In the example below, a document whose JSON metadata (obtained from a Twitter aggregation service) is missing either of the two fields "link" or "object" is discarded.

},        {
            "storageSettings": {
                "rejectDocCriteria": "$SCRIPT( if (null == _doc.metadata.json[0].link || null == _doc.metadata.json[0].object) return 'reject'; )"
            }
        }
    ]
}
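
To see how this criterion behaves, the script body can be exercised outside the platform. The sketch below is a hedged illustration: the wrapper function and the two sample documents are hypothetical, and the explicit "return null" for the non-matching case is an assumption made so the function can run standalone.

```javascript
// Hypothetical standalone wrapper around the script body shown above.
function rejectDocCriteria(_doc) {
  var json = _doc.metadata.json[0];
  // Loose equality (==) matches both null and undefined (missing) fields.
  if (null == json.link || null == json.object) return 'reject';
  return null; // assumption: null means the document is kept
}

// Hypothetical sample documents.
var completeDoc = { metadata: { json: [{ link: 'http://example.com/1', object: { id: 1 } }] } };
var missingObjectDoc = { metadata: { json: [{ link: 'http://example.com/2' }] } };

console.log(rejectDocCriteria(completeDoc));      // prints null
console.log(rejectDocCriteria(missingObjectDoc)); // prints reject
```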

 


Updating Documents

It is possible to use onUpdateScript to configure the behavior of how documents will be updated.

Existing documents can be updated in a number of different cases:

  • Files can be updated (changing their "modified time")
  • For RSS feeds/URLs, the source parameter "updateCycle_secs" causes the document to be updated periodically.
  • Database sources can be updated as the result of a SQL call.

When a document is updated, it is essentially equivalent to deleting and then re-creating it, except that its "_id" field is preserved.

Document storage settings provide a mechanism to do the following:

  • Preserve metadata from the old document (e.g. so the entities/associations can be recreated)
  • Generate new metadata (and thence entities/associations) based on the differences between successive documents.

onUpdateScript can be configured with a script that either preserves metadata from the old document or creates new metadata.

The "$SCRIPT" convention used in entity/association scriptlets is not required here.

This script has access to the following Javascript objects:

  • "_old_doc": The document object that is about to be deleted
  • "_doc": The newly created document object after all metadata/entity/association creation.

The value of the last evaluated expression in the script is placed in a metadata field called "_PERSISTENT_".  This value can be a string, an object, or an array of objects.  The script does not use "return val;"; it simply ends with "val;".
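
The "last evaluated expression" convention can be tried outside the platform with a small simulation. This is a sketch under stated assumptions, not Infinit.e code: simulateUpdate is a hypothetical stand-in for the update step, it relies on JavaScript's eval returning the value of the last evaluated statement, and wrapping a non-array result in a one-element array is an assumption about how the value is stored.

```javascript
// Hypothetical stand-in for the platform's update step (not Infinit.e code).
function simulateUpdate(oldDoc, newDoc, onUpdateScript) {
  // Direct eval returns the value of the last evaluated statement, which
  // mirrors ending the script with "val;" instead of "return val;".
  var run = new Function('_old_doc', '_doc', 'return eval(arguments[2]);');
  var value = run(oldDoc, newDoc, onUpdateScript);
  if (value !== null && value !== undefined) {
    // Assumption: non-array results are stored as a one-element array.
    newDoc.metadata._PERSISTENT_ = Array.isArray(value) ? value : [value];
  }
  return newDoc;
}

// Hypothetical old and new versions of a document.
var oldVersion = { metadata: { tag: 'first' } };
var newVersion = { metadata: { tag: 'second' } };
var updated = simulateUpdate(oldVersion, newVersion, "var keep = { 'previousTag': _old_doc.metadata.tag }; keep;");
console.log(JSON.stringify(updated.metadata));
// prints {"tag":"second","_PERSISTENT_":[{"previousTag":"first"}]}
```

Only run eval on scripts from a trusted source; it is used here purely to reproduce the completion-value behaviour.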

Preserving Metadata From Old Versions Of Documents

The following code saves the entirety of the old document's metadata:

"onUpdateScript": "var retVal = _old_doc.metadata; retVal;"
// RESULT (IN THE CASE OF A DOCUMENT THAT DOESN'T CHANGE):
{
    // Usual document fields
    "metadata": {
        "test1": "test",
        "test2": { "field": "value" },
        "_PERSISTENT_": [{
            "test1": "test",
            "test2": { "field": "value" }
        }]
    }
}

Generating New Metadata Based On Both New And Old Versions Of Documents

In this example, the return value represents the delta between the two documents under comparison.

"onUpdateScript": "var delta = _old_doc.metadata.length - _doc.metadata.length; var retVal = { 'delta': delta }; retVal;"
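
If metadata is exposed to the script as a plain JavaScript object (an assumption), it has no built-in length property, so a field count has to be computed explicitly. The sketch below shows one hedged way to compute a comparable delta; countFields and the two sample documents are hypothetical, and in a real onUpdateScript the "_old_doc" and "_doc" objects would be supplied by the platform.

```javascript
// Hypothetical helper: counts the top-level fields of an object.
function countFields(obj) {
  var n = 0;
  for (var key in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, key)) n++;
  }
  return n;
}

// Hypothetical stand-ins for the objects the platform would provide.
var _old_doc = { metadata: { title: ['A'], author: ['x'], tags: ['t'] } };
var _doc = { metadata: { title: ['A'], author: ['x'] } };

var delta = countFields(_old_doc.metadata) - countFields(_doc.metadata);
var retVal = { 'delta': delta };
console.log(JSON.stringify(retVal)); // prints {"delta":1}
```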

Footnotes:

Legacy documentation: