Source configuration objects - legacy

Overview

Note that there is a separate overview of how to use these objects to ingest data into Infinit.e. These pages are reference information.

The Config - Source - Get API returns a source document in the following response format (Config - Source - Good is similar but returns an array of source JSON objects instead):

Query Response Format
{
      response: {
          "action": "Source Info",
          "success": boolean,
          "message": "string", // A human-readable message (e.g. "Successfully retrieved source info")
          "time": integer // The number of milliseconds spent performing the query
      },
      data: { ... } // The JSON format below
}
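
As an illustration of consuming this envelope, the following Python sketch checks "success" before returning "data". It assumes the response arrives as a strict JSON string with "response" and "data" as quoted keys; the names extract_source and SourceApiError are hypothetical and not part of the Infinit.e API.

```python
import json


class SourceApiError(Exception):
    """Raised when the response envelope reports success == false (hypothetical helper)."""


def extract_source(raw: str) -> dict:
    """Parse a Config - Source - Get style response and return its "data" object.

    Assumes the envelope layout shown above: a "response" object carrying
    "success" and "message", next to a "data" object.
    """
    envelope = json.loads(raw)
    response = envelope.get("response", {})
    if not response.get("success", False):
        # Surface the human-readable message supplied by the server, if any.
        raise SourceApiError(response.get("message", "unknown error"))
    return envelope.get("data", {})
```

A caller would pass the raw response body to extract_source and handle SourceApiError for failed requests.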

JSON format

Source Document
{
	// User-defined top level metadata:
	"title" : "string", // String, display title for source
	"description" : "string", // String, display description of documents to be harvested
	"url" : "string", // String, url/path to documents to harvest
	"mediaType" : "string", // Type of document being harvested, e.g. Record, Report. A free-form string used to populate the corresponding field in the document

	"tags" : [ "string" ], // Array of tags that are appended to documents harvested for this source
 
	// Auto-generated top level metadata:
	"_id" : "string", // A unique ID for the document
	"key" : "string", // String, unique identifier for a source based on the url
	"created": "string", // When the source was originally created (Java date format)
	"modified" : "string", // When the source was last modified (Java date format)
 
	// Social metadata
	// User-generated:
	"isPublic" : boolean, // Described below, under source privacy (summary: if "isPublic" is false, only a restricted set of fields is visible)
	// Admin-generated:
	"isApproved" : boolean, // When a source is first added to a community, the admin (if different to the owner) must approve it. 
	// Auto-generated:
	"ownerId": "string", // The "_id" of the creating user (see person object) - only the user and admins have write privileges on the source
	"communityIds" : [ "string" ], // A list of "_id"s of communities (normally only one) across which the source is shared
	"appendTagsToDocs": boolean, // Defaults to true, if false then the "tags" array isn't copied to the document
 
	// Different extraction types:
	"extractType" : "string", // Currently supported: "Feed" (for HTTP/RSS), "File" for SMB (shared filesystem) file access, "Database" for SQL access
	"authentication" : { ... }, // a generic authentication configuration object used (currently) by "Feed" and "Database" harvesters
	"rss": { ... }, // See RSS object below ("extractType":"Feed" only)
	"file" : { ... }, // See File object below ("extractType":"File" only)
	"database" : { ... }, // See Database object below ("extractType":"Database" only)
 
	// Enrichment engines (all optional)
	"useTextExtractor": "string", // See "Using enrichment engines" below
	"useExtractor": "string", // See "Using enrichment engines" below
	"extractorOptions": { ... }, // See "Using enrichment engines" below
	// Custom enrichment:
	"structuredAnalysis" : { ... }, // See StructuredAnalysis object below
	"unstructuredAnalysis" : { ... }, // See UnstructuredAnalysis object below
 
	// Harvest status:
	"harvest" : {
		"harvested" : "string", // The last time the source was checked for new documents (Java date format)
		"harvest_status" : "string", // The status of the harvest: "success", "in_progress", or "error"
		"harvest_message" : "string", // A free-form message containing the most recent errors encountered while harvesting
		"synced": "string" // The last time an internal "synchronization" process was performed (not of general interest, Java date format)
	},
	"harvestBadSource" : boolean, // If true, the source is ignored by the harvester; the flag is reset daily. The harvester uses this to set aside "bad" sources that might recover
		// (where the harvester deems a source unlikely to recover, it sets its "isApproved" to false). Note: use "searchCycle_secs" to disable sources manually.
 
	"searchCycle_secs": integer, // Optional, if set then the source will only be harvested every "searchCycle_secs" seconds (e.g. set to 86400 to recheck the source daily; set to -1, or to the negative of the current value, to disable the source temporarily)
	"maxDocs": integer, // Optional, if set then once this threshold is reached then 1 document is deleted for every new document added, in age order
	"duplicateExistingUrls": boolean,  // Optional, if true then this source will never duplicate existing documents within the community, even if the processing performed is different
 
	"searchIndexFilter": { // Optional object that lets the user control which fields are indexed into Lucene, i.e. are searchable (by default: all of them) - used to improve performance
		"entityFilter": "string", // (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"assocFilter": "string", // (regex applied to new-line separated entity indexes in associations, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"entityGeoFilter": "string", // (regex applied to entity indexes if the entity has geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"assocGeoFilter": "string", // (regex applied to new-line separated entity indexes in associations with geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"fieldList": "string", // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
		"metadataFieldList": "string"  // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
	}
}
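
The "searchCycle_secs" comment above describes disabling a source temporarily by setting the value to -1 or to the negative of its current value. A minimal Python sketch of that convention follows. The function names are hypothetical, and re-enabling by negating the value again is an assumption inferred from the comment, not documented behaviour.

```python
def disable_source(source: dict) -> dict:
    """Return a copy of the source with harvesting temporarily disabled."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if cycle is None or cycle == 0:
        # No cycle configured: use the documented -1 marker.
        updated["searchCycle_secs"] = -1
    elif cycle > 0:
        # Negate the current value so the original cycle can be recovered later.
        updated["searchCycle_secs"] = -cycle
    # Already negative: the source is already disabled, leave it unchanged.
    return updated


def enable_source(source: dict) -> dict:
    """Return a copy of the source with harvesting re-enabled (assumed convention)."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if cycle is not None and cycle < 0:
        if cycle == -1:
            # Treated here as "no earlier cycle to restore", so drop the field.
            # This is ambiguous if the original cycle really was 1 second.
            del updated["searchCycle_secs"]
        else:
            updated["searchCycle_secs"] = -cycle
    return updated
```

Both helpers return shallow copies, so the caller decides when to write the modified source back through the API.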

The sub-objects in the above JSON are described on separate reference pages.

Source privacy

Anyone in a community can view all sources within that community. If the "isPublic" field is set to true, then all fields are visible.

Note this includes passwords and javascript code - anything sensitive should be protected with "isPublic": false.

If the "isPublic" field is set to false, then the following fields are removed:

  • "authentication"
  • "file", "rss", "database"
  • "structuredAnalysis"
  • "unstructuredAnalysis"

And the following fields are modified:

  • "url": everything after the first "?" is truncated
  • "rss.extraUrls.url": as above.
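
The privacy rules above can be sketched in Python as follows. This is an illustration only, not the server's implementation: the name community_view is hypothetical, a missing "isPublic" is treated as false, and the "?" itself is kept when the URL is truncated (both are assumptions). Since "rss" is removed as a whole, "rss.extraUrls.url" is not handled separately here.

```python
import copy

# Fields removed entirely when "isPublic" is false (from the list above).
REMOVED_FIELDS = (
    "authentication",
    "file",
    "rss",
    "database",
    "structuredAnalysis",
    "unstructuredAnalysis",
)


def truncate_query(url: str) -> str:
    """Drop everything after the first "?" (the "?" is kept, by assumption)."""
    head, sep, _ = url.partition("?")
    return head + sep


def community_view(source: dict) -> dict:
    """Return the source as other community members would see it."""
    view = copy.deepcopy(source)
    if view.get("isPublic", False):
        # Public sources expose every field, including passwords and scripts.
        return view
    for field in REMOVED_FIELDS:
        view.pop(field, None)
    if "url" in view:
        view["url"] = truncate_query(view["url"])
    return view
```

The input is deep-copied, so the stored source keeps its sensitive fields while the returned view can be handed to other users.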