Source configuration objects - legacy
Overview
Note that there is a separate overview of how to use these objects to ingest data into Infinit.e. These pages are reference information.
The Config - Source - Get API returns a source document in the following response format (Config - Source - Good is similar but returns an array of source JSON objects instead):
{
"response": {
"action": "Source Info",
"success": boolean,
"message": "string", // A human-readable message (e.g. "Successfully retrieved source info")
"time": integer // The number of milliseconds spent performing the query
},
"data": { ... } // The JSON format below
}
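As a rough illustration of consuming this envelope, the Python sketch below checks the "success" flag before returning the "data" payload. The function name unwrap_source_response and the sample reply are hypothetical; only the field names come from the format above, and how the reply is fetched (URL, authentication) is deliberately left out.

```python
def unwrap_source_response(reply):
    """Return the "data" payload, or raise if the call did not succeed.

    `reply` is the already-parsed JSON body (a dict). This helper is an
    illustrative assumption, not part of the API itself.
    """
    status = reply.get("response", {})
    if not status.get("success", False):
        # "message" is the human-readable explanation from the server
        raise RuntimeError(status.get("message", "unknown error"))
    return reply.get("data")


# Hypothetical reply, shaped like the response format above
sample_reply = {
    "response": {
        "action": "Source Info",
        "success": True,
        "message": "Successfully retrieved source info",
        "time": 12,
    },
    "data": {"title": "Example source"},
}

source = unwrap_source_response(sample_reply)
```

For Config - Source - Good the same helper would apply, with "data" holding an array of source objects rather than a single one.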
JSON format
{
// User-defined top level metadata:
"title" : "string", // String, display title for source
"description" : "string", // String, display description of documents to be harvested
"url" : "string", // String, url/path to documents to harvest
"mediaType" : "string", // Type of document being harvested, e.g. Record or Report. A free-form string used to populate the corresponding field in the document
"tags" : [ "string" ], // Array of tags that are appended to documents harvested for this source
// Auto-generated top level metadata:
"_id" : "string", // A unique ID for the document
"key" : "string", // String, unique identifier for a source based on the url
"created": "string", // When the source was originally created (Java date format)
"modified" : "string", // When the source was last modified (Java date format)
// Social metadata
// User-generated:
"isPublic" : boolean, // Described below, under source privacy (summary: if "isPublic" is false, only a restricted set of fields is visible)
// Admin-generated:
"isApproved" : boolean, // When a source is first added to a community, the admin (if different to the owner) must approve it.
// Auto-generated:
"ownerId": "string", // The "_id" of the creating user (see person object) - only the user and admins have write privileges on the source
"communityIds" : [ "string" ], // A list of "_id"s of communities (normally only one) across which the source is shared
"appendTagsToDocs": boolean, // Defaults to true, if false then the "tags" array isn't copied to the document
// Different extraction types:
"extractType" : "string", // Currently supported: "Feed" (for HTTP/RSS), "File" for SMB (shared filesystem) file access, "Database" for SQL access
"authentication" : { ... }, // a generic authentication configuration object used (currently) by "Feed" and "Database" harvesters
"rss": { ... }, // See RSS object below ("extractType":"Feed" only)
"file" : { ... }, // See File object below ("extractType":"File" only)
"database" : { ... }, // See Database object below ("extractType":"Database" only)
// Enrichment engines (all optional)
"useTextExtractor": "string", // See "Using enrichment engines" below
"useExtractor": "string", // See "Using enrichment engines" below
"extractorOptions": { ... }, // See "Using enrichment engines" below
// Custom enrichment:
"structuredAnalysis" : { ... }, // See StructuredAnalysis object below
"unstructuredAnalysis" : { ... }, // See UnstructuredAnalysis object below
// Harvest status:
"harvest" : {
"harvested" : "string", // The last time the source was checked for new documents (Java date format)
"harvest_status" : "string", // The status of the harvest: "success", "in_progress", or "error"
"harvest_message" : "string", // A free-form message containing the most recent errors encountered while harvesting
"synced": "string" // The last time an internal "synchronization" process was performed (not of general interest, Java date format)
},
"harvestBadSource" : boolean, // If true, the harvester ignores the source; the flag is reset daily. The harvester uses this to set aside "bad" sources that might recover
// (where the harvester deems a source unlikely to recover, it sets the source's "isApproved" to false). Note: use "searchCycle_secs" to disable sources manually.
"searchCycle_secs": integer, // Optional, if set then the source will only be harvested every "searchCycle_secs" seconds (e.g. set to 86400 to recheck the source daily; set to -1, or to the negative of the current value, to disable the source temporarily)
"maxDocs": integer, // Optional, if set then once this threshold is reached, 1 document is deleted for every new document added, in age order
"duplicateExistingUrls": boolean, // Optional, if true then this source will never duplicate existing documents within the community, even if the processing performed is different
"searchIndexFilter": { // Optional object that lets the user control which fields are indexed into Lucene, i.e. are searchable (by default: all of them) - used to improve performance
"entityFilter": "string", // (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
"assocFilter": "string", // (regex applied to new-line separated entity indexes in associations, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
"entityGeoFilter": "string", // (regex applied to entity indexes if the entity has geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
"assocGeoFilter": "string", // (regex applied to new-line separated entity indexes in associations with geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
"fieldList": "string", // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
"metadataFieldList": "string" // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
}
}
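To make the user-defined fields more concrete, here is a minimal Python sketch that builds a "Feed" source using only top-level fields from the format above, plus a helper that disables or re-enables a source through "searchCycle_secs". The helper names, the example values (including the "News" media type and the example.com URL) and the choice to drop the field when resuming from -1 are all assumptions made for illustration. The contents of the "rss" object are not covered here, so it is omitted.

```python
import json


def make_feed_source(title, description, url, tags=()):
    """Build a minimal source document for "extractType": "Feed".

    Auto-generated fields ("_id", "key", "created", ...) are left out
    because the server is described as filling them in.
    """
    return {
        "title": title,
        "description": description,
        "url": url,
        "mediaType": "News",        # free-form string, example value only
        "tags": list(tags),
        "isPublic": False,          # keep sensitive fields hidden
        "extractType": "Feed",
        "searchCycle_secs": 86400,  # recheck the source daily
    }


def set_suspended(source, suspended):
    """Return a copy with "searchCycle_secs" negated to disable the source,
    or made positive again to re-enable it (assumed convention)."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if suspended:
        # No previous cycle to remember: fall back to the plain -1 marker
        updated["searchCycle_secs"] = -abs(cycle) if cycle else -1
    elif cycle is not None and cycle < 0:
        if cycle == -1:
            # -1 carries no previous cycle, so remove the restriction
            del updated["searchCycle_secs"]
        else:
            updated["searchCycle_secs"] = -cycle
    return updated


feed = make_feed_source(
    "Example feed",
    "Headlines from an example site",
    "http://example.com/rss.xml",
    tags=["news", "example"],
)
paused = set_suspended(feed, True)
resumed = set_suspended(paused, False)
print(json.dumps(paused, indent=2))
```

Negating the current value rather than writing -1 keeps the original cycle recoverable, which is why the sketch prefers it whenever a cycle is already set.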
The sub-objects in the above JSON are described at the following links:
- Harvesting objects:
- Using enrichment engines
- Custom enrichment objects:
Source privacy
Anyone in a community can view all sources within that community. If the "isPublic" field is set to true, then all fields are visible.
Note that this includes passwords and JavaScript code; anything sensitive should be protected with "isPublic": false.
If the "isPublic" field is set to false, then the following fields are removed:
- "authentication"
- "file", "rss", "database"
- "structuredAnalysis"
- "unstructuredAnalysis"
And the following fields are modified:
- "url": everything after the first "?" is removed
- "rss.extraUrls.url": as above.
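The privacy rules above can be sketched in Python as follows. This is an illustration of the described behaviour, not the server's actual code: the name redact_source is hypothetical, a missing "isPublic" is treated as false, and the "?" itself is dropped along with the query string (the text does not say whether it is kept). Because "rss" is removed wholesale in this sketch, "rss.extraUrls.url" is not handled separately.

```python
import copy

# Fields removed from the view when "isPublic" is false
HIDDEN_FIELDS = (
    "authentication",
    "file",
    "rss",
    "database",
    "structuredAnalysis",
    "unstructuredAnalysis",
)


def redact_source(source):
    """Return the view of a source that other community members would see."""
    if source.get("isPublic", False):
        # Public source: every field is visible, including any passwords
        return copy.deepcopy(source)
    visible = {
        key: copy.deepcopy(value)
        for key, value in source.items()
        if key not in HIDDEN_FIELDS
    }
    if isinstance(visible.get("url"), str):
        # Assumption: cut at the first "?", dropping the "?" as well
        visible["url"] = visible["url"].split("?", 1)[0]
    return visible


private_source = {
    "title": "Example feed",
    "url": "http://example.com/feed?user=me&password=secret",
    "isPublic": False,
    "extractType": "Feed",
    "authentication": {"username": "me", "password": "secret"},
}
redacted = redact_source(private_source)
```

In this example the redacted view keeps "title", "isPublic" and "extractType", loses "authentication", and has its "url" cut back to the part before the query string.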