Source configuration objects - legacy
Overview
Note that there is a separate overview of how to use these objects to ingest data into Infinit.e. These pages are reference information.
The Config - Source - Get API returns a source document in the following response format (Config - Source - Good is similar but returns an array of source JSON objects instead):
{
    "response": {
        "action": "Source Info",
        "success": boolean,
        "message": "string", // A human-readable message (e.g. "Successfully retrieved source info")
        "time": integer // The number of milliseconds spent performing the query
    },
    "data": { ... } // The JSON format below
}
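As an illustration of consuming this envelope, the following minimal Python sketch checks "success" and returns the "data" object. The function name unwrap_source_response and the sample payload are hypothetical and not part of the Infinit.e API; only the field names come from the format above.

```python
import json


def unwrap_source_response(raw: str) -> dict:
    """Return the "data" object from a Config - Source - Get style response.

    Raises RuntimeError carrying the server's "message" when "success" is false.
    """
    envelope = json.loads(raw)
    status = envelope.get("response", {})
    if not status.get("success", False):
        raise RuntimeError(status.get("message", "unknown error"))
    return envelope["data"]


# Hypothetical sample payload, shaped like the response format above.
sample = json.dumps({
    "response": {
        "action": "Source Info",
        "success": True,
        "message": "Successfully retrieved source info",
        "time": 12,
    },
    "data": {"title": "Example source", "extractType": "Feed"},
})

source = unwrap_source_response(sample)
print(source["title"])
```

For Config - Source - Good, "data" would be an array of source objects rather than a single object, so the caller would iterate over the returned value instead.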
JSON format
{
    // User-defined top level metadata:
    "title" : "string", // String, display title for source
    "description" : "string", // String, display description of documents to be harvested
    "url" : "string", // String, url/path to documents to harvest
    "mediaType" : "string", // Type of document being harvested, e.g. Record, Report, etc. Basically a free-form string used to populate the corresponding field in the document
    "tags" : [ "string" ], // Array of tags that are appended to documents harvested for this source

    // Auto-generated top level metadata:
    "_id" : "string", // A unique ID for the document
    "key" : "string", // String, unique identifier for a source based on the url
    "created": "string", // When the source was originally created (Java date format)
    "modified" : "string", // When the source was last modified (Java date format)

    // Social metadata
    // User-generated:
    "isPublic" : boolean, // Described below, under source privacy (summary: if "isPublic" is false, only a restricted set of fields is visible)
    // Admin-generated:
    "isApproved" : boolean, // When a source is first added to a community, the admin (if different from the owner) must approve it.
    // Auto-generated:
    "ownerId": "string", // The "_id" of the creating user (see person object) - only the user and admins have write privileges on the source
    "communityIds" : [ "string" ], // A list of "_id"s of communities (normally only one) across which the source is shared
    "appendTagsToDocs": boolean, // Defaults to true, if false then the "tags" array isn't copied to the document

    // Different extraction types:
    "extractType" : "string", // Currently supported: "Feed" (for HTTP/RSS), "File" for SMB (shared filesystem) file access, "Database" for SQL access
    "authentication" : { ... }, // A generic authentication configuration object used (currently) by "Feed" and "Database" harvesters
    "rss": { ... }, // See RSS object below ("extractType":"Feed" only)
    "file" : { ... }, // See File object below ("extractType":"File" only)
    "database" : { ... }, // See Database object below ("extractType":"Database" only)

    // Enrichment engines (all optional)
    "useTextExtractor": "string", // See "Using enrichment engines" below
    "useExtractor": "string", // See "Using enrichment engines" below
    "extractorOptions": { ... }, // See "Using enrichment engines" below

    // Custom enrichment:
    "structuredAnalysis" : { ... }, // See StructuredAnalysis object below
    "unstructuredAnalysis" : { ... }, // See UnstructuredAnalysis object below

    // Harvest status:
    "harvest" : {
        "harvested" : "string", // The last time the source was checked for new documents (Java date format)
        "harvest_status" : "string", // The status of the harvest: "success", "in_progress", or "error"
        "harvest_message" : "string", // A free-form message containing the most recent errors encountered while harvesting
        "synced": "string" // The last time an internal "synchronization" process was performed (not of general interest, Java date format)
    },
    "harvestBadSource" : boolean, // If true, the source is ignored by the harvester; the flag is reset daily. The harvester uses it to discard "bad" sources that might recover
                                  // (where the harvester deems a source unlikely to recover, it sets its "isApproved" to false).
                                  // Note: use "searchCycle_secs" to disable sources manually.

    "searchCycle_secs": integer, // Optional, if set then the source will only be harvested every "searchCycle_secs" seconds (e.g. set to 86400 to recheck the source daily; set to -1, or to the negative of the current value, to disable the source temporarily)
    "maxDocs": integer, // Optional, if set then once this threshold is reached 1 document is deleted for every new document added, in age order
    "duplicateExistingUrls": boolean, // Optional, if true then this source will never duplicate existing documents within the community, even if the processing performed is different

    "searchIndexFilter": { // Optional object that lets the user control which fields are indexed into Lucene, i.e. are searchable (by default: all of them) - used to improve performance
        "entityFilter": "string", // (regex applied to entity indexes, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
        "assocFilter": "string", // (regex applied to new-line separated entity indexes in associations, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
        "entityGeoFilter": "string", // (regex applied to entity indexes if the entity has geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
        "assocGeoFilter": "string", // (regex applied to new-line separated entity indexes in associations with geo, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
        "fieldList": "string", // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
        "metadataFieldList": "string" // (comma-separated list of doc fields, starts with "+" or "-" to indicate inclusion/exclusion, defaults to include-only)
    }
}
The sub-objects in the above JSON are described at the following links:
- Harvesting objects:
- Using enrichment engines
- Custom enrichment objects:
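The "searchCycle_secs" comment in the JSON format above says a source can be disabled temporarily by setting the field to -1 or to the negative of its current value. A minimal Python sketch of that convention follows. The helper names are hypothetical, and the re-enable logic (negating the value back, and treating -1 as "no cycle was set") is an assumption, since this page only describes disabling.

```python
def disable_source(source: dict) -> dict:
    """Return a copy of the source with "searchCycle_secs" set to a disabling value."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if not cycle:
        # No cycle configured (or zero): use the documented -1 sentinel.
        updated["searchCycle_secs"] = -1
    else:
        # Keep the magnitude so the original cycle can be recovered later.
        updated["searchCycle_secs"] = -abs(cycle)
    return updated


def enable_source(source: dict) -> dict:
    """Assumed inverse of disable_source (not described in the reference)."""
    updated = dict(source)
    cycle = updated.get("searchCycle_secs")
    if cycle is not None and cycle < 0:
        if cycle == -1:
            # Assumption: -1 means no cycle was configured before disabling.
            del updated["searchCycle_secs"]
        else:
            updated["searchCycle_secs"] = -cycle
    return updated


daily = {"title": "Example source", "searchCycle_secs": 86400}
print(disable_source(daily)["searchCycle_secs"])
```

Negating rather than overwriting with -1 keeps the original harvest interval in the source object, which is presumably why the reference offers both options.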
Source privacy
Anyone in a community can view all sources within that community. If the "isPublic" field is set to true, then all fields are visible.
Note that this includes passwords and JavaScript code, so anything sensitive should be protected with "isPublic": false.
If the "isPublic" field is set to false, then the following fields are removed:
- "authentication"
- "file", "rss", "database"
- "structuredAnalysis"
- "unstructuredAnalysis"
And the following fields are modified:
- "url": everything after the first "?" is removed
- "rss.extraUrls.url": as above.
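The privacy rules above can be sketched in Python as a function that builds the restricted view of a source. This is an illustration only, not the server's implementation. The names restricted_view and strip_query are hypothetical, and three points are assumptions: a missing "isPublic" is treated as false, the "?" itself is dropped along with the query string, and, because "rss" is removed outright, the "rss.extraUrls.url" rule is not applied separately.

```python
import copy

# Fields removed from the restricted view, as listed above.
REMOVED_FIELDS = (
    "authentication",
    "file",
    "rss",
    "database",
    "structuredAnalysis",
    "unstructuredAnalysis",
)


def strip_query(url: str) -> str:
    """Drop the first "?" and everything after it (assumed reading of the rule)."""
    return url.split("?", 1)[0]


def restricted_view(source: dict) -> dict:
    """Return what is visible when the privacy rules above are applied."""
    if source.get("isPublic", False):
        # Public sources expose every field.
        return copy.deepcopy(source)
    view = {
        key: copy.deepcopy(value)
        for key, value in source.items()
        if key not in REMOVED_FIELDS
    }
    if "url" in view:
        view["url"] = strip_query(view["url"])
    return view


private_source = {
    "title": "Example source",
    "url": "http://example.com/feed?apikey=secret",
    "isPublic": False,
    "authentication": {"username": "u", "password": "p"},
    "rss": {"extraUrls": [{"url": "http://example.com/other?apikey=secret"}]},
}
print(restricted_view(private_source))
```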