File object

JSON format

Note that there is a separate overview of how to use the File Harvester. This page is mostly reference information.

The Source.file object describes how documents can be harvested from local or network-attached file stores.

Source.file object
{
	"username" : "string", // Username for file share authentication
	"password" : "string", // Password for file share authentication
	"domain" : "string", // Domain location of the file share

	"pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
	"pathExclude": "string", // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
	"renameAfterParse" : "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" deletes the file
									// (eg "$path/processed/$name")
 
	"type": "string", // One of "json", "xml", "tika", "*sv", or null to auto-decide
 
	"XmlRootLevelValues" : [ "string" ], // The root-level XML elements at which parsing should begin
										// also currently used as an optional field for JSON; if present, a document is created each time that field is encountered
										// (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
										// (Also reused with a completely different meaning for CSV)
	"XmlIgnoreValues" : [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level.
										// (Also reused with a completely different meaning for CSV)
	"XmlSourceName" : "string", // If present, and the primary key specified below is also found, then the URL gets built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.
	"XmlPrimaryKey" : "string" // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
}
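For reference, a complete configuration might look like the following sketch. All of the values here (credentials, regexes, and the base URL) are hypothetical and would be replaced with values appropriate to the target file share:

```json
{
	"username" : "harvester",
	"password" : "secret",
	"domain" : "EXAMPLE",
	"pathInclude" : ".*\\.xml$",
	"pathExclude" : ".*/archive/.*",
	"renameAfterParse" : "$path/processed/$name",
	"type" : "xml",
	"XmlRootLevelValues" : [ "Incident" ],
	"XmlIgnoreValues" : [ "Unrelated" ],
	"XmlSourceName" : "http://example.com/incidents?icn=",
	"XmlPrimaryKey" : "ICN"
}
```

With this configuration, only files ending in ".xml" outside any "archive" directory are processed, and each file is moved into a "processed" subdirectory after ingestion.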

Overview of Harvester

As of beta, the File harvester is set to harvest NetBIOS/Samba file shares. This assumes that the source URL is an smb:// URL.

The Harvesting Process

The harvester accesses the Samba file share using the authentication credentials provided in the source's FilePojo. If the directory is not accessible, an error is logged and zero files are returned. After a successful connection, the harvester traverses the directory tree to a maximum depth of 5 directories and returns the files found for further harvesting and extraction.

As of beta, harvesting follows two distinct paths based on the file extension.

XML Files

In order to parse XML files, several additional variables need to be defined within the Source.file object.

XmlRootLevelValues are XML keys that signify a new object or act as a container for specific data.

For example:

<IncidentList>
   <Incident>
      <ICN>987654321</ICN>
      <Subject>Test Data</Subject>
   </Incident>
   <Incident>
      <ICN>123456789</ICN>
      <Subject>Subject Test</Subject>
   </Incident>
</IncidentList>

In the above example, one wishing to create a document per incident would specify "Incident" in the XmlRootLevelValues list.
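The corresponding fragment of the Source.file object could be sketched as follows (only the XML-related fields are shown):

```json
{
	"type" : "xml",
	"XmlRootLevelValues" : [ "Incident" ]
}
```

Each occurrence of an &lt;Incident&gt; element then produces one document; the &lt;IncidentList&gt; wrapper itself does not.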

XmlIgnoreValues are XML keys that contain data that does not need to be harvested.

For example:

<IncidentList>
   <Incident>
      <ICN>987654321</ICN>
      <Subject>Test Data</Subject>
   </Incident>
   <Unrelated>
      <ICN>123456789</ICN>
      <Subject>Subject Data</Subject>
   </Unrelated>
   <Incident>
      <ICN>123456789</ICN>
      <Subject>Subject Test</Subject>
   </Incident>
</IncidentList>

If the Unrelated key contained data that need not be extracted, "Unrelated" would be added to the XmlIgnoreValues list.
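Combining the two fields, a configuration sketch for the example above might look like this:

```json
{
	"type" : "xml",
	"XmlRootLevelValues" : [ "Incident" ],
	"XmlIgnoreValues" : [ "Unrelated" ]
}
```

With this configuration, the contents of the &lt;Unrelated&gt; element are not turned into a document, while each &lt;Incident&gt; element still is.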

XmlSourceName is the URL of the FeedPojo's source. Sometimes XML data contains its own reference to its data's URL, which is then set as the feed's URL. This makes it difficult to avoid harvesting duplicate data without parsing the large XML file again.

XmlPrimaryKey is an XML key that contains a value unique to the data.

For example:

<IncidentList>
   <Incident>
      <ICN>987654321</ICN>
      <Subject>Test Data</Subject>
   </Incident>
   <Incident>
      <ICN>123456789</ICN>
      <Subject>Subject Test</Subject>
   </Incident>
</IncidentList>

For the above example, the ICN is unique for each incident, so XmlPrimaryKey would be set to "ICN". More information about setting the URL using XmlPrimaryKey is described in the File Harvester overview.
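Putting the two fields together, with a hypothetical base URL, the configuration sketch below would cause each document's URL to be built as XmlSourceName + xml[XmlPrimaryKey] (for the first incident above, "http://example.com/incidents?icn=" followed by "987654321"):

```json
{
	"type" : "xml",
	"XmlRootLevelValues" : [ "Incident" ],
	"XmlSourceName" : "http://example.com/incidents?icn=",
	"XmlPrimaryKey" : "ICN"
}
```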

Deduplication

The file harvester uses two methods of deduplication to ensure both performance and accuracy. First, the harvester checks whether the feed source URL has been harvested before. If it has, the XML is still completely parsed into feeds as if the file were new, and deduplication then continues on each feed created from the XML against its extracted URL.

Other File Types

The File Harvester uses Apache Tika v1.0 to extract data from other file types. Supported document formats are listed in the Apache Tika documentation.

Deduplication

The harvester checks whether the file's URI has been harvested before. If it has not, the file is harvested for the first time. If it has, the file's modified date is checked against the modified date recorded at the previous harvest; if the file is newer, it is re-harvested. If the file contains multiple documents (for XML/JSON), all documents from that file are deleted and re-harvested.