Infinit.e supports harvesting files from Windows/Samba shares or a harvester's local filesystem.

Infinit.e supports harvesting data from a variety of file formats including unstructured text files, semi-structured text files, CSV files, and XML files.

There is a separate reference for the File Harvester configuration object.

There are a number of typical strategies for dealing with standard file formats:

Whenever the Unstructured Analysis Harvester is used to generate metadata, the Structured Analysis Harvester is then used to turn the metadata into entities and associations.

Harvesting XML Files
source : {
   "url": "string", // see below
   ... 
   "file" : {
       "username" : "username", 
       "password" : "password", 
       "domain" : "WORKGROUP", 
       "type": "xml",

       "pathInclude":"^.*[.]xml$",
       "pathExclude":"^.*schema[.]xml$",
       "XmlRootLevelValues" : ["Incident"],
       "XmlIgnoreValues" : [
           "DefiningCharacteristicList",
           "TargetedCharacteristicList",
           "WeaponTypeList",
           "PerpetratorList",
           "VictimList",
           "EventTypeList",
           "CityStateProvinceList",
           "FacilityList"
       ],
       "XmlSourceName" : "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
       "XmlPrimaryKey" : "icn"
   },
   "useExtractor" : "none",
   ...
}

For XML and JSON files (or "*csv" files where "XmlRootLevelValues" is set): where each document within the file references a unique network resource of the form "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components respectively. Note that for JSON, dot notation can be used in "XmlPrimaryKey" to reference nested fields.
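The following Python sketch illustrates the "CONSTANT_URL_PATH + VARIABLE_ID" composition described above. It is not the harvester's actual implementation: the function names (lookup, build_document_url) and the sample record values are hypothetical, and it assumes the URL is formed by simple concatenation of "XmlSourceName" with the value of the "XmlPrimaryKey" field.

```python
from typing import Any, Optional


def lookup(record: dict, dotted_key: str) -> Optional[Any]:
    """Resolve a key such as "meta.id" against nested dictionaries.

    Returns None if any component of the path is missing.
    """
    current: Any = record
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def build_document_url(record: dict, source_name: str, primary_key: str) -> Optional[str]:
    """Concatenate the constant URL path with the record's variable ID.

    Assumption: plain string concatenation, with no escaping of the ID.
    """
    value = lookup(record, primary_key)
    if value is None:
        return None  # no usable ID in this record
    return f"{source_name}{value}"


# Hypothetical flat record, keyed the same way as "XmlPrimaryKey" : "icn"
flat = {"icn": "200458431", "subject": "example"}
flat_url = build_document_url(flat, "http://www.website.com?pageNum=", "icn")

# Hypothetical nested JSON record, addressed with dot notation
nested = {"meta": {"id": 3454354}}
nested_url = build_document_url(nested, "http://www.website.com?pageNum=", "meta.id")
```

With the sample values above, the nested record resolves to the example URL from the text, "http://www.website.com?pageNum=3454354"; a record lacking the key yields no URL at all, which is the case the next paragraph addresses.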

If the URL cannot be specified in this manner, but there is a single (not necessarily unique) URI related to the document, eg a network resource or a file in a sub-directory of the fileshare, it is recommended to use the structured analysis handler to set the "displayUrl" parameter.