Using the File Harvester

Infinit.e supports harvesting files from Windows/Samba shares or a harvester's local filesystem.

Infinit.e supports harvesting data from a variety of file formats including unstructured text files, semi-structured text files, CSV files, and XML files.

There is a separate reference for the File Harvester configuration object.

There are a number of typical strategies for dealing with standard file formats:

  • "Office" documents (PDF, Word, Powerpoint): converted to text by Tika, can then be entity-extracted as normal.
  • Text-based documents (eg emails): can be entity-extracted as normal.
    • The text can also be cleansed (eg of header/footer information) using the Unstructured Analysis Harvester, which can also extract "structured" information such as author, email distribution, subject, send date, etc.
  • CSV files: can be turned into metadata using the Unstructured Analysis Harvester or using the automated parser configured as described below. See the source gallery for examples of the different permutations.
  • XML and JSON (see below): is automatically turned into metadata, which can then be converted to entities and associations using the Structured Analysis Harvester.
  • Infinit.e shares: Uploaded "binary files" are treated as office/text documents as above (except ZIP files, see below); JSON shares are processed as JSON. Uploaded ZIP files are automatically decompressed on harvest and treated as a directory of other files, which are handled as described above.
  • The results of Infinit.e plugins: The results of map/reduce jobs (or, less commonly, saved queries), eg as could be obtained from Custom - Get Results, can be treated as a directory of JSON files.
    • NOTE: this last input type (custom) has one important limitation: if/when the custom job is re-run, all the "_ids" change, which means that the old documents are retained rather than being overwritten by new documents. Unless this behavior is desired (it normally is not), the source's documents should therefore be manually deleted via the API or GUI. There is a roadmap item to address this better in the near future.

Whenever the Unstructured Analysis Harvester is used to generate metadata, the Structured Analysis Harvester is then used to turn the metadata into entities and associations.
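For example, if the Unstructured Analysis Harvester (or the automated CSV parser described below) has placed a "city" field into the document metadata, a minimal Structured Analysis Harvester sketch for turning it into an entity might look like the following (the metadata field name is hypothetical, and this assumes the usual "$metadata" substitution syntax):

source : {
   ...
   "structuredAnalysis" : {
       "entities" : [
           {
               "disambiguated_name" : "$metadata.csv.city",
               "type" : "Location",
               "dimension" : "Where"
           }
       ]
   }
}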

Harvesting XML Files
Sample XML File Harvester Specification
source : {
   "url": "string", // see below
   ... 
   "file" : {
       "username" : "username", 
       "password" : "password", 
       "domain" : "WORKGROUP", 
	   "type": "xml",
 
       "pathInclude":"^.*[.]xml$",
       "pathExclude":"^.*schema[.]xml$",   
       "XmlRootLevelValues" : ["Incident"],
       "XmlIgnoreValues" : [
           "DefiningCharacteristicList",
           "TargetedCharacteristicList",
           "WeaponTypeList",
           "PerpetratorList",
           "VictimList",
           "EventTypeList",
           "CityStateProvinceList",
           "FacilityList"
       ],
       "XmlSourceName" : "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
       "XmlPrimaryKey" : "icn"
   },
   "useExtractor" : "none",
   ...
}
  • url
    • The URL needs to be in one of the following formats:
      • "file://<path including leading '/'>" - the filesystem local to the harvester will be used.
      • "smb://server:port/path" - the harvester will attempt to connect to the specified Windows/Samba share.
      • "s3://" - for S3 parsing (see below for S3 install details).
      • "inf://share/<shareid>/<ignored>" - to process Infinit.e shares.
      • "inf://custom/<customid-or-jobtitle>/<ignored>" - to process the results of Infinit.e custom jobs.
      • Note the leading "/" is required, eg if the path was "/mnt/test_data", then the URL would be "file:///mnt/test_data", ie 3 slashes.
      • Note also that "file://" sources can only be run by administrators if secure mode is enabled (harvest.secure_mode=true in the configuration).
      • Note also that the local filesystem version is mostly intended for testing, debugging, and "micro installations". The "tomcat" user must have read access to the directories and files on the path.
      • Check that uploaded files are readable by tomcat ("file:") or the username account ("smb://")
      • Note finally that if any of username/password/domain are specified, the URL will be assumed to point to a Windows/Samba share or S3 (see below).
    • S3 is also supported (the URL is in the format "s3://<bucket_name>/" or "s3://<bucket_name>/path/").
      • The files in the S3 bucket should be readable by the account specified by the access key.
      • Note for Admins: S3 is not supported by default, the AWS SDK JAR must be copied into the classpath as described here.
    • Infinit.e share harvesting is described above, and is indicated with the format "inf://share/<shareid>/" (share id can be obtained in the URL field of the file uploader)
      • After the "<shareid>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.
      • The source must share at least one community with the share in order to be processed.
    • Infinit.e custom harvesting is described above, and is indicated with the format "inf://custom/<customid-or-jobtitle>" (custom id and title can be obtained in the URL field of the plugin manager)
      • After the "<customid-or-jobtitle>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.
      • The source must share at least one community with the custom plugin in order to be processed.
    • Regular expressions can be optionally specified to include (pathInclude) only specified files, and/or exclude specified files and directories (pathExclude).
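    • For illustration, typical "url" values (the server names, paths, and IDs shown here are hypothetical) might look like:

      "url": "file:///mnt/test_data/"
      "url": "smb://fileserver:139/shared/reports/"
      "url": "s3://my-bucket/incoming/"
      "url": "inf://share/4e1e8d2bd1a3a85f2ec53dd1/wits-data"
      "url": "inf://custom/my_custom_job/results"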

  • extractType
    The extractType field tells the harvester the type of source to extract from, in this case "File". Other valid values include "Database", "Feed", etc.
  • file
    The file object specifies how to access the data to be extracted, and how to extract the individual fields within the source file.
    • username
    • password
      Note: The password field is currently clear text; the string value placed in password is not encrypted by Infinit.e. Encryption of the password field is planned for a future release. For S3, the Access Key ID should be entered into "username" and the Secret Key into "password" (note: for security it is recommended to create a separate AWS user with no permissions other than S3 read/list on the relevant directories).
    • domain
      The Windows/Samba domain (or workgroup) to authenticate against. (Can be left blank for S3.)
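      • For example, a sketch of an S3 configuration using these fields (the bucket name is a placeholder, and the credentials belong to an AWS user with S3 read/list permissions):

        "url" : "s3://my-bucket/data/",
        "file" : {
            "username" : "<AWS Access Key ID>",
            "password" : "<AWS Secret Key>",
            "domain" : "",
            ...
        }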

    • type
      One of "json", "xml", "tika", "*sv", or null to auto decide
    • XmlRootLevelValues
      The root level field of the XML file at which parsing should begin. (Also works for JSON)
      • For "*sv" files, this results in CSV parsing occurring automatically, and the records are mapped into a metadata object called "csv", with the fieldnames corresponding to the values of this array (eg the 3rd value is named after XmlRootLevelValues[2] etc)
        • The fieldnames can also be derived automatically by setting "XmlIgnoreValues", see below. In this case, "XmlRootLevelValues" need not be set.
      • For office files, these strings are used to configure Tika; there are currently 2 types of configuration supported:
        • "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.
        • Strings of the format "<MEDIATYPE>:{ paramName: paramValue, ...}" - <MEDIATYPE> is in standard MIME format and determines which Tika element to configure; the paramNames and paramValues correspond to functions and arguments - see below.
          • Obviously this configuration process requires some familiarity with Tika's internals 
          • Example: "application/pdf:{'setEnableAutoSpace':false}" ... will call PDFParser.setEnableAutoSpace(false) 
    • XmlIgnoreValues
      XML nodes to ignore when parsing the document. (Ignored for JSON)
      • For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.
        • In addition, the first line matching an ignore value that consists of more than one token-separated field will be used to generate the fieldnames.
          • eg if "XmlIgnoreValues": [ "#" ], and the first three lines are "#", "#header", and "#field1,field2,field3", then the processing will assume the 3 fields are field1, field2, and field3.
          • By default, the matching portion of the line (eg "#" in the example above) is removed. To keep it, simply place the value in quotes (using the specified quote char).
            • eg assuming the quote char is ', then using the ignore value '#' in the above example would return 3 fields: "#field1", "field2" and "field3"
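        • For example, a minimal sketch of a "*sv" configuration that derives its field names from the header line in the example above (no "XmlRootLevelValues" is needed in this case):

          "file" : {
              ...
              "type" : "*sv",
              "XmlIgnoreValues" : [ "#" ]
          }

          Each record would then be parsed with the field names field1, field2, and field3.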
    • XmlSourceName
      If specified, the document URL is built as "XmlSourceName" + xml("XmlPrimaryKey"). Also works for JSON, and "*sv" (when XmlRootLevelValues is set)
    • XmlPrimaryKey
      Primary key field in data set, used to help identify whether a record is new or previously harvested. Also works for JSON (dot notation is supported), and "*sv" (when XmlRootLevelValues is set)
    • XmlAttributePrefix
      For XML only, this string is pre-pended to XML attributes before they become JSON fields.
      • For "*sv" files when XmlRootLevelValues is set controls the separators as follows: the first char in the string is the separator, the (optional) second char in the string is the quote, and the (optional) third char in the string is the escape character (eg the default is ",\"\\")

    • pathInclude, pathExclude: see above under "url"
    • renameAfterParse: If a file is fully harvested it can then be moved or deleted. To delete, set this field to "". To rename, set this field to the full path to which the file should be renamed, using the substitution variables "$path" (the path of the directory in which the file is currently located) and "$file" (just the filename portion), eg "$path/done/$file", "$path/$file.PROCESSED", etc. It is best used in conjunction with pathExclude, eg pathExclude: ".*DELETED" or ".*/done/.*".
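      • For example, to move fully harvested files into a "done" sub-directory and exclude them from subsequent harvests:

        "file" : {
            ...
            "renameAfterParse" : "$path/done/$file",
            "pathExclude" : ".*/done/.*"
        }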
  • useExtractor
    Additional extractor (i.e. other than, or in addition to, the Structured Analysis Harvester) to use to extract entity and association data. Note that the "useTextExtractor" field is not used for files.

For XML and JSON files (or "*sv" files where XmlRootLevelValues is set), where the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.
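For example, in the WITS sample above, a record whose "icn" field contained the (hypothetical) value "12345" would be assigned the document URL "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=12345", ie the "XmlSourceName" string with the primary key value appended.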

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document, eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the Structured Analysis Harvester to set the "displayUrl" parameter.
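A minimal sketch of this approach (assuming the related URI has been extracted into a metadata field called "link", and that the usual "$metadata" substitution applies to this field) might be:

"structuredAnalysis" : {
    "displayUrl" : "$metadata.link"
}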