...

  • extractType
    The extractType field tells the harvester what type of source to extract from, e.g. File. Other valid values include Database, Feed, etc.
  • file
    The File object specifies how to access the data to be extracted and how to extract the individual fields within the source file.
    • username
    • password
      Note: The password field in the Authentication object is currently stored as clear text; Infinit.e does not encrypt the value placed in it. Encryption of the password field is planned for a future release. For S3, enter the Access ID in "username" and the Secret Key in "password". (For security, it is recommended that you create a separate AWS user with no permissions other than S3 read/list on the directories.)
    • domain
      The port that the database accepts incoming connections on. (Can be left blank for S3).

    • type
      One of "json", "xml", "tika", "*sv", or null to decide automatically.
    • XmlRootLevelValues
      The root level field of the XML file at which parsing should begin. (Also works for JSON)
      • For "*sv" files, setting this field causes CSV parsing to occur automatically; the records are mapped into a metadata object called "csv", with the field names taken from the values of this array (e.g. the 3rd value is named after XmlRootLevelValues[2]).
      • For office files, these strings are used to configure Tika. Two types of configuration are currently supported:
        • "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.
        • Strings of the format "MEDIATYPE:{ paramName: paramValue, ...}", where MEDIATYPE is in standard MIME format and determines which Tika element to configure; the paramNames and paramValues correspond to functions and arguments (see below).
          • This configuration process requires some familiarity with Tika's internals.
          • Example: "application/pdf:{'setEnableAutoSpace':false}" ... will call PDFParser.setEnableAutoSpace(false)
    • XmlIgnoreValues
      XML nodes to ignore when parsing the document. (Ignored for JSON)
      • For "*sv" files, any line that starts with one of the strings in this array is ignored. This allows header lines to be skipped.
    • XmlSourceName
      If specified, the document URL is built as "XmlSourceName" + xml("XmlPrimaryKey"). Also works for JSON, and for "*sv" (when XmlRootLevelValues is set).
    • XmlPrimaryKey
      The primary key field in the data set, used to help identify whether a record is new or was previously harvested. Also works for JSON (dot notation is supported), and for "*sv" (when XmlRootLevelValues is set).
    • XmlAttributePrefix
      For XML only, this string is prepended to XML attributes before they become JSON fields.
      • For "*sv" files, when XmlRootLevelValues is set, this string instead controls the separators: the first character is the separator, the (optional) second character is the quote, and the (optional) third character is the escape character (e.g. the default is ",\"\\").
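The "*sv" behaviour of the Xml* fields described above can be sketched in Python. This is only an illustration of the documented rules, not Infinit.e's implementation: the function names parse_sv and document_url, the sample data and the example.com URL are all hypothetical, and falling back to the default quote and escape characters when the second or third character is omitted is an assumption.

```python
import csv

def parse_sv(text, root_level_values, ignore_values=(), attribute_prefix=',"\\'):
    """Illustrative stand-in for "*sv" parsing (hypothetical helper)."""
    # XmlAttributePrefix: 1st char = separator, optional 2nd = quote,
    # optional 3rd = escape. Falling back to the defaults is an assumption.
    separator = attribute_prefix[0]
    quote = attribute_prefix[1] if len(attribute_prefix) > 1 else '"'
    escape = attribute_prefix[2] if len(attribute_prefix) > 2 else "\\"

    # XmlIgnoreValues: skip any line that starts with one of these strings.
    kept = [
        line for line in text.splitlines()
        if line and not any(line.startswith(prefix) for prefix in ignore_values)
    ]

    reader = csv.reader(kept, delimiter=separator, quotechar=quote, escapechar=escape)
    # XmlRootLevelValues: column i is named after root_level_values[i].
    return [dict(zip(root_level_values, row)) for row in reader]

def document_url(source_name, primary_key, record):
    """Document URL = XmlSourceName + value of the XmlPrimaryKey field."""
    return source_name + record[primary_key]

sample = '#id,name,city\n1,"Smith, Jane",Boston\n2,Bob,Austin\n'
records = parse_sv(sample, ["id", "name", "city"], ignore_values=["#"])
for record in records:
    print(document_url("http://example.com/people/", "id", record), record)
```

The header line is dropped because it starts with "#", and the quoted comma in the second line stays inside the "name" field.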

    • pathInclude, pathExclude: see above under "url"
    • renameAfterParse: Once a file has been fully harvested it can be moved or deleted. To delete it, set this field to "". To rename it, set this field to the full path of the new name, using the substitution variables "$path" (the path of the directory in which the file is currently located) and "$file" (just the filename portion), e.g. "$path/done/$file" or "$path/$file.PROCESSED". It is best used in conjunction with pathExclude, e.g. pathExclude: ".*DELETED" or ".*/done/.*".
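The "$path" and "$file" substitution in renameAfterParse can be sketched as follows. This is a minimal Python illustration, not the harvester's own code: resolve_rename and is_excluded are hypothetical names, the file paths are made up, and treating pathExclude as a regular expression that must match the whole path is an assumption.

```python
import posixpath
import re

def resolve_rename(rename_after_parse, file_path):
    """Return the new path for a harvested file, or None if it should be deleted."""
    if rename_after_parse == "":
        # An empty string means "delete the file".
        return None
    directory, filename = posixpath.split(file_path)
    # "$path" is the directory currently holding the file, "$file" the filename.
    return rename_after_parse.replace("$path", directory).replace("$file", filename)

def is_excluded(path_exclude, file_path):
    """Assumption: pathExclude is a regex matched against the whole path."""
    return re.fullmatch(path_exclude, file_path) is not None

new_path = resolve_rename("$path/done/$file", "/data/in/report.csv")
print(new_path)
# The renamed file now matches the exclusion pattern, so it is not harvested again.
print(is_excluded(".*/done/.*", new_path))
```

Pairing the rename target with a matching pathExclude pattern is what keeps already-processed files from being picked up on the next harvest cycle.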

...