Overview

The File Extractor ingests documents from local files, file shares, S3 repositories, and Infinit.e shares (e.g. files uploaded via the file uploader). It can also ingest the output of custom analytic plugins.

...

Code Block
languagejs
{
	"display": "string",
	"file":
	{
		"username": "string", // Username for file share authentication
		"password": "string", // Password for file share authentication
		"domain": "string", // Domain location of the file share

		"pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
		"pathExclude": "string", // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
		"renameAfterParse": "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" or "." deletes the file
			// (eg "$path/processed/$name")

		"type": "string", // One of "json", "xml", "tika", "*sv", or null to auto decide
		"mode": "string", // "normal" (defaults if mode not present), "streaming", see below
		"XmlRootLevelValues": [ "string" ], // The root level value of XML at which parsing should begin
			// also currently used as an optional field for JSON, if present will create a document each time that field is encountered
			// (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
			// (Also reused with completely different meaning for CSV - see below)
			// (In office mode, can be used to configure Tika - see below)
		"XmlIgnoreValues": [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level.
			// (Also reused with completely different meaning for CSV)
		"XmlSourceName": "string", // If present, and a primary key specified below is also found, then the URL gets built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.
		"XmlPrimaryKey": "string", // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
		"XmlPreserveCase": boolean, // default false, converts everything to lower case
		"XmlAttributePrefix": "string" // default: null - if enabled, attributes are converted into tags with this prefix
	}
}
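As an illustration only, a filled-in configuration for an XML file share might look like the sketch below. Every value (display name, credentials, paths, element names, URL) is a hypothetical placeholder rather than a setting taken from this page. The helper `shouldProcess` is likewise a hypothetical name: it approximates how pathInclude and pathExclude might select files, assuming each regex has to match the complete path.

```javascript
// Hypothetical example source configuration; every value is a placeholder.
const exampleSource = {
  display: "Example XML share",
  file: {
    username: "harvester",
    password: "changeme",
    domain: "EXAMPLE",
    pathInclude: ".*\\.xml",
    pathExclude: ".*/processed/.*",
    renameAfterParse: "$path/processed/$name",
    type: "xml",
    mode: "normal",
    XmlRootLevelValues: ["record"],
    XmlSourceName: "http://example.com/records/",
    XmlPrimaryKey: "id",
    XmlPreserveCase: false,
    XmlAttributePrefix: "attr_"
  }
};

// Sketch of include/exclude filtering.
// Assumption: each regex must match the complete path, not just part of it.
function shouldProcess(fileConfig, fullPath) {
  const matchesWhole = (pattern) =>
    new RegExp("^(?:" + pattern + ")$").test(fullPath);
  if (fileConfig.pathExclude && matchesWhole(fileConfig.pathExclude)) {
    return false; // excluded paths are ignored
  }
  if (fileConfig.pathInclude && !matchesWhole(fileConfig.pathInclude)) {
    return false; // an include regex is set and the path does not match it
  }
  return true;
}
```

With these placeholder patterns, `/data/in/a.xml` would be processed, while anything under a `processed` directory would be skipped.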

 

Description

The file extractor ingests various file types from their locations and performs processing based on the configuration.

...

Field | Description | Data Type
username

Username for file share authentication

 string
password

Password for file share authentication

 string
domain

Domain location of the file share

 string
pathInclude

Optional - regex, only files with complete paths matching the regular expression are processed further

 string
pathExclude

Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)

 string
renameAfterParse

Optional. Renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (e.g. "$path/processed/$name"); "" or "." deletes the file.

 string
type

One of "json", "xml", "tika", "*sv", or null to auto decide

 string
mode

Either "normal" (the default if mode is not present) or "streaming".

"mode" (from v0.3) is only applied to the JSON, XML, and *sv file types:

  • In "normal" mode: any time a file containing records is modified then all already-imported records from that file are deleted/updated
  • In "streaming" mode: the enclosing file of the records is ignored

    Warning

    One use case that is not well handled by the current file harvester is ingesting log files that are continuously being written to (as opposed to streamed into a succession of smaller files). The script here provides a sample workaround for that sort of issue.

 string
XmlRootLevelValues

The root level value of XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered (if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects, and a document is created from each one). Also reused with a completely different meaning for CSV (see below). In office mode, can be used to configure Tika (see below).

 array of strings
XmlIgnoreValues

XML values that, when parsed, will be ignored; child elements will still be part of the document metadata, just promoted to the parent level. Also reused with a completely different meaning for CSV.

 array of strings
XmlSourceName

If present, and a primary key specified below is also found, then the URL gets built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.

 string
XmlPrimaryKey

Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.

 string
XmlPreserveCase

By default (false), XML field names are converted to lower case; if this is set to true, the case of the field names is preserved.

 boolean
XmlAttributePrefix

  • For "*sv" files, when XmlRootLevelValues is set, controls the separators as follows: the first char in the string is the separator, the (optional) second char in the string is the quote, and the (optional) third char in the string is the escape character (e.g. the default is ",\"\\")
  • For XML only, this string is prepended to XML attributes before they become JSON fields (otherwise attributes are ignored).
      • E.g. <test id="alex"></test> would map to '{ "test": { "PREFIXid": "alex" } }' with XmlAttributePrefix: "PREFIX", or just an empty object if not specified (the default)

 string
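The renameAfterParse substitution described in the table above can be sketched roughly as follows. This is not the harvester's actual code: the function name `renameTarget` is hypothetical, and the sketch assumes that "$path" stands for the directory containing the file and "$name" for the file name itself.

```javascript
// Sketch of renameAfterParse handling.
// Assumptions: "$path" is the file's directory, "$name" is its file name,
// and an absent setting leaves the file where it is.
function renameTarget(renameAfterParse, fullPath) {
  if (renameAfterParse === undefined || renameAfterParse === null) {
    return { action: "keep", path: fullPath };
  }
  if (renameAfterParse === "" || renameAfterParse === ".") {
    return { action: "delete", path: null }; // "" or "." deletes the file
  }
  const slash = fullPath.lastIndexOf("/");
  const dir = slash >= 0 ? fullPath.slice(0, slash) : "";
  const name = fullPath.slice(slash + 1);
  // Replace every occurrence of the two substitution variables.
  const target = renameAfterParse
    .split("$path").join(dir)
    .split("$name").join(name);
  return { action: "rename", path: target };
}
```

For instance, under these assumptions the pattern "$path/processed/$name" applied to `/data/in/a.xml` yields `/data/in/processed/a.xml`.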
 

Connecting to File Locations

...

For example, consider a header row formatted as follows:

"field1,field2,field3"

In this case, XmlIgnoreValues should be set to the following: [ "\"field1\"" ]

...

For *sv files, XmlAttributePrefix holds the separator, quote, and escape characters, in that order; the default is ",\"\\".

You can set XmlAttributePrefix to change any of these default settings.

For example, the value ";'\\" changes the separator from "," to ";" (and, as written, the quote character from a double quote to a single quote).
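To make the character positions concrete, the sketch below splits such a control string into its three parts and uses them to break a line into fields. The function names `parseSvControl` and `splitLine` are hypothetical, and the sketch assumes that an omitted quote or escape character falls back to the documented default; the harvester's real parsing may differ.

```javascript
// Sketch: interpret the control string as separator, quote and escape.
// Assumption: missing characters fall back to the default ",\"\\".
function parseSvControl(control) {
  const value = control || ",\"\\";
  return {
    separator: value.charAt(0),
    quote: value.length > 1 ? value.charAt(1) : "\"",
    escape: value.length > 2 ? value.charAt(2) : "\\"
  };
}

// Sketch: split one line into fields using those three characters.
function splitLine(line, control) {
  const { separator, quote, escape } = parseSvControl(control);
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === escape && i + 1 < line.length) {
      current += line[++i]; // take the next character literally
    } else if (ch === quote) {
      inQuotes = !inQuotes; // quotes delimit a field but are not part of it
    } else if (ch === separator && !inQuotes) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}
```

Calling `splitLine("a;'b;c';d", ";'\\")` treats ";" as the separator and "'" as the quote, so the quoted "b;c" stays together as one field.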

Specifying Network Resources

For *sv, JSON, and XML files, XmlSourceName and XmlPrimaryKey can be used to identify the unique network resources that link to the documents.
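The URL construction described in the field table (XmlSourceName followed by the value of the primary key field) could be sketched as below. The function name `buildDocumentUrl` and the sample record are hypothetical, and returning null when either setting or the key value is missing is an assumption, not documented behavior.

```javascript
// Sketch: URL = XmlSourceName + record[XmlPrimaryKey].
// Assumption: no URL is built (null) if a setting or the key value is missing.
function buildDocumentUrl(fileConfig, record) {
  const { XmlSourceName, XmlPrimaryKey } = fileConfig;
  if (!XmlSourceName || !XmlPrimaryKey) {
    return null;
  }
  const key = record[XmlPrimaryKey];
  if (key === undefined || key === null) {
    return null;
  }
  return XmlSourceName + String(key);
}

// Hypothetical usage with placeholder values.
const url = buildDocumentUrl(
  { XmlSourceName: "http://example.com/records/", XmlPrimaryKey: "id" },
  { id: 42, title: "sample" }
);
```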

...