Overview

The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (eg uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

...

Field Description Data Type

username

Username for file share authentication

string

password

Password for file share authentication

string

domain

Domain location of the file share

string

pathInclude

Optional - regex, only files with complete paths matching the regular expression are processed further

string

pathExclude

Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)

string

renameAfterParse

Optional - renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (eg "$path/processed/$name"); setting the value to "" or "." deletes the file instead.

string

type

One of "json", "xml", "tika", "*sv", or null to auto decide

string

mode

"normal" (defaults if mode not present), "streaming", see below

"mode" (from v0.3) is only applied in JSON/XML/*sv modes

In "normal" mode: any time a file containing records is modified then all already-imported records from that file are deleted/updated

In "streaming" mode: the enclosing file of the records is ignored

Warning
One use case that is not well handled by the current file harvester is ingesting log files that are continuously being written to (as opposed to streamed into a succession of smaller files). The script here provides a sample workaround for that sort of issue.

string

XmlRootLevelValues

The root level value of the XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered; if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects and a document is created from each one. (Also reused with a completely different meaning for CSV - see below. In office mode, it can be used to configure Tika - see below.)

Info

If any of the root level (or ignore) values contain a prefix (eg "stix:IpAddress"), then prefixes are used to compare against all fields.

Eg:

"<test:IpAddress></test:IpAddress>" will match "XmlRootLevelValues": [ "IpAddress" ] but not [ "other:IpAddress" ]
"<test:EmailAddress></test:EmailAddress>" will match [ "EmailAddress" ] but not [ "EmailAddress", "other:IpAddress" ]
- (because specifying "other:IpAddress" means that it will be in "prefix mode" - you'd then need to specify "test:EmailAddress" instead of "EmailAddress")

Note also that (currently) the resulting JSON does not include the prefix, even in "prefix mode", eg "<test:Attribute>blah</test:Attribute>" will map to "Attribute": "blah".

string

XmlIgnoreValues

XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. (Also reused with completely different meaning for CSV - see below)

(Note that if the ignore value contains a prefix, then all XML processing is performed in "prefix mode" - see above under "XmlRootLevelValues")

string

XmlSourceName

If present, and the primary key specified below is also found, then the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.

string

XmlPrimaryKey

This key is used to build the URL as described above. Also supported for JSON and CSV.

string

XmlPreserveCase

By default, XML field names are converted to lower case. If this is true, then the case of the field names is preserved.

XmlAttributePrefix

For "*sv" files, when XmlRootLevelValues is set, this controls the separators as follows: the first character in the string is the separator, the (optional) second character is the quote, and the (optional) third character is the escape character (eg the default is ",\"\\").

For XML only, this string is pre-pended to XML attributes before they become JSON fields. (otherwise attributes are ignored)

Eg <test id="alex"></test> would map to '{ "test": { "PREFIXid": "alex" } }' with XmlAttributePrefix: "PREFIX", or to just an empty object if XmlAttributePrefix is not specified (the default).
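
The "prefix mode" rule described under XmlRootLevelValues can be easier to follow as code. The following is a minimal Python sketch of the documented examples only, not the platform's implementation; the function names (local_name, in_prefix_mode, matches_root_level) are hypothetical and do not appear anywhere in the platform.

```python
from typing import Iterable, Sequence


def local_name(qualified_name: str) -> str:
    """Drop a namespace prefix, eg 'test:Attribute' -> 'Attribute'."""
    return qualified_name.split(":", 1)[-1]


def in_prefix_mode(root_values: Iterable[str],
                   ignore_values: Iterable[str] = ()) -> bool:
    """Prefix mode applies as soon as ANY configured value contains a prefix."""
    return any(":" in value for value in [*root_values, *ignore_values])


def matches_root_level(element: str,
                       root_values: Sequence[str],
                       ignore_values: Sequence[str] = ()) -> bool:
    """Model of whether an XML element name matches XmlRootLevelValues."""
    if in_prefix_mode(root_values, ignore_values):
        # Prefix mode: the full "prefix:name" must be listed.
        return element in root_values
    # Otherwise the element's prefix is disregarded.
    return local_name(element) in root_values


# The examples from the table above:
print(matches_root_level("test:IpAddress", ["IpAddress"]))
print(matches_root_level("test:IpAddress", ["other:IpAddress"]))
print(matches_root_level("test:EmailAddress", ["EmailAddress", "other:IpAddress"]))
```

In this model a single prefixed entry in either list switches every comparison to full names, which is why adding "other:IpAddress" stops "EmailAddress" from matching <test:EmailAddress>.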

Connecting to File Locations

...

If you take this approach, it is necessary to both identify the header row and to specify how the field names will be identified.

Header Prefixed by String:

If XmlIgnoreValues is set to [ "#" ], and the first three lines are "#", "#header", and "#field1,field2,field3", then the processing will assume the 3 fields are field1, field2, and field3.

...

Code Block
"processingPipeline": [
    {
        "file": {
            "XmlIgnoreValues": [ "#" ],
            "domain": "DOMAIN",
            "password": "PASSWORD",
            "type": "csv",
            "username": "USER",
            "url": "smb://FILESHARE:139/cyber_logs/"
        }
    },

Header Not Prefixed by String:

In the case where the first header is not prefixed by a string, it is still necessary to identify it as the header row.

For example, consider a header row formatted as follows:

"field1","field2","field3"

(no "comment" prefix and wrapped in quotes)

In this case, XmlIgnoreValues should be set to the following: [ "\"field1\"" ] (ie the first header wrapped in whatever the quote character is, " by default). This identifies the header row and preserves field1 as a column header name. The leading quote tells the platform not to strip away the remainder of the string, so the columns will be named field1, field2, etc.

As another example, consider a header row formatted as follows:

field1,field2,field3

(no "comment" prefix and not wrapped in quotes)

In this case, XmlIgnoreValues should be set to the following: [ "\"field1" ] (ie the first header starting with whatever the quote character is, " by default, but not ending with a quote; eg "\"field1,f" would be equivalent, but "\"field1\"" would not work).

This acts as a special instruction to the platform not to strip the remainder of the string, so the columns will be named field1, field2, etc. By contrast, (eg) "field1,f" would be stripped from the start of the header row, leaving the columns named ield2, field3, etc.

(Worth noting again, all of the above complication is only necessary if using the auto-calculation of header fields. If you are just using XmlRootLevelValues to specify them, just "field1" is fine.)
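
The header examples in this section can be reproduced with a small Python model. This is a sketch under stated assumptions, not the platform's code: the function name columns_from_header and the exact matching rule (a value with a leading quote matches either with or without that quote, and then nothing is stripped) are inferred from the examples above rather than documented behaviour.

```python
import csv
from typing import List, Optional


def columns_from_header(line: str, ignore_value: str,
                        sep: str = ",", quote: str = '"') -> Optional[List[str]]:
    """Return the column names a header line would yield, or None if the
    ignore value does not identify this line as a header (assumed rule)."""
    if ignore_value.startswith(quote):
        # Leading quote: special instruction, keep the whole line.
        unquoted = ignore_value[len(quote):]
        if not (line.startswith(ignore_value) or line.startswith(unquoted)):
            return None
        remainder = line
    elif line.startswith(ignore_value):
        # Plain value: strip it from the start of the line.
        remainder = line[len(ignore_value):]
    else:
        return None
    return next(csv.reader([remainder], delimiter=sep, quotechar=quote))


# "#" prefix, quoted header, unquoted header, and the "field1,f" pitfall:
print(columns_from_header("#field1,field2,field3", "#"))
print(columns_from_header('"field1","field2","field3"', '"field1"'))
print(columns_from_header("field1,field2,field3", '"field1'))
print(columns_from_header("field1,field2,field3", "field1,f"))
```

The last call shows why the leading quote matters for unprefixed headers: without it, the matched text is removed and the first column name is lost.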

Separator, Quote, Escape

For *sv files, XmlAttributePrefix sets the separator, quote, and escape characters, in that order. The default setting is ",\"\\".

You can change the values of XmlAttributePrefix to change any of these default settings.

For example, setting XmlAttributePrefix to ";'\\" changes the separator from "," to ";" (note that this value also changes the quote character from " to '; use ";\"\\" to change only the separator).
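
Because the value is a JSON string, the escaping can hide the fact that it is only three characters long. The following Python sketch decodes the string and splits it by position as described above. SvFormat and parse_sv_format are hypothetical names, and returning None for an omitted quote or escape character is an assumption, since this page does not say how the platform treats omitted characters.

```python
import json
from typing import NamedTuple, Optional


class SvFormat(NamedTuple):
    separator: str
    quote: Optional[str]   # None = not specified in the string
    escape: Optional[str]  # None = not specified in the string


def parse_sv_format(spec: str) -> SvFormat:
    """Split an XmlAttributePrefix value into separator, quote, escape."""
    if not spec:
        raise ValueError("at least a separator character is required")
    quote = spec[1] if len(spec) > 1 else None
    escape = spec[2] if len(spec) > 2 else None
    return SvFormat(spec[0], quote, escape)


# The default as it appears in the JSON source configuration: ",\"\\"
default_spec = json.loads(r'",\"\\"')
print(len(default_spec))              # three characters once decoded
print(parse_sv_format(default_spec))

# The semicolon example from the text: ";'\\"
print(parse_sv_format(json.loads(r'";' + "'" + r'\\"')))
```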

Specifying Network Resources

For CSV*, JSON and XML files, it is possible to use XmlSourceName and XmlPrimaryKey to identify the unique network resources that link to the documents.
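
As a rough illustration of the rule given in the field table (URL = XmlSourceName + record[XmlPrimaryKey]), here is a minimal Python sketch. The function name build_document_url, the sample record, and the example.com address are all hypothetical; returning None simply means the rule does not apply, and what URL the platform assigns in that case is not covered here.

```python
from typing import Any, Mapping, Optional


def build_document_url(record: Mapping[str, Any],
                       source_name: Optional[str],
                       primary_key: Optional[str]) -> Optional[str]:
    """Concatenate XmlSourceName with the record's primary key value."""
    if not source_name or not primary_key:
        return None  # both settings are needed for the rule to apply
    value = record.get(primary_key)
    if value is None:
        return None  # primary key not found in this record
    return f"{source_name}{value}"


# Hypothetical record parsed from a CSV, JSON, or XML file:
record = {"id": "1234", "name": "alex"}
print(build_document_url(record, "http://example.com/users/", "id"))
print(build_document_url(record, "http://example.com/users/", "missing"))
```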

...
