Overview

The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (eg files uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

...

Field Description Data Type

username

Username for file share authentication

string

password

Password for file share authentication

string

domain

Domain location of the file share

string

pathInclude

Optional - regex, only files with complete paths matching the regular expression are processed further

string

pathExclude

Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)

string

renameAfterParse

Optional - renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (eg "$path/processed/$name"); setting the value to "" or "." deletes the file instead.

string

type

One of "json", "xml", "tika", "*sv", or null to decide automatically

string

mode

One of "normal" (the default if "mode" is not present) or "streaming", see below

"mode" (from v0.3) is only applied in JSON/XML/*sv modes

In "normal" mode: whenever a file containing records is modified, all already-imported records from that file are deleted/updated

In "streaming" mode: the enclosing file of the records is ignored

Warning
One use case that is not well handled by the current file harvester is ingesting log files that are continuously being written to (as opposed to streamed into a succession of smaller files). The script here provides a sample workaround for that sort of issue.

string

XmlRootLevelValues

The root level value of the XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered (if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects, and a document is created from each one). Also reused with a completely different meaning for CSV - see below. In office mode, it can be used to configure Tika - see below.

string

XmlIgnoreValues

XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. // (Also reused with completely different meaning for CSV)

string

XmlSourceName

If present, and the primary key specified below is also found, then the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.

string

XmlPrimaryKey

Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.

string

XmlAttributePrefix

For "*sv" files when XmlRootLevelValues is set, controls the separators as follows: the first character in the string is the separator, the (optional) second character is the quote, and the (optional) third character is the escape character (eg the default is ",\"\\")

For XML only, this string is prepended to XML attributes before they become JSON fields.
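
The three-character convention described for "*sv" files can be sketched in Python as follows. This is an illustration only, not the harvester's own parser: parse_sv_controls and split_line are hypothetical helpers, and falling back to the documented default when the optional characters are omitted is an assumption.

```python
import csv
import io

# Documented default: separator ",", quote '"', escape "\"
DEFAULT_CONTROLS = ',"\\'


def parse_sv_controls(prefix=None):
    """Split an XmlAttributePrefix-style string into (separator, quote, escape)."""
    controls = prefix if prefix else DEFAULT_CONTROLS
    separator = controls[0]
    # Assumption: omitted optional characters fall back to the defaults.
    quote = controls[1] if len(controls) > 1 else DEFAULT_CONTROLS[1]
    escape = controls[2] if len(controls) > 2 else DEFAULT_CONTROLS[2]
    return separator, quote, escape


def split_line(line, prefix=None):
    """Split one line of a "*sv" file using the three control characters."""
    separator, quote, escape = parse_sv_controls(prefix)
    reader = csv.reader(
        io.StringIO(line),
        delimiter=separator,
        quotechar=quote,
        escapechar=escape,
    )
    return next(reader)
```

For example, split_line("a|'b|c'|d", "|'") treats "|" as the separator and "'" as the quote, so the quoted "b|c" stays together as one field.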

Connecting to File Locations

...

These are described below

Specifying the

...

Field Names Manually

You can use XmlRootLevelValues to set the field names.

...

Code Block
"processingPipeline": [
    {
        "file": {
            "XmlIgnoreValues": [ "#" ],
            "domain": "DOMAIN",
            "password": "PASSWORD",
            "type": "csv",
            "username": "USER",
            "url": "smb://FILESHARE:139/cyber_logs/"
        }
    },

Separator, Quote, Escape

If "XmlIgnoreValues" is "#", and the first three lines of the file are "#", "#header", and "#field1,field2,field3", then the processing will assume that the 3 fields are field1, field2, and field3.

By default, the matching portion of the line (eg "#" in the example above) is removed.

...

eg assuming the quote character is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3"
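
The header detection described above can be sketched in Python. This is a hedged illustration, not the harvester's code: infer_field_names and its keep_match flag are hypothetical, and two rules are assumptions rather than documented behaviour, namely that the header is the first ignored line that splits into more than one field, and that the search stops at the first data line.

```python
def infer_field_names(lines, ignore_values, separator=",", keep_match=False):
    """Return field names taken from a comment-style header line, or None.

    lines         -- the leading lines of the file, without line endings
    ignore_values -- XmlIgnoreValues-style prefixes, eg ["#"]
    keep_match    -- hypothetical flag: keep the matching prefix in the result
    """
    for line in lines:
        marker = next((m for m in ignore_values if line.startswith(m)), None)
        if marker is None:
            # Assumption: a header can only appear before the first data line.
            return None
        body = line if keep_match else line[len(marker):]
        fields = body.split(separator)
        # Assumption: a header line must contain more than one field.
        if len(fields) > 1:
            return fields
    return None
```

With the three example lines "#", "#header" and "#field1,field2,field3", the first two are skipped because they hold a single field, and the third supplies the field names. Passing keep_match=True leaves the "#" on the first name.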

XmlAttributePrefix:

Info

For XML and JSON files (and "*sv" files where XmlRootLevelValues is set), where the document(s) within the file reference a unique network resource that is of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, then "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.
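
The "CONSTANT_URL_PATH + VARIABLE_ID" construction can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the harvester's implementation: lookup and build_url are hypothetical helpers, the record is modelled as a plain nested dict, and returning None when the key is missing is an assumption.

```python
def lookup(record, dotted_key):
    """Resolve a dot-notation key (eg "page.num") against a nested dict."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def build_url(record, source_name, primary_key):
    """Build the document URL as XmlSourceName + record[XmlPrimaryKey]."""
    if not source_name or not primary_key:
        return None
    key_value = lookup(record, primary_key)
    # Assumption: no URL is built when the primary key is absent.
    if key_value is None:
        return None
    return source_name + str(key_value)
```

For example, with XmlSourceName set to "http://www.website.com?pageNum=" and XmlPrimaryKey set to "page.num", a JSON object of {"page": {"num": 3454354}} yields the URL used as the example above.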

...

Versions Compared

Old Version 30

New Version 31
