Overview
...
The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (e.g. files uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.
...
Field | Description | Data Type |
---|---|---|
username | Username for file share authentication | string |
password | Password for file share authentication | string |
domain | Domain location of the file share | string |
pathInclude | Optional - regex, only files with complete paths matching the regular expression are processed further | string |
pathExclude | Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed) | string |
renameAfterParse | Optional - renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (e.g. "$path/processed/$name"); "" or "." deletes the file instead | string |
type | One of "json", "xml", "tika", "*sv", or null to decide automatically | string |
mode | "normal" (the default if mode is not present) or "streaming", see below. "mode" (from v0.3) is only applied in JSON/XML/*sv modes | string |
XmlRootLevelValues | The root level value of the XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered; if left blank, the file is assumed to consist of a list of concatenated JSON objects and a document is created from each one. (Also reused with a completely different meaning for CSV - see below.) (In office mode, can be used to configure Tika - see below.) | string |
XmlIgnoreValues | XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. (Also reused with a completely different meaning for CSV.) | string |
XmlSourceName | If present, and the primary key specified below is also found, then the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV. | string |
XmlPrimaryKey | Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV. | string |
XmlAttributePrefix | For XML only, this string is prepended to XML attributes before they become JSON fields. | |
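The path filters and the rename option in the table above can be illustrated with a minimal Python sketch. This is not the extractor's own code: the function names are hypothetical, "matching" is assumed to mean that the regex must match the complete path, and "$path" is assumed to be the directory containing the file.

```python
import re
from typing import Optional


def should_process(path: str,
                   path_include: Optional[str] = None,
                   path_exclude: Optional[str] = None) -> bool:
    """Return True if a file's complete path passes both optional filters."""
    # Assumption: the regex has to match the whole path, not just a substring.
    if path_exclude is not None and re.fullmatch(path_exclude, path):
        return False
    if path_include is not None and not re.fullmatch(path_include, path):
        return False
    return True


def rename_target(path: str, name: str, rename_after_parse: str) -> Optional[str]:
    """Expand "$path" and "$name"; None means the file would be deleted."""
    if rename_after_parse in ("", "."):
        return None
    return rename_after_parse.replace("$path", path).replace("$name", name)
```

For instance, under these assumptions `rename_target("/data/in", "a.csv", "$path/processed/$name")` gives "/data/in/processed/a.csv", and an empty `renameAfterParse` value gives `None` (delete).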
Connecting to File Locations
...
Code Block |
---|
{ "description": "wits test", "isPublic": true, "mediaType": "Report", "searchCycle_secs": -1, "tags": [ "incidents", "nctc", "terrorism", "wits", "events", "worldwide" ], "title": "wits test", "processingPipeline": [ { "file": { "XmlIgnoreValues": [ "DefiningCharacteristicList", "TargetedCharacteristicList", "WeaponTypeList", "PerpetratorList", "VictimList", "EventTypeList", "CityStateProvinceList", "FacilityList" ], "XmlPrimaryKey": "icn", "XmlRootLevelValues": [ "Incident" ], "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=", "domain": "XXX", "password": "XXX", "username": "XXX", "url": "smb://modus:139/wits/allfiles/" } }, |
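The URL construction configured above (XmlSourceName + xml[XmlPrimaryKey]) can be sketched in Python as follows. The function name and the "EXAMPLE_ICN" value are hypothetical, and returning `None` when the key is absent is an assumption rather than documented behaviour.

```python
from typing import Optional


def build_document_url(xml_source_name: str, record: dict,
                       xml_primary_key: str) -> Optional[str]:
    """Return XmlSourceName + record[XmlPrimaryKey], or None if either is missing."""
    key_value = record.get(xml_primary_key)
    if not xml_source_name or key_value is None:
        return None  # assumption: no URL is built without both parts
    return xml_source_name + str(key_value)


source_name = ("https://wits.nctc.gov/FederalDiscoverWITS/index.do"
               "?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=")
record = {"icn": "EXAMPLE_ICN"}  # hypothetical parsed record
url = build_document_url(source_name, record, "icn")
```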
Configuring CSV/SV
There are two options for configuring CSV:
- Specify the field names manually
- Derive the field names from the header
These are described below.
For SV type files, the root level values and field names can be set manually or automatically.
Specifying the Field Names Manually
You can use XmlRootLevelValues to set the root level values/field names.
In the source example below, the field names will correspond to the included array: "device", "date", "srcIP", etc.
...
Code Block |
---|
"fullText": "SCANNER_1 , 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States", "mediaType": ["Log"], "metadata": {"info": [{ "alert": "DUMMY_ALERT_TYPE_1 ", "country": "United States", "date": "2012-01-01T13:43:00", "device": "SCANNER_1 ", "dstIP": "66.66.66.66", "srcIP": " 10.0.0.1" |
Deriving Field Names Automatically
The field names can also be derived automatically from the headers.
The field "XmlIgnoreValues" is used to identify the headers: the start of each line is compared to each element in "XmlIgnoreValues", and if it matches, that line is designated as a header and does not generate a document.
Furthermore, if the header line contains the right number of fields, then it is used to generate the field names used in the "csv" object.
For example, consider CSV data whose header lines start with the # character.
If you take this approach, it is necessary both to identify the header row and to specify how the field names will be identified.
Header Prefixed by String:
If "XmlIgnoreValues" is set to "#", and the first three lines are "#", "#header", and "#field1,field2,field3", then the processing will assume the 3 fields are field1, field2, and field3.
By default, the matching portion of the line (e.g. "#" in the example above) is removed.
To keep it, simply place the value in quotes (using the specified quote character).
E.g. assuming the quote character is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3".
In the example log file below, the header row is prefixed by '#'.
Code Block |
---|
#Date,Device,SrcIP,dstIP,Alert,Country SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States |
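The header matching described above can be sketched in Python. This is an illustrative approximation, not the extractor's implementation: the function names are hypothetical, fields are split naively on the separator, and "the right number of fields" is assumed to mean the same count as the first data row.

```python
from typing import List, Optional, Tuple


def strip_header_prefix(line: str, ignore_value: str, quote: str = '"') -> Optional[str]:
    """Return the header text if `line` starts with `ignore_value`, else None."""
    # A value wrapped in the quote character keeps the matching portion.
    keep = (len(ignore_value) >= 2
            and ignore_value.startswith(quote) and ignore_value.endswith(quote))
    prefix = ignore_value[1:-1] if keep else ignore_value
    if not prefix or not line.startswith(prefix):
        return None
    return line if keep else line[len(prefix):]


def parse_sv(lines: List[str], ignore_values: List[str], separator: str = ",",
             quote: str = '"') -> Tuple[Optional[List[str]], List[List[str]]]:
    """Split lines into (field_names, data_rows); header lines produce no rows."""
    headers, rows = [], []
    for line in lines:
        header_text = None
        for value in ignore_values:
            header_text = strip_header_prefix(line, value, quote)
            if header_text is not None:
                break
        if header_text is None:
            rows.append(line.split(separator))
        else:
            headers.append(header_text.split(separator))
    # Assumption: a header supplies the field names only if its field count
    # equals that of the first data row.
    expected = len(rows[0]) if rows else 0
    field_names = next((h for h in headers if len(h) == expected), None)
    return field_names, rows


log_lines = [
    "#Date,Device,SrcIP,dstIP,Alert,Country",
    "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States",
]
names, rows = parse_sv(log_lines, ["#"])
```

With the example log file above, `names` holds the six header fields with the leading "#" removed, and `rows` holds the single data row.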
...
In the example source below, XmlIgnoreValues identifies the header line using "#", and no document is generated for it. The field names are then derived from the header using the separator ",".
Code Block |
---|
"processingPipeline": [ { "file": { "XmlIgnoreValues": [ "#" ], "domain": "DOMAIN", "password": "PASSWORD", "type": "csv", "username": "USER", "url": "smb://FILESHARE:139/cyber_logs/" } }, |
...
}, |
Header Not Prefixed by String:
In the case where the first header is not prefixed by a string, it is still necessary to identify it as the header row.
For example, consider a header row formatted as follows:
"field1,field2,field3"
In this case, XmlIgnoreValues should be set to the following: [ "\"field1\"" ]
This identifies the header row and, because the value is quoted, preserves field1 as the first field name.
Separator, Quote, Escape
For SV files, XmlAttributePrefix is used to set the separator, quote and escape characters. The default is ",\"\\" (that is, separator ",", quote character " and escape character \).
You can change the value of XmlAttributePrefix to override any of these defaults.
For example, to change the separator from "," to ";", the following configuration is required: ";'\\" (note that this value also sets the quote character to ').
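To make the three-character convention concrete, here is a minimal Python sketch that unpacks an XmlAttributePrefix value and applies it when splitting a line. The function names are hypothetical, and requiring exactly three characters is an assumption; the extractor itself may be more lenient.

```python
import csv
from typing import List, Optional, Tuple

DEFAULT_SV_CHARS = ',"\\'  # separator, quote, escape (the documented default)


def parse_sv_chars(xml_attribute_prefix: Optional[str] = None) -> Tuple[str, str, str]:
    """Unpack the setting into (separator, quote, escape) characters."""
    chars = xml_attribute_prefix or DEFAULT_SV_CHARS
    if len(chars) != 3:
        # Assumption: exactly three characters are expected.
        raise ValueError("expected three characters: separator, quote, escape")
    return chars[0], chars[1], chars[2]


def split_line(line: str, xml_attribute_prefix: Optional[str] = None) -> List[str]:
    """Split one line into fields using the configured characters."""
    separator, quote, escape = parse_sv_chars(xml_attribute_prefix)
    reader = csv.reader([line], delimiter=separator,
                        quotechar=quote, escapechar=escape)
    return next(reader)
```

For example, `split_line("a;'b;c';d", ";'\\")` treats ";" as the separator and "'" as the quote character, so the quoted "b;c" stays together as one field.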
Info |
---|
...