Overview
The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (e.g. uploaded via the File Uploader). It can also be used to ingest the output of custom analytic plugins.
...
Field | Description | Data Type
---|---|---
username | Username for file share authentication | string
password | Password for file share authentication | string
domain | Domain location of the file share | string
pathInclude | Optional - regex; only files whose complete paths match the regular expression are processed further | string
pathExclude | Optional - regex; files whose complete paths match the regular expression are ignored (and matching directories are not traversed) | string
renameAfterParse | Optional - renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (e.g. "$path/processed/$name"); "" or "." deletes the file | string
type | One of "json", "xml", "tika", "*sv", or null to auto-decide | string
mode | "normal" (the default if mode is not present) or "streaming" - see below. "mode" (from v0.3) is only applied in JSON/XML/*sv modes | string
XmlRootLevelValues | The root-level value of the XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered (if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects and a document is created from each one). Also reused with a completely different meaning for CSV - see below. (In office mode, can be used to configure Tika - see below) | string
XmlIgnoreValues | XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. (Also reused with a completely different meaning for CSV) | string
XmlSourceName | If present, and the primary key specified below is also found, then the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV | string
XmlPrimaryKey | Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV | string
XmlAttributePrefix | For XML only, this string is prepended to XML attributes before they become JSON fields | string
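The pathInclude/pathExclude filtering and renameAfterParse substitution described above can be sketched in Python. This is illustrative only, not the extractor's actual code; in particular, whether the harvester anchors the regex or requires a full match of the complete path is an assumption (a full match is used here):

```python
import re

def should_process(path, path_include=None, path_exclude=None):
    """Sketch of pathInclude/pathExclude: a file is processed only if its
    complete path matches pathInclude (when set) and does not match
    pathExclude (when set). Full-match anchoring is an assumption."""
    if path_exclude and re.fullmatch(path_exclude, path):
        return False
    if path_include and not re.fullmatch(path_include, path):
        return False
    return True

def rename_target(path, name, rename_after_parse):
    """Sketch of renameAfterParse: "" or "." means delete the file;
    otherwise "$path" and "$name" are substituted into the template."""
    if rename_after_parse in ("", "."):
        return None  # delete the file after ingest
    return rename_after_parse.replace("$path", path).replace("$name", name)

print(should_process("smb://host/share/logs/a.csv", path_include=r".*\.csv$"))
print(rename_target("smb://host/share", "a.csv", "$path/processed/$name"))
```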
Connecting to File Locations
...
Code Block |
---|
"description": "A large set of tweets related to Super Storm Sandy", "isApproved": true, "isPublic": false, "mediaType": "Social", "tags": [ "twitter", "gnip" ], "title": "Super Storm Sandy - Twitter: SANDY_SUBSTRING", "processingPipeline": [ { "file": { "XmlPrimaryKey": "link", "XmlSourceName": "", "XmlRootLevelValues": [], "domain": "XXX", "password": "XXX", "username": "XXX", "url": "smb://HOST:139/SHARE/PATH/TO/" } }, |
Configuring XML
You can use XmlRootLevelValues to set the root object for XML file parsing. In the example below, the field "Incident" is set as the root object.
In addition, the parameter XmlIgnoreValues is used to ignore certain XML nodes in the XML document.
XmlPrimaryKey identifies the primary key in the data set, and is used to help determine whether a record is new or previously harvested.
XmlSourceName is used to build the URL of the new document that will be generated by the file extraction.
...
Code Block |
---|
{
"description": "wits test",
"isPublic": true,
"mediaType": "Report",
"searchCycle_secs": -1,
"tags": [
"incidents",
"nctc",
"terrorism",
"wits",
"events",
"worldwide"
],
"title": "wits test",
"processingPipeline": [
{
"file": {
"XmlIgnoreValues": [
"DefiningCharacteristicList",
"TargetedCharacteristicList",
"WeaponTypeList",
"PerpetratorList",
"VictimList",
"EventTypeList",
"CityStateProvinceList",
"FacilityList"
],
"XmlPrimaryKey": "icn",
"XmlRootLevelValues": [
"Incident"
],
"XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
"domain": "XXX",
"password": "XXX",
"username": "XXX",
"url": "smb://modus:139/wits/allfiles/"
}
}, |
Configuring *SV
For SV type files, the root level values and field names can be set manually or automatically.
...
You can use XmlRootLevelValues to set the root-level values/field names.
In the source example below, the field names will correspond to the included array: "device", "date", "srcIP", etc.
...
...
"device","date", "srcIP" etc.
Code Block |
---|
"processingPipeline": [ { "dstIPfile",: { "alertXmlRootLevelValues",: [ "countrydevice" , ], "XmlIgnoreValuesdate": [, "device,date,srcIP", ], "dstIP", "domain": "DOMAINalert", "password": "PASSWORDcountry", "type": "csv"], "usernameXmlIgnoreValues": "USER", [ "url": "smb://FILESHARE:139/cyber_logs/" "device,date,srcIP" } }, |
When you do this, CSV parsing occurs automatically and the records are mapped into a metadata object called "csv" with the field names corresponding to the values of this array.
For example, here is the metadata that is generated using the above source:
Code Block |
---|
"fullText": "SCANNER_1 ], 2012-01-01T13:43:00 , 10.0.0.1 , 66.66.66.66 , DUMMY_ALERT_TYPE_1 , United States", "mediaTypedomain": ["LogDOMAIN"],, "metadatapassword": {"info": [{PASSWORD", "alerttype": "DUMMY_ALERT_TYPE_1 csv", "countryusername": "United StatesUSER", "date": "2012-01-01T13:43:00", "deviceurl": "SCANNER_1 ",smb://FILESHARE:139/cyber_logs/" "dstIP": "66.66.66.66", } "srcIP": " 10.0.0.1" }, |
Deriving Field Names Automatically
...
Header Prefixed by String:
If " XmlIgnoreValues
is set to: "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.
...
In the example source below, XmlIgnoreValues is set to "#" so that the header is identified automatically; the field names are then derived using the separator ",".
...
In the case where the first header is not prefixed by a string, it is still necessary to identify it as the header row.
For example, consider a header row formatted as follows:
"field1,field2,field3"
In this case, XmlIgnoreValues should be set to the following: [ "field1" ]
...
For *SV files, XmlAttributePrefix is used to establish the separator, quote, and escape characters; the default setting is ",\"\\" (separator ",", quote "\"", escape "\\").
You can change the value of XmlAttributePrefix to change any of these default settings.
For example, to change the separator from "," to ";" (and the quote character to "'"), the following configuration is required: ";'\\"
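The three characters packed into XmlAttributePrefix map directly onto the parameters of a standard CSV parser. A sketch using Python's csv module as an analogy (not the extractor's actual parser):

```python
import csv
import io

def parse_sv(text, attr_prefix=',"\\'):
    """Sketch: the 3 characters of XmlAttributePrefix are, in order,
    the separator, quote and escape characters (default ',', '"', '\\')."""
    sep, quote, esc = attr_prefix[0], attr_prefix[1], attr_prefix[2]
    reader = csv.reader(io.StringIO(text), delimiter=sep,
                        quotechar=quote, escapechar=esc)
    return list(reader)

print(parse_sv('x,"y,z",w'))          # default ",\"\\" settings
print(parse_sv("a;'b;c';d", ";'\\"))  # separator ';', quote "'"
```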
Specifying Network Resources
...
For CSV*, JSON and XML files, it is possible to use XmlSourceName and XmlPrimaryKey to identify the unique network resources that link to the documents.
The document's unique network resource must be in the following format: "CONSTANT_URL_PATH + VARIABLE_ID"
...
Also, the "VARIABLE_ID"
...
must be one of the fields in the XML/JSON object.
*XmlRootLevelValues must be set
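The URL rule above ("CONSTANT_URL_PATH + VARIABLE_ID") can be sketched as follows. build_doc_url is a hypothetical helper, and treating the parsed record as a dict is an assumption; the XmlSourceName and XmlPrimaryKey values are taken from the WITS example earlier in this page:

```python
def build_doc_url(record, source_name, primary_key):
    """Sketch: the document URL is XmlSourceName (the constant path)
    plus the record's primary-key value (the variable id).
    Returns None when the key is absent from the record."""
    key = record.get(primary_key)
    return source_name + str(key) if key is not None else None

incident = {"icn": "200458431"}  # hypothetical parsed XML record
url = build_doc_url(
    incident,
    "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
    "icn")
print(url)
```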
Info |
---|
If it is not possible to specify the URL in this manner, but there is a single - not necessarily unique - URI related to the document (e.g. either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the "Document metadata" element downstream (via the structured analysis handler) to set the network location using the "displayUrl" parameter. |
...
Legacy documentation:
File object