Overview

The File Extractor ingests documents from local files, file shares, S3 repositories, and Infinit.e shares (e.g. uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

Format

Code Block

{
	"display": string,
	"file":
	{
		"username" : "string", // Username for file share authentication
		"password" : "string", // Password for file share authentication
		"domain" : "string", // Domain location of the file share

		"pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
		"pathExclude": "string", // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
		"renameAfterParse": "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" or "." deletes the file
			// (eg "$path/processed/$name")

		"type": "string", // One of "json", "xml", "tika", "*sv", or null to auto decide
		"mode": "string", // "normal" (defaults if mode not present), "streaming", see below
		"XmlRootLevelValues" : [ "string" ], // The root level value of XML at which parsing should begin
			// also currently used as an optional field for JSON, if present will create a document each time that field is encountered
			// (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
			// (Also reused with completely different meaning for CSV - see below)
			// (In office mode, can be used to configure Tika - see below)
		"XmlIgnoreValues" : [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level.
			// (Also reused with completely different meaning for CSV)
		"XmlSourceName" : "string", // If present, and a primary key specified below is also found then the URL gets built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.
		"XmlPrimaryKey" : "string", // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
		"XmlPreserveCase": boolean, // default false, converts everything to lower case
		"XmlAttributePrefix": "string", // default: null - if enabled, attributes are converted into tags with this prefix
	}
}
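As a rough illustration of how the "pathInclude", "pathExclude" and "renameAfterParse" fields above could behave, the following sketch uses two hypothetical helpers (shouldProcess and renameTarget are not part of the platform). It assumes that "complete paths matching" means the regular expression must match the whole path, and that "$path" expands to the containing directory and "$name" to the file name; the real harvester may differ on both points.

```javascript
// Hypothetical helpers for illustration; not the harvester's actual implementation.

// Decide whether a file path passes the pathInclude/pathExclude filters.
// Assumption: the regex has to match the complete path, not just part of it.
function shouldProcess(path, fileConfig) {
  const fullMatch = (pattern) => new RegExp("^(?:" + pattern + ")$").test(path);
  if (fileConfig.pathExclude && fullMatch(fileConfig.pathExclude)) return false;
  if (fileConfig.pathInclude && !fullMatch(fileConfig.pathInclude)) return false;
  return true;
}

// Expand a renameAfterParse template for an ingested file.
// Returns null for "" or ".", which the format above says deletes the file.
// Assumption: "$path" is the directory part and "$name" the last path segment.
function renameTarget(template, path) {
  if (template === "" || template === ".") return null;
  const slash = path.lastIndexOf("/");
  const dir = path.substring(0, slash);
  const name = path.substring(slash + 1);
  return template.split("$path").join(dir).split("$name").join(name);
}
```

With the template "$path/processed/$name", a file at "smb://host/logs/a.csv" would be moved to "smb://host/logs/processed/a.csv" under these assumptions.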

Description

The file extractor ingests various file types from their locations and performs processing based on the configuration.

The File Extractor is capable of ingesting files from the following locations:

Windows/Samba shares
harvester's local filesystem
Amazon S3

The File Extractor supports the following file types:

Office documents (Word, PowerPoint etc.)
text-based documents (emails)
CSV
XML and JSON
Infinit.e shares
The results of Infinit.e plugins

The following table describes the parameters of the file extractor configuration.

Field Description Data Type

username

Username for file share authentication

string

password

Password for file share authentication

string

domain

Domain location of the file share

string

pathInclude

Optional - regex, only files with complete paths matching the regular expression are processed further

string

pathExclude

Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)

string

renameAfterParse

Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported (eg "$path/processed/$name"); "" or "." deletes the file instead

string

type

One of "json", "xml", "tika", "*sv", or null to auto decide

string

mode

"normal" (defaults if mode not present), "streaming", see below

"mode" (from v0.3) is only applied in JSON/XML/*sv modes

In "normal" mode: any time a file containing records is modified then all already-imported records from that file are deleted/updated

In "streaming" mode: the enclosing file of the records is ignored

Warning
One use case that is not well handled by the current file harvester is ingesting log files that are continuously being written to (as opposed to streamed into a succession of smaller files). The script here provides a sample workaround for that sort of issue.

string

XmlRootLevelValues

The root level value of XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered (if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects, and a document is created from each one). Also reused with a completely different meaning for CSV (see below). In office mode, can be used to configure Tika (see below).

Info

If any of the root level (or ignore) values contain a prefix (eg "stix:IpAddress"), then prefixes are used to compare against all fields.

Eg:

"<test:IpAddress></test:IpAddress>" will match "XmlRootLevelValues": [ "IpAddress" ] but not [ "other:IpAddress" ]
"<test:EmailAddress></test:EmailAddress>" will match [ "EmailAddress" ] but not [ "EmailAddress", "other:IpAddress" ]
- (because specifying "other:IpAddress" means that it will be in "prefix mode" - you'd then need to specify "test:EmailAddress" instead of "EmailAddress")

Note also that (currently) the resulting JSON does not include the prefix, even in "prefix mode", eg "<test:Attribute>blah</test:Attribute>" will map to "Attribute": "blah".

string

XmlIgnoreValues

XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. (Also reused with completely different meaning for CSV - see below)

(Note that if the ignore value contains a prefix, then all XML processing is performed in "prefix mode" - see above under "XmlRootLevelValues")

string

XmlSourceName

If present, and a primary key specified below is also found then the URL gets built as XmlSourceName + xml[XmlPrimaryKey], Also supported for JSON and CSV.

string

XmlPrimaryKey

This key is used to build the URL as described above. Also supported for JSON and CSV.

string

XmlPreserveCase

By default (false), XML field names are converted to lower case; if this is true, then the case of the field names is preserved

boolean

XmlAttributePrefix

For "*sv" files, when XmlRootLevelValues is set, this controls the separators as follows: the first char in the string is the separator, the (optional) second char in the string is the quote, and the (optional) third char in the string is the escape character (eg the default is ",\"\\")

For XML only, this string is prepended to XML attributes before they become JSON fields (otherwise attributes are ignored).

Eg <test id="alex"></test> would map to '{ "test": { "PREFIXid": "alex" } }' with XmlAttributePrefix: "PREFIX", or just an empty object if not specified (the default)

string
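The "prefix mode" rule described under XmlRootLevelValues can be easier to follow as code. The sketch below is only an illustration of that rule as worded in the table; matchesRootLevel is a hypothetical name, not a platform function, and the real matching logic may handle more cases.

```javascript
// Hypothetical illustration of the "prefix mode" comparison; not platform code.
// tagName is the XML element name as written in the file, eg "test:IpAddress".
function matchesRootLevel(tagName, rootLevelValues, ignoreValues = []) {
  // Prefix mode is on as soon as any root level (or ignore) value contains a prefix.
  const prefixMode = rootLevelValues
    .concat(ignoreValues)
    .some((value) => value.includes(":"));
  // Outside prefix mode, the element's own prefix is dropped before comparing.
  const candidate = prefixMode
    ? tagName
    : tagName.substring(tagName.indexOf(":") + 1);
  return rootLevelValues.includes(candidate);
}
```

This reproduces the examples in the table: "test:IpAddress" matches [ "IpAddress" ] but not [ "other:IpAddress" ], and "test:EmailAddress" stops matching [ "EmailAddress", "other:IpAddress" ] because the second value switches on prefix mode.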

Connecting to File Locations

The configuration will depend on the locations of the files you are trying to extract.

Local Filesystem

To connect to the File Extractor's local filesystem, the following URL format must be used:

"file://<path including leading '/'>"

Info

"file://" sources can only be run by administrators if secure mode is enabled (harvest.secure_mode=true in the configuration).

Local filesystem usage is mostly intended for testing, debugging, and "micro installations". The "tomcat" user must have read access to the directories and files on the path.

Infinit.e

You can connect the File extractor to Infinit.e shares and the results of custom jobs.

Infinit.e Shares

To connect to an Infinit.e share, the following url format must be used:

"inf://share/<shareid>/<ignored>"

The share ID can be obtained from the URL of the file uploader.

After the "<shareid>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.

The source must share at least one community with the share in order to be processed.

Warning
The uploaded share must be a zip file containing the JSON/XML/CSV files to be imported.

Infinit.e Jobs

To connect to an Infinit.e custom job, the following url format must be used:

"inf://custom/<customid-or-jobtitle>"

The custom ID and job title can be obtained from the URL field of the plugin manager.

After the "<customid-or-jobtitle>/" portion of the URL, any arbitrary text can be added to make the role of the custom job clearer. This text is ignored by the file harvester.

The source must share at least one community with the custom plugin in order to be processed.

Windows/Samba

To connect to a Windows/Samba share, the following url format must be used:

"smb://server:port/path"

Amazon S3

To connect to an Amazon S3 location, the following url format must be used:

"s3://<bucket_name>/" or "s3://<bucket_name>/path/"

The files in the S3 bucket should be readable by the account specified by the access key.

Info

S3 is not supported by default; the AWS SDK JAR must be copied into the classpath as described here.

The S3 harvester will only read 1000 docs per directory. To handle larger numbers of documents, they should either be split into smaller subdirectories, or "renameAfterParse" should be used to delete/move them after processing.

Username/Password

A username/password is required to connect to your Amazon S3 environment.

For S3, the Access ID should be entered into the "username", and the Secret Key into the "password"

Info
It is recommended for security that you create a separate AWS user with no permissions other than S3 read/list on the directories.
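As a sketch of how the S3 settings above fit together, a "file" object might look like the following. All values are placeholders (the bucket name, path and credentials are invented for illustration), and the "url" field is placed inside "file" as in the source examples later on this page.

```javascript
// Illustrative values only: bucket, path and credentials are placeholders.
const s3FileConfig = {
  file: {
    username: "AWS_ACCESS_ID",   // the Access ID goes into "username"
    password: "AWS_SECRET_KEY",  // the Secret Key goes into "password"
    domain: "",                  // can be left blank for S3
    url: "s3://my-example-bucket/incoming/"
  }
};
```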

Domain

This field can be left blank for Amazon S3 environments.
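The URL formats listed in this section can be summarized with a small classifier. This is a minimal sketch for checking a source URL against those formats before saving a source; classifySourceUrl and the shape of its return value are assumptions made for illustration, not part of the platform.

```javascript
// Hypothetical helper: maps a source URL onto the location types described above.
// Returns null when the URL matches none of the documented formats.
function classifySourceUrl(url) {
  let m;
  // "file://<path including leading '/'>"
  if ((m = /^file:\/\/(\/.*)$/.exec(url))) {
    return { location: "local", path: m[1] };
  }
  // "inf://share/<shareid>/<ignored>"
  if ((m = /^inf:\/\/share\/([^\/]+)(?:\/.*)?$/.exec(url))) {
    return { location: "share", id: m[1] };
  }
  // "inf://custom/<customid-or-jobtitle>"
  if ((m = /^inf:\/\/custom\/([^\/]+)(?:\/.*)?$/.exec(url))) {
    return { location: "custom", idOrTitle: m[1] };
  }
  // "smb://server:port/path" (the port is treated as optional in this sketch)
  if ((m = /^smb:\/\/([^\/:]+)(?::(\d+))?(\/.*)?$/.exec(url))) {
    return { location: "samba", server: m[1], port: m[2] ? Number(m[2]) : null, path: m[3] || "/" };
  }
  // "s3://<bucket_name>/" or "s3://<bucket_name>/path/"
  if ((m = /^s3:\/\/([^\/]+)(\/.*)?$/.exec(url))) {
    return { location: "s3", bucket: m[1], path: m[2] || "/" };
  }
  return null;
}
```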

File Types

This section describes the configurations for the various supported file types.

Office Files

To connect to Office files, the following url format must be used:

the path of the file (ie file.url + path-relative-to-url)

Example:

Connects to an "office" document on a samba drive.

"smb://modus:139/enron/enron_mail_20110402/maildir/"

Configuring Apache Tika

You can use the XmlRootLevelValues parameter to configure Apache Tika for parsing of Office-type files.

XmlRootLevelValues will accept a string value which can be used to pass configuration values to Apache's Tika module. Currently the following configurations are supported:

Configuring Tika Output Format

You can include the string "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.

Configuring Tika Elements

You can configure Tika Elements by using paramName and paramValue to send functions and arguments to Tika.

The string must be in one of the following formats:

"bypass:<MEDIATYPE>"
- (will just return the raw text for the corresponding mediatype; binary is not currently supported. Eg "bypass:message/rfc822" will just return the raw text from an email)
"<MEDIATYPE>:{ paramName: paramValue, ...}"

<MEDIATYPE> is in standard MIME format and determines which Tika element to configure. The paramNames and paramValues correspond to functions and arguments.

Example:

"application/pdf:{'setEnableAutoSpace':false}"

Here "application/pdf" selects PDFParser, and the paramName/paramValue pair results in a call to PDFParser.setEnableAutoSpace(false).
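Putting the Tika options together, a "file" object for Office-type documents might look like the sketch below. Combining an output directive and an element directive in the same XmlRootLevelValues array is an assumption, as is setting "type" to "tika" explicitly rather than leaving it to auto-detection; credentials are omitted for brevity.

```javascript
// Illustrative sketch only; whether several directives can be combined is an assumption.
const tikaFileConfig = {
  file: {
    type: "tika",
    XmlRootLevelValues: [
      "output:xml",                                   // XML instead of raw text
      "application/pdf:{'setEnableAutoSpace':false}"  // PDFParser.setEnableAutoSpace(false)
    ],
    url: "smb://modus:139/enron/enron_mail_20110402/maildir/"
  }
};
```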

JSON/XML/CSV

To connect to these file types, the following url format must be used:

path-of-file (as above) + <hash of object> + ".csv"/".json"/".xml"

Example:

To connect to a samba fileshare

"url": "smb://FILESHARE:139/cyber_logs/"

If XmlSourceName and XmlPrimaryKey are specified, the following URL format must be used:

XmlSourceName + object.get(XmlPrimaryKey)

Example:

"url": "smb://HOST:139/SHARE/PATH/TO/"

Configuring JSON

You can use the file extractor to configure the root JSON object for parsing.

In the example below, the parameter XmlRootLevelValues is used to set the root object.

In addition, you can use XmlSourceName to build the document URL. If specified, the document URL is built as "XmlSourceName" + xml("XmlPrimaryKey").

You can use the parameter XmlPrimaryKey to help identify whether a record is new or previously harvested.

Info
For JSON files, the parameter `XmlIgnoreValues` is not applicable.

Code Block

"description": "A large set of tweets related to Super Storm Sandy",
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Social",
    "tags": [
        "twitter",
        "gnip"
    ],
    "title": "Super Storm Sandy - Twitter: SANDY_SUBSTRING",
    "processingPipeline": [
        {
            "file": {
                "XmlPrimaryKey": "link",
                "XmlSourceName": "",
                "XmlRootLevelValues": [],
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://HOST:139/SHARE/PATH/TO/"
            }
        },

Configuring XML

You can use XmlRootLevelValues to set the root object for xml file parsing.

In the example below, the field "Incident" is set as the root object.

In addition, the parameter XmlIgnoreValues is used to ignore certain xml nodes in the xml document.

XmlPrimaryKey identifies the primary key in the data set, and is used to help identify whether a record is new or previously harvested.

XmlSourceName is used to build the URL of the document that will be generated by the file extraction.

Code Block

{
    "description": "wits test",
    "isPublic": true,
    "mediaType": "Report",
    "searchCycle_secs": -1,
    "tags": [
        "incidents",
        "nctc",
        "terrorism",
        "wits",
        "events",
        "worldwide"
    ],
    "title": "wits test",
    "processingPipeline": [
        {
            "file": {
                "XmlIgnoreValues": [
                    "DefiningCharacteristicList",
                    "TargetedCharacteristicList",
                    "WeaponTypeList",
                    "PerpetratorList",
                    "VictimList",
                    "EventTypeList",
                    "CityStateProvinceList",
                    "FacilityList"
                ],
                "XmlPrimaryKey": "icn",
                "XmlRootLevelValues": [
                    "Incident"
                ],
                "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://modus:139/wits/allfiles/"
            }
        },
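The way XmlSourceName and XmlPrimaryKey combine into a document URL can be sketched as follows. buildDocumentUrl is a hypothetical helper, the record values in the usage note are invented, and returning null when the key is missing is an assumption standing in for whatever fallback the harvester applies. The sketch also accepts dot notation in XmlPrimaryKey for nested JSON fields.

```javascript
// Hypothetical helper: document URL = XmlSourceName + record[XmlPrimaryKey].
// Returns null when no primary key is configured or the record lacks the field.
function buildDocumentUrl(record, fileConfig) {
  if (!fileConfig.XmlPrimaryKey) return null;
  // Walk dot-separated keys so that nested JSON fields can be addressed.
  let value = record;
  for (const key of fileConfig.XmlPrimaryKey.split(".")) {
    if (value === null || typeof value !== "object" || !(key in value)) return null;
    value = value[key];
  }
  return (fileConfig.XmlSourceName || "") + String(value);
}
```

For instance, with XmlSourceName set to "http://www.website.com?pageNum=" and XmlPrimaryKey set to "page.id", a record { "page": { "id": 3454354 } } would yield "http://www.website.com?pageNum=3454354".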

Configuring SV

Files of separated values (SV) are parsed using the configuration object, and the header row and fields can be either manually or automatically specified.

Specifying Header Fields Manually

You can use XmlRootLevelValues to manually specify the header fields that will be mapped onto the metadata object "csv". Field names will correspond to the values of this array.

In the source example below, the field names will correspond to the included array: "device","date", "srcIP" etc.

Code Block

    "processingPipeline": [
        {
            "file": {
                "XmlRootLevelValues": [
                    "device",
                    "date",
                    "srcIP",
                    "dstIP",
                    "alert",
                    "country"
                ],
                "XmlIgnoreValues": [
                    "device,date,srcIP"
                ],
                "domain": "DOMAIN",
                "password": "PASSWORD",
                "type": "csv",
                "username": "USER",
                "url": "smb://FILESHARE:139/cyber_logs/"
            }
        },
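To make the manual header mapping concrete, the sketch below applies the configuration above to individual lines. parseSvLine is a hypothetical helper; it splits on a plain separator and ignores quoting and escaping, so it is a simplification of what a real SV parser would do.

```javascript
// Hypothetical, simplified SV line handling (no quote or escape processing).
// Returns null for lines that start with one of the XmlIgnoreValues strings.
function parseSvLine(line, fileConfig, separator = ",") {
  const ignore = fileConfig.XmlIgnoreValues || [];
  if (ignore.some((prefix) => line.startsWith(prefix))) return null;
  const values = line.split(separator);
  const csv = {};
  // The i-th value is named after XmlRootLevelValues[i].
  fileConfig.XmlRootLevelValues.forEach((name, i) => {
    if (i < values.length) csv[name] = values[i];
  });
  return { csv };
}
```

With the six field names from the example, a line such as "SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States" would produce a "csv" object whose "device" is "SCANNER_1" and whose "country" is "United States", while a header line starting with "device,date,srcIP" would be skipped.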

Specifying Header Fields Automatically

You can also use the configuration object to instruct the system to parse header fields automatically.

If you take this approach, it is necessary to both identify the header row and to specify how the field names will be identified.

Header Prefixed by String:

Consider the following device log file, where the header row is prefixed with "#".

Code Block
#DeviceName,Date,Time,SrcIP,dstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States

In this case you can configure XmlIgnoreValues as follows: [ "#" ], and this will instruct the system to identify the header row. By default, the field names are identified using the separator ",".

Code Block
"processingPipeline": [
    {
        "file": {
            "XmlIgnoreValues": [
                "#"
            ],
            "domain": "DOMAIN",
            "password": "PASSWORD",
            "type": "csv",
            "username": "USER",
            "url": "smb://FILESHARE:139/cyber_logs/"
        }
    },

If XmlIgnoreValues is set to: "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.

By default, the matching portion of the line (eg "#" in the example above) is removed.

To keep it instead, simply place the value in quotes (using the specified quote character).

e.g. assuming the quote character is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3".

Header Not Prefixed by String:

Consider the log file formatted as follows:

Code Block
"DeviceName","Date","Time","SrcIP","dstIP","Alert","Country"
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States

In this case, XmlIgnoreValues should be set to the following: [ "\"Device\"" ]

ie. the first header wrapped in whatever the quote character is, " by default.

The leading quote tells the platform not to strip away the remainder of the string, so the columns will be named DeviceName, Date, etc

Now consider the log file formatted as follows:

Code Block
DeviceName,Date,Time,SrcIP,dstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States

In this case, XmlIgnoreValues should be set to the following: [ "\"Device" ]

ie. the first header starting with whatever the quote character is, " by default - but not ending with a quote,

eg. "\"field1,f" would be equivalent ... but "\"field1\"" would not work.

This acts as a special instruction to the platform not to strip the remainder of the string, so the columns will be named DeviceName, Date, etc.

By contrast, "field1,f" would return the columns as ield2,field3, etc.

Separator, Quote, Escape

For SV files, XmlAttributePrefix sets the separator, quote and escape characters, in that order. The default is ",\"\\" (separator ",", quote ", escape \).

You can change the value of XmlAttributePrefix to override any of these defaults.

For example, ";'\\" changes the separator from "," to ";" and the quote character to ', keeping \ as the escape character.
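The positional meaning of the XmlAttributePrefix string for SV files can be sketched as below. parseSvDelimiters is a hypothetical helper, and falling back to the default character when an optional position is omitted is an assumption based on the quote and escape characters being described as optional.

```javascript
// Hypothetical helper: split an XmlAttributePrefix string into its three roles.
// Position 0 = separator, 1 = quote (optional), 2 = escape (optional).
function parseSvDelimiters(xmlAttributePrefix) {
  const defaults = ",\"\\"; // default separator, quote and escape
  const s = xmlAttributePrefix || "";
  return {
    separator: s.length > 0 ? s[0] : defaults[0],
    quote: s.length > 1 ? s[1] : defaults[1],   // assumption: default if omitted
    escape: s.length > 2 ? s[2] : defaults[2]   // assumption: default if omitted
  };
}
```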

Specifying Network Resources

For CSV*, JSON and XML files, XmlSourceName and XmlPrimaryKey can be used to identify the unique network resource that each document links to.

The document's unique network resource must be in the following format: "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" must be one of the fields in the XML/JSON object. "XmlSourceName" and "XmlPrimaryKey" then specify the two components. For JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.

*XmlRootLevelValues must be set

Info

If it is not possible to specify the URL in this manner, it is recommended to use "Document metadata" downstream to set the network location using the element displayUrl.

This method can be leveraged if there is a single, not necessarily unique, URI that is related to the document

...

Panel

Legacy documentation:

File object

Using the File Harvester

TODO
