
Overview

The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (eg uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

This page is broken down into the following sections for ease of navigation.

Format

{
	"display": string,
	"file":
	{
 		"username" : "string", // Username for file share authentication,
 		"password" : "string", // Password for file share authentication,
 		"domain" : "string", // Domain location of the file share, 
 
 		"pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
 		"pathExclude": "string', // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
 		"renameAfterParse" "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" deletes the file
		 	// (eg "$path/processed/$name")
 
 		"type": "string", // One of "json", "xml", "tika", "*sv", or null to auto decide
		"mode": "string", // "normal" (defaults if mode not present), "streaming", see below 
 		"XmlRootLevelValues" : [ "string" ], // The root level value of XML to which parsing should begin 
 			// also currently used as an optional field for JSON, if present will create a document each time that field is encountered
 			// (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
 			// (Also reused with completely different meaning for CSV - see below)
			// (In office mode, can be used to configure Tika - see below)
 		"XmlIgnoreValues" : [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. 
 		// (Also reused with completely different meaning for CSV)
 		"XmlSourceName" : "string", // If present, and a primary key specified below is also found then the URL gets built as XmlSourceName + xml[XmlPrimaryKey], Also supported for JSON and CSV.
 		"XmlPrimaryKey" : "string", // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
	}
}

 

Description

The file extractor ingests various file types from their locations and performs processing based on the configuration.

The following fields control file selection and post-processing:

  • "pathInclude": optional regex; only files whose complete paths match the regular expression are processed further
  • "pathExclude": optional regex; files whose complete paths match the regular expression are ignored, and matching directories are not traversed
  • "renameAfterParse": optional; renames files after they have been ingested (the substitution variables "$name" and "$path" are supported, eg "$path/processed/$name"), or "" deletes the file
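As an illustration, a hypothetical "file" block (the fileshare URL and regexes below are placeholders) that ingests only .xml files, skips any "tmp" directory, and moves files once parsed might look like:

```json
{
    "file": {
        "username": "XXX",
        "password": "XXX",
        "domain": "XXX",
        "url": "smb://fileserver:139/data/",
        "pathInclude": ".*\\.xml$",
        "pathExclude": ".*/tmp/.*",
        "renameAfterParse": "$path/processed/$name"
    }
}
```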

  • "mode" (from v0.3) is only applied in JSON/XML/*sv modes
    • In "normal" mode: any time a file containing records is modified then all already-imported records from that file are deleted/updated
    • In "streaming" mode: the enclosing file of the records is ignored

Locations

The File Extractor is capable of ingesting files from the following locations:

  • Windows/Samba shares
  • harvester's local filesystem
  • Amazon S3


File Types

The File Extractor supports the following file types

  • Office documents (Word, Powerpoint etc.)
  • text-based documents (emails)
  • CSV
  • XML and JSON
  • Infinit.e shares
  • The results of Infinit.e plugins


Connecting to File Locations

The configuration will depend on the locations of the files you are trying to extract.

Local Filesystem

To connect to the harvester's local filesystem, the following URL format must be used:

"file://<path including leading '/'>"

"file://" sources can only be run by administrators if secure mode is enabled (harvest.secure_mode=true in the configuration).

Local filesystem usage is mostly intended for testing, debugging, and "micro installations". The "tomcat" user must have read access to the directories and files on the path.
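For example, a minimal local-filesystem source block (the path below is purely illustrative) might be:

```json
{
    "file": {
        "type": "json",
        "url": "file:///opt/infinite/data/feeds/"
    }
}
```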

 


Infinit.e

You can connect the File extractor to Infinit.e shares and the results of custom jobs.

Infinit.e Shares

To connect to an Infinit.e share, the following url format must be used:

"inf://share/<shareid>/<ignored>"

The share id can be obtained in the url of the file uploader.

After the "<shareid>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.

The source must share at least one community with the share in order to be processed.
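For example, a source block reading an uploaded share might look like the following (the share ID and trailing descriptive text are purely illustrative):

```json
{
    "file": {
        "type": "xml",
        "url": "inf://share/4f9a2c3b1e5d8a0001234567/monthly-report-upload/"
    }
}
```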

Infinit.e Jobs

To connect to an Infinit.e custom job, the following url format must be used:

"inf://custom/<customid-or-jobtitle>"

The custom ID or job title can be obtained from the URL field of the plugin manager.

After the "<customid-or-jobtitle>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.

The source must share at least one community with the custom plugin in order to be processed.

 


Windows/Samba

To connect to a Windows/Samba share, the following url format must be used:

"smb://server:port/path"



Amazon S3

To connect to an Amazon S3 location, the following url format must be used:

"s3://bucket_name>/" or "s3://<bucket_name>/path/"

The files in the S3 bucket should be readable by the account specified by the access key.

S3 is not supported by default; the AWS SDK JAR must first be copied onto the classpath.

Username/Password

A username/password is required to connect to your Amazon S3 environment.

For S3, the Access Key ID should be entered into "username", and the Secret Key into "password".

It is recommended for security that you create a separate AWS user with no permissions other than S3 read/list on the directories.
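Putting that together, a hypothetical S3 source block might be (the bucket name is a placeholder, and the access key shown is AWS's documented example value):

```json
{
    "file": {
        "type": "*sv",
        "url": "s3://my-example-bucket/logs/",
        "username": "AKIAIOSFODNN7EXAMPLE",
        "password": "<secret key>",
        "domain": ""
    }
}
```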

Domain

This field can be left blank for Amazon S3 environments.

 


File Types

This section describes the configurations for the various supported file types.

XML Files

The following code snippet illustrates the use of the file extractor parameters on XML files. The sample configuration acts on incident reports.

{
    "description": "wits test",
    "isPublic": true,
    "mediaType": "Report",
    "searchCycle_secs": -1,
    "tags": [
        "incidents",
        "nctc",
        "terrorism",
        "wits",
        "events",
        "worldwide"
    ],
    "title": "wits test",
    "processingPipeline": [
        {
            "file": {
                "XmlIgnoreValues": [
                    "DefiningCharacteristicList",
                    "TargetedCharacteristicList",
                    "WeaponTypeList",
                    "PerpetratorList",
                    "VictimList",
                    "EventTypeList",
                    "CityStateProvinceList",
                    "FacilityList"
                ],
                "XmlPrimaryKey": "icn",
                "XmlRootLevelValues": [
                    "Incident"
                ],
                "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://modus:139/wits/allfiles/"
            }
        }
    ]
}

 

In the example, the parameter XmlIgnoreValues is used to ignore certain nodes in the XML document.

Similarly, XmlRootLevelValues is used to specify the root-level XML node at which parsing should begin.

XmlPrimaryKey identifies the primary key in the data set, and is used to help identify whether a record is new or previously harvested.

XmlSourceName is used to build the URL of the new document that will be generated by the file extraction.

*sv Files

XmlRootLevelValues, which for XML determines the root-level field at which parsing should begin, is reused for "*sv" files to name the columns.

For "*sv" files, setting it results in CSV parsing occurring automatically, and the records are mapped into a metadata object called "csv", with the fieldnames corresponding to the values of this array (eg the 3rd value is named after XmlRootLevelValues[2], etc).

The fieldnames can also be derived automatically by setting XmlIgnoreValues, in which case XmlRootLevelValues need not be set.

 

For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.

  • In addition, the first line matching an ignore value field that consists of the more than 1 token-separated field will be used to generate the fieldnames.
    • eg if "XmlIgnoreValues": "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.
    • By default, the matching portion of the line (eg "#" in the example above) is removed. To not remove it then simple place the value in quotes (using the specified quote char)
      • eg assuming the quote char is ', then "`#`" in the above example would return 3 fields: "#field1", "field2" and "field3"
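For illustration, the header-derivation case described above corresponds to a "file" block like the following (the fileshare URL is a placeholder):

```json
{
    "file": {
        "type": "*sv",
        "XmlIgnoreValues": [ "#" ],
        "url": "smb://fileserver:139/logs/"
    }
}
```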

For "*sv" files you can use the XmlSourceName parameter to build the document URL.

 

Note that XmlRootLevelValues must be set in order to build the document URL in this way.

 


Office Files 

You can use the XmlRootLevelValues parameter to configure Apache Tika for parsing of Office-type files.

There are currently two types of configuration supported:

  • "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.
  • Strings of the format "MEDIATYPE:{ paramName: paramValue, ...}" - <MEDIATYPE> is in standard MIME format and determines which Tika element to configure; the paramNames and paramValues correspond to functions and arguments.

Example: "application/pdf:{'setEnableAutoSpace':false}" will call PDFParser.setEnableAutoSpace(false)



JSON/XML

The following code sample is used to parse a large selection of tweets using the file extractor.

"description": "A large set of tweets related to Super Storm Sandy",
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Social",
    "tags": [
        "twitter",
        "gnip"
    ],
    "title": "Super Storm Sandy - Twitter: SANDY_SUBSTRING",
    "processingPipeline": [
        {
            "file": {
                "XmlPrimaryKey": "link",
                "XmlSourceName": "",
                "XmlRootLevelValues": [],
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://HOST:139/SHARE/PATH/TO/"
            }
        }
    ]
}

 

 

For JSON files the parameter XmlIgnoreValues is not applicable.

You can use XmlSourceName to build the document URL. If specified, the document URL is built as XmlSourceName + xml[XmlPrimaryKey].

You can use the parameter XmlPrimaryKey to help identify whether a record is new or previously harvested.

 

For XML and JSON files where the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. For JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.
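As a sketch, for the "pageNum" example above the two components would be specified as follows (a nested JSON field could instead be addressed with dot notation, eg a hypothetical "metadata.pageNum"):

```json
{
    "file": {
        "type": "json",
        "XmlSourceName": "http://www.website.com?pageNum=",
        "XmlPrimaryKey": "pageNum"
    }
}
```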

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

 


CSV Files

In the following sample code, the file extractor is configured to act on .csv content.

{
    "description": "For cyber demo",
    "isPublic": false,
    "mediaType": "Log",
    "searchCycle_secs": 3600,
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "Cyber Logs Test",
    "processingPipeline": [
        {
            "file": {
                "XmlRootLevelValues": [],
                "domain": "DOMAIN",
                "password": "PASSWORD",
                "type": "csv",
                "username": "USER",
                "url": "smb://FILESHARE:139/cyber_logs/"
            }
        }
    ]
}

 

 

For "*csv" files where XmlRootLevelValues is set), Where the document(s) within the file references a unique network resource that is of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, then "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

