...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
This page has been broken down into the following sections for ease of localization.
Format
Code Block |
---|
{
"display": string,
"file":
{
"username" : "string", // Username for file share authentication,
"password" : "string", // Password for file share authentication,
"domain" : "string", // Domain location of the file share,
"pathInclude": "string", // Optional - regex, only files with complete paths matching the regular expression are processed further
"pathExclude": "string', // Optional - regex, files with complete paths matching the regular expression are ignored (and matching directories are not traversed)
"renameAfterParse" "string", // Optional, renames files after they have been ingested - the substitution variables "$name" and "$path" are supported; or "" or "." deletes the file
// (eg "$path/processed/$name")
"type": "string", // One of "json", "xml", "tika", "*sv", or null to auto decide
"mode": "string", // "normal" (defaults if mode not present), "streaming", see below
"XmlRootLevelValues" : [ "string" ], // The root level value of XML to which parsing should begin
// also currently used as an optional field for JSON, if present will create a document each time that field is encountered
// (if left blank for JSON, assumes the file consists of a list of concatenated JSON objects and creates a document from each one)
// (Also reused with completely different meaning for CSV - see below)
// (In office mode, can be used to configure Tika - see below)
"XmlIgnoreValues" : [ "string" ], // XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level.
// (Also reused with completely different meaning for CSV)
"XmlSourceName" : "string", // If present, and a primary key specified below is also found then the URL gets built as XmlSourceName + xml[XmlPrimaryKey], Also supported for JSON and CSV.
"XmlPrimaryKey" : "string", // Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
|
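As an illustration, the authentication and filtering fields above can be combined as follows (the share details and regular expressions are hypothetical):

```json
{
    "file": {
        "username": "harvester",
        "password": "XXX",
        "domain": "EXAMPLE",
        "url": "smb://FILESHARE:139/logs/",
        "pathInclude": ".*\\.csv$",
        "pathExclude": ".*/archive/.*",
        "renameAfterParse": "$path/processed/$name"
    }
}
```

With this sketch, only files whose complete path ends in ".csv" and does not pass through an "archive" directory are processed, and each file is moved into a "processed" subdirectory of its original location after ingest.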
...
...
XML values that, when parsed, will be ignored - child elements will still be part of the document metadata, just promoted to the parent level. // (Also reused with completely different meaning for CSV)
...
If present, and a primary key specified below is also found, then the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV.
...
Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV.
...
- For "*sv" files when XmlRootLevelValues is set controls the separators as follows: the first char in the string is the separator, the (optional) second char in the string is the quote, and the (optional) third char in the string is the escape character (eg the default is ",\"\\")
For XML only, this string is pre-pended to XML attributes before they become JSON fields.
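For example, assuming the separator string is supplied as described above, a source for semicolon-separated files (with hypothetical field names) might look like:

```json
{
    "file": {
        "type": "*sv",
        "XmlRootLevelValues": [ "device", "date", "srcIP" ],
        "XmlAttributePrefix": ";\"\\"
    }
}
```

Here the JSON string ";\"\\" decodes to the three characters `;`, `"`, and `\`: a semicolon separator, a double-quote quote character, and a backslash escape character.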
...
Connecting to File Locations
The configuration will depend on the locations of the files you are trying to extract.
Local Filesystem
To connect to the text extractor's local filesystem the following url format must be used:
"file://<path including leading '/'>"
Info |
---|
"file://" sources can only be run by administrators if secure mode is enabled (harvest.secure_mode=true in the configuration). Local filesystem usage is mostly intended for testing, debugging, and "micro installations". The "tomcat" user must have read access to the directories and files on the path. |
Infinit.e
You can connect the File extractor to Infinit.e shares and the results of custom jobs.
Infinit.e Shares
To connect to an Infinit.e share, the following url format must be used:
"inf://share/<shareid>/<ignored>"
The share id can be obtained in the url of the file uploader.
After the "<shareid>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.
The source must share at least one community with the share in order to be processed.
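For example, a share source URL might look like the following (the share id is a placeholder; the trailing "my_tweets.json" is arbitrary descriptive text that the harvester ignores):

```json
{
    "file": {
        "url": "inf://share/4f64a23ba8724baa2e3cbc12/my_tweets.json"
    }
}
```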
Infinit.e Jobs
To connect to an Infinit.e custom job, the following url format must be used:
"inf://custom/<customid-or-jobtitle>"
The custom id or job title can be obtained from the URL field of the plugin manager.
After the "<customid-or-jobtitle>/" portion of the URL, any arbitrary text can be added to make the role of the share clearer. This text is ignored by the file harvester.
The source must share at least one community with the custom plugin in order to be processed.
Windows/Samba
...
"XmlPreserveCase": boolean, // default false, converts everything to lower case
"XmlAttributePrefix": "string", // default: null - if enabled, attributes are converted into tags with this prefix
}
} |
...
...
...
Amazon S3
To connect to an Amazon S3 location, the following url format must be used:
...
...
The files in the S3 bucket should be readable by the account specified by the access key.
Info |
---|
S3 is not supported by default; the AWS SDK JAR must be copied into the classpath as described here. |
Username/Password
A username/password is required to connect to your Amazon S3 environment.
For S3, the Access ID should be entered into the "username", and the Secret Key into the "password".
Info |
---|
It is recommended for security that you create a separate AWS user with no permissions other than S3 read/list on the directories. |
Domain
This field can be left blank for Amazon S3 environments.
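Putting the three fields together, an S3 source block might look like the following sketch (the bucket URL and credentials are placeholders; note the Access ID goes in "username" and the Secret Key in "password"):

```json
{
    "file": {
        "username": "AKIAIOSFODNN7EXAMPLE",
        "password": "SECRET_KEY",
        "domain": "",
        "url": "s3://my-bucket/path/to/logs/"
    }
}
```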
File Types
This section describes the configurations for the various supported file types.
Office Files
...
- the path of the file (ie file.url + path-relative-to-url)
Example:
Connects to an "office" document on a samba drive.
"smb://modus:139/enron/enron_mail_20110402/maildir/"
Configuring Apache Tika
You can use the XmlRootLevelValues
parameter to configure Apache Tika for parsing of Office-type files.
XmlRootLevelValues accepts string values which can be used to pass configuration to Apache's Tika module. The following configurations are currently supported:
Configuring Tika Output Format
You can include the string "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.
Configuring Tika Elements
You can configure Tika elements by using paramName and paramValue to send functions and arguments to Tika.
The string must be in the following format
...
<MEDIATYPE> is in standard MIME format and determines which Tika element to configure; the paramNames and paramValues correspond to functions and arguments.
Example:
...
where application/pdf selects the PDFParser, and setEnableAutoSpace(false) corresponds to a paramName and paramValue.
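For example, using the documented "output:xml" option (the media-type/paramName syntax described above would be supplied as additional array entries), a source that makes Tika emit XML instead of raw text might look like:

```json
{
    "file": {
        "type": "tika",
        "XmlRootLevelValues": [ "output:xml" ]
    }
}
```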
JSON/XML/CSV
To connect to these file types, the following url format must be used:
- path-of-file (as above) + <hash of object> + ".csv"/".json"/".xml"
Example:
To connect to a samba fileshare
"url": "smb://FILESHARE:139/cyber_logs/"
If XmlSourceName and XmlPrimaryKey are specified, the following url format must be used
- XmlSourceName + object.get(XmlPrimaryKey)
Example:
"url": "smb://HOST:139/SHARE/PATH/TO/"
Configuring JSON
You can use the file extractor to configure the root JSON object for parsing.
In the example below, the parameter XmlRootLevelValues
is used to set the root object.
In addition, you can use XmlSourceName
to build the document URL. If specified, the document URL is built as XmlSourceName + xml[XmlPrimaryKey].
You can use the parameter XmlPrimaryKey
to help identify whether a record is new or previously harvested.
Info |
---|
For JSON files, the parameter XmlIgnoreValues is not applicable. |
Code Block |
---|
"description": "A large set of tweets related to Super Storm Sandy",
"isApproved": true,
"isPublic": false,
"mediaType": "Social",
"tags": [
"twitter",
"gnip"
],
"title": "Super Storm Sandy - Twitter: SANDY_SUBSTRING",
"processingPipeline": [
{
"file": {
"XmlPrimaryKey": "link",
"XmlSourceName": "",
"XmlRootLevelValues": [],
"domain": "XXX",
"password": "XXX",
"username": "XXX",
"url": "smb://HOST:139/SHARE/PATH/TO/"
}
}, |
Configuring XML
You can use XmlRootLevelValues
to set the root object for xml file parsing.
In the example below, the field "Incident" is set as the root object.
In addition, the parameter XmlIgnoreValues
is used to ignore certain xml nodes in the xml document.
XmlPrimaryKey
identifies the primary key in the data set, and is used to help identify whether a record is new or previously harvested.
XmlSourceName is used to build the URL of the new document that will be generated by the file extraction.
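Such a source might look like the following sketch ("Incident" is the root element from the example; the ignored node, primary key, and source-name prefix shown here are hypothetical):

```json
{
    "file": {
        "type": "xml",
        "XmlRootLevelValues": [ "Incident" ],
        "XmlIgnoreValues": [ "DefiningCharacteristicList" ],
        "XmlPrimaryKey": "icn",
        "XmlSourceName": "http://example.com/incident?id=",
        "url": "smb://HOST:139/SHARE/PATH/TO/"
    }
}
```

With this configuration, each "Incident" element becomes a document, and each document's URL is built as XmlSourceName + xml[XmlPrimaryKey].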
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Configuring SV
For SV type files, the root level values and field names can be set manually or automatically.
Specifying the Field Names Manually
You can use XmlRootLevelValues
to set the root level values/field names. When you do this, CSV parsing occurs automatically and the records are mapped into a metadata object called "csv" with the field names corresponding to the values of this array.
In the source example below, the field names will correspond to the included array: "device","date", "srcIP" etc.
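Such a configuration might be sketched as follows (fields beyond "device", "date", and "srcIP" are hypothetical):

```json
{
    "file": {
        "type": "csv",
        "XmlRootLevelValues": [ "device", "date", "srcIP", "dstIP", "alert" ],
        "url": "smb://FILESHARE:139/cyber_logs/"
    }
}
```

Each record is then mapped into the "csv" metadata object, with one field per entry of the array.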
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Deriving Field Names Automatically
The field names can also be derived automatically from the headers.
If you take this approach, it is necessary to both identify the header row and to specify how the field names will be identified.
Header Prefixed by String:
If XmlIgnoreValues
is set to "#", and the first three lines are "#", "#header", and "#field1,field2,field3", then processing assumes the 3 fields are field1, field2, and field3.
By default, the matching portion of the line (eg "#" in the example above) is removed.
To keep it, simply place the value in quotes (using the specified quote char).
eg, assuming the quote char is ', then setting the value to "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3"
In the example log file below, the header row is prefixed by '#'.
Code Block |
---|
#Device,Date,SrcIP,dstIP,Alert,Country
SCANNER_1,2012-01-01T13:43:00,10.0.0.1,66.66.66.66,DUMMY_ALERT_TYPE_1,United States |
In the example source below, XmlIgnoreValues
automatically identifies the header using "#". This also identifies the field names using the separator ",".
Code Block |
---|
"processingPipeline": [ {
"file": {
"XmlIgnoreValues": [
"#"
],
"domain": "DOMAIN",
"password": "PASSWORD",
"type": "csv",
"username": "USER",
"url": "smb://FILESHARE:139/cyber_logs/"
}
}, |
Header Not Prefixed by String:
In the case where the header row is not prefixed by a string, it is still necessary to identify it as the header row.
For example, consider a header row formatted as follows:
"field1,field2,field3"
In this case, XmlIgnoreValues
should be set to the following: [ "\"field1\"" ]
This identifies the header row and preserves "field1" as a field name.
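A minimal sketch, assuming the default quote character (") and a header row beginning with field1:

```json
{
    "file": {
        "type": "csv",
        "XmlIgnoreValues": [ "\"field1\"" ]
    }
}
```

Because the matched value is quoted, the header line is recognized and stripped, while "field1" itself is kept as the first field name.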
Panel |
---|
Footnotes:
Legacy documentation: |
Legacy documentation:
...
Using the File Harvester