Page Comparison

Overview

The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (eg uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

...

"application/pdf:{'setEnableAutoSpace':false}"

where application/pdf will calll PDFParser. and setEnableAutoSpace(false)corresponds to a paramName and paramValue.

...

JSON/XML/CSV

To connect to these file types, the following url format must be used:

...

Code Block

{
    "description": "wits test",
    "isPublic": true,
    "mediaType": "Report",
    "searchCycle_secs": -1,
    "tags": [
        "incidents",
        "nctc",
        "terrorism",
        "wits",
        "events",
        "worldwide"
    ],
    "title": "wits test",
    "processingPipeline": [
        {
            "file": {
                "XmlIgnoreValues": [
                    "DefiningCharacteristicList",
                    "TargetedCharacteristicList",
                    "WeaponTypeList",
                    "PerpetratorList",
                    "VictimList",
                    "EventTypeList",
                    "CityStateProvinceList",
                    "FacilityList"
                ],
                "XmlPrimaryKey": "icn",
                "XmlRootLevelValues": [
                    "Incident"
                ],
                "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://modus:139/wits/allfiles/"
            }
        },

Configuring CSV/SV

You can use XmlRootLevelValues to set the root object for CSV/SV file parsing.

When you do this, CSV parsing occurs automatically and the records are mapped into a metadata object called "csv" with the field names corresponding to the values of this array.

In the following sample code, the file extractor is configured to act on .csv content to set the root object and make additional configurations.

Code Block

{
    "description": "For cyber demo",
    "isPublic": false,
    "mediaType": "Log",
    "searchCycle_secs": 3600,
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "Cyber Logs Test",
    "processingPipeline": [
        {
            "file": {
                "XmlRootLevelValues": [],
                "domain": "DOMAIN",
                "password": "PASSWORD",
                "type": "csv",
                "username": "USER",
                "url": "smb://FILESHARE:139/cyber_logs/"
            }
        },

Info

For "*csv" files where XmlRootLevelValues is set), Where the document(s) within the file references a unique network resource that is of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, then "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

.SV Files

You can use XmlRootLevelValues to determine the root level field of the XML file at which parsing should begin.

For "*sv" files, this results in CSV parsing occurring automatically, and the records are mapped into a metadata object called "csv", with the fieldnames corresponding to the values of this array (eg the 3rd value is named after XmlRootLevelValues[2] etc)

The fieldnames can also be derived automatically by setting XmlIgnoreValues. In this case, XmlRootLevelValues need not be set.

For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.

In addition, the first line matching an ignore value field that consists of the more than 1 token-separated field will be used to generate the fieldnames.
eg if "XmlIgnoreValues": "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.
By default, the matching portion of the line (eg "#" in the example above) is removed. To not remove it then simple place the value in quotes (using the specified quote char)
eg assuming the quote char is ', then "`#`" in the above example would return 3 fields: "#field1", "field2" and "field3"

For .sv files you can use the XmlSourcename parameter to build the document url.

Using XmlIgnore Values to Derive Field Names Automatically

The fieldnames can also be derived automatically by setting XmlIgnoreValues. In this case, XmlRootLevelValues need not be set.

For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.

In addition, the first line matching an ignore value field that consists of the more than 1 token-separated field will be used to generate the fieldnames.

Example:

If "XmlIgnoreValues": "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.

By default, the matching portion of the line (eg "#" in the example above) is removed.

To not remove it then simple place the value in quotes (using the specified quote char).

eg assuming the quote char is ', then "`#`" in the above example would return 3 fields: "#field1", "field2" and "field3"

TODO insert source example

Info

For "*csv" files where XmlRootLevelValues is set), where the document(s) within the file references a unique network resource that is of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, then "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

Panel
Footnotes: Legacy documentation: File object Legacy documentation: Using the File Harvester

...

Versions Compared

Old Version 22

New Version 23

Key

JSON/XML/CSV

Configuring CSV/SV

.SV Files

Using XmlIgnore Values to Derive Field Names Automatically