...

This section describes the configurations for the various supported file types.

...

XML Files

You can use XmlRootLevelValues to determine the root level field of the XML file at which parsing should begin.


...

You can use XmlPrimaryKey to help identify whether a record is new or previously harvested. This requires that the parameter XmlRootLevelValues has been set.

 



 

...

The following snippet illustrates the use of the file extractor parameters on XML files. The sample source acts on incident reports.

Code Block
{
    "description": "wits test",
    "isPublic": true,
    "mediaType": "Report",
    "searchCycle_secs": -1,
    "tags": [
        "incidents",
        "nctc",
        "terrorism",
        "wits",
        "events",
        "worldwide"
    ],
    "title": "wits test",
    "processingPipeline": [
        {
            "file": {
                "XmlIgnoreValues": [
                    "DefiningCharacteristicList",
                    "TargetedCharacteristicList",
                    "WeaponTypeList",
                    "PerpetratorList",
                    "VictimList",
                    "EventTypeList",
                    "CityStateProvinceList",
                    "FacilityList"
                ],
                "XmlPrimaryKey": "icn",
                "XmlRootLevelValues": [
                    "Incident"
                ],
                "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=",
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://modus:139/wits/allfiles/"
            }
        },

 

In the example, the parameter XmlIgnoreValues is used to ignore certain XML nodes in the XML document.

Similarly, XmlRootLevelValues is used to specify the XML root level node at which parsing should begin.

XmlPrimaryKey identifies the primary key in the data set, and is used to help identify whether a record is new or previously harvested.

XmlSourceName is used to build the URL of the new document that will be generated by the file extraction.
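
The four parameters described above can be pictured with a small Python sketch. It is not the file extractor's implementation: the function name extract_records, the sample XML, the "Subject" field and the example.invalid URL are all hypothetical, and the URL rule (XmlSourceName followed by the value of the XmlPrimaryKey field) is taken from the JSON/XML description on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only "Incident", "icn" and "WeaponTypeList"
# are names that appear in the configuration above.
SAMPLE_XML = """
<Incidents>
  <Incident>
    <icn>200458431</icn>
    <Subject>Example subject</Subject>
    <WeaponTypeList><WeaponType>Firearm</WeaponType></WeaponTypeList>
  </Incident>
  <Incident>
    <icn>200458432</icn>
    <Subject>Another subject</Subject>
  </Incident>
</Incidents>
"""


def extract_records(xml_text, root_level_values, ignore_values, primary_key, source_name):
    """Return one record per root-level element, skipping ignored child nodes."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter():
        # XmlRootLevelValues: only these elements start a new record.
        if element.tag not in root_level_values:
            continue
        # XmlIgnoreValues: child nodes with these names are dropped.
        fields = {
            child.tag: (child.text or "").strip()
            for child in element
            if child.tag not in ignore_values
        }
        # Assumed URL rule: XmlSourceName + value of the XmlPrimaryKey field.
        key = fields.get(primary_key)
        url = source_name + key if source_name and key else None
        records.append({"url": url, "fields": fields})
    return records


records = extract_records(
    SAMPLE_XML,
    root_level_values=["Incident"],
    ignore_values=["WeaponTypeList"],
    primary_key="icn",
    source_name="https://example.invalid/?id=",
)
```

In this sketch each Incident element becomes one record, the WeaponTypeList node is left out of its fields, and the record URL ends in the icn value.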

.SV Files

You can use XmlRootLevelValues to determine the root level field of the XML file at which parsing should begin.

For "*sv" files, setting this parameter causes CSV parsing to occur automatically. The records are mapped into a metadata object called "csv", with the fieldnames corresponding to the values of this array (e.g. the 3rd value is named after XmlRootLevelValues[2]).

The fieldnames can also be derived automatically by setting XmlIgnoreValues. In this case, XmlRootLevelValues need not be set.
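
As a rough illustration of the mapping described above, the following Python sketch names each value in a delimited line after the entry at the same position in XmlRootLevelValues and stores the result under "csv". The function name map_sv_records, the sample log lines and the field names are hypothetical, and dropping surplus columns is an assumption of the sketch rather than documented behaviour.

```python
import csv
import io


def map_sv_records(text, root_level_values, separator=","):
    """Map each delimited line to {"csv": {field_name: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    documents = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip stops at the shorter sequence, so surplus columns are
        # dropped here (an assumption of this sketch).
        documents.append({"csv": dict(zip(root_level_values, row))})
    return documents


docs = map_sv_records(
    "10.0.0.1,GET,200\n10.0.0.2,POST,404\n",
    ["ip", "method", "status"],
)
```

Here the 3rd value of each line is stored as docs[n]["csv"]["status"], because "status" is the entry at index 2 of the array.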

 

For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.

  • In addition, the first line that matches an ignore value and consists of more than 1 token-separated field will be used to generate the fieldnames.
    • e.g. if "XmlIgnoreValues": [ "#" ], and the first three lines are "#", "#header", and "#field1,field2,field3", then the processing will assume the 3 fields are field1, field2, and field3.
    • By default, the matching portion of the line (e.g. "#" in the example above) is removed. To keep it, simply place the value in quotes (using the specified quote char).
      • e.g. assuming the quote char is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3".
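
The header rules in the list above can be sketched in Python as follows. This is only an illustration of the described behaviour: the function name derive_field_names and its arguments are hypothetical, and the separator and quote char are assumed to be "," and "'".

```python
def derive_field_names(lines, ignore_values, separator=",", quote_char="'"):
    """Return (field_names, data_lines) following the ignore-value rules."""
    field_names = None
    data_lines = []
    for line in lines:
        matched = None
        for raw in ignore_values:
            # A quoted ignore value means "match, but keep the prefix".
            quoted = len(raw) >= 2 and raw[0] == quote_char and raw[-1] == quote_char
            prefix = raw[1:-1] if quoted else raw
            if prefix and line.startswith(prefix):
                matched = (prefix, quoted)
                break
        if matched is None:
            data_lines.append(line)  # ordinary data line
            continue
        prefix, keep_prefix = matched
        if field_names is None:
            # The first ignored line with more than one field names the columns.
            candidate = line if keep_prefix else line[len(prefix):]
            tokens = candidate.split(separator)
            if len(tokens) > 1:
                field_names = tokens
    return field_names, data_lines


sample = ["#", "#header", "#field1,field2,field3", "a,b,c"]
plain = derive_field_names(sample, ["#"])
quoted = derive_field_names(sample, ["'#'"])
```

With the unquoted value the "#" is stripped from the first fieldname; with the quoted value it is kept, as in the example in the list.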

For .sv files you can use the XmlSourceName parameter to build the document URL.

 

Info

XmlRootLevelValues must be set.

 

...

Office Files 

You can use the XmlRootLevelValues parameter to configure Apache Tika for parsing of Office-type files.

There are currently 2 types of configuration supported:

  • "output:xml" or "output:html" to change the output of Tika from raw text to XML or HTML.
  • Strings of the format "MEDIATYPE:{ paramName: paramValue, ...}" - <MEDIATYPE> is in standard MIME format and determines which Tika element to configure; the paramNames and paramValues correspond to functions and arguments.

Example:

"application/pdf:{'setEnableAutoSpace':false}" ... will call PDFParser.setEnableAutoSpace(false)
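
To make the two configuration formats more concrete, the Python sketch below splits a list of XmlRootLevelValues entries into an output format and a set of per-media-type settings. It is an assumption-laden illustration, not how the platform passes options to Tika: the function name parse_tika_options is hypothetical, "text" is used as a stand-in for the default raw text output, and only the single-quoted parameter form shown in the example above is handled.

```python
import json


def parse_tika_options(root_level_values):
    """Split entries into (output_format, {media_type: {setter: argument}})."""
    output_format = "text"  # stand-in for Tika's default raw text output
    parser_settings = {}
    for entry in root_level_values:
        # Split on the first colon only; the parameter body has colons too.
        name, _, rest = entry.partition(":")
        if name == "output":
            output_format = rest
            continue
        # The documented example uses single quotes; swap them so the body
        # parses as JSON (breaks if a value itself contains a quote).
        params = json.loads(rest.replace("'", '"'))
        parser_settings.setdefault(name, {}).update(params)
    return output_format, parser_settings


fmt, settings = parse_tika_options(
    ["output:xml", "application/pdf:{'setEnableAutoSpace':false}"]
)
```

Each key in the per-media-type dictionary corresponds to a setter name and each value to the argument that would be passed to it.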


...

JSON/XML

The following code sample is used to parse a large selection of tweets using the file extractor.

Code Block
{
    "description": "A large set of tweets related to Super Storm Sandy",
    "isApproved": true,
    "isPublic": false,
    "mediaType": "Social",
    "tags": [
        "twitter",
        "gnip"
    ],
    "title": "Super Storm Sandy - Twitter: SANDY_SUBSTRING",
    "processingPipeline": [
        {
            "file": {
                "XmlPrimaryKey": "link",
                "XmlSourceName": "",
                "XmlRootLevelValues": [],
                "domain": "XXX",
                "password": "XXX",
                "username": "XXX",
                "url": "smb://HOST:139/SHARE/PATH/TO/"
            }
        },

 

 

For JSON files the parameter XmlIgnoreValues is not applicable.

You can use XmlSourceName to build the document URL. If specified, the document URL is built as "XmlSourceName" + xml("XmlPrimaryKey").

You can use the parameter XmlPrimaryKey to help identify whether a record is new or previously harvested.

 

Info

For XML and JSON files where each document within the file references a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (e.g. "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. For JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.
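
The "CONSTANT_URL_PATH + VARIABLE_ID" rule, including dot notation for nested JSON fields, can be sketched in Python as below. The helper names lookup and build_document_url and the sample record are hypothetical, and returning None when the key is missing is an assumption of the sketch.

```python
def lookup(record, dotted_key):
    """Follow a dot-notation key (e.g. "page.num") through nested dicts."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def build_document_url(record, source_name, primary_key):
    """Concatenate XmlSourceName with the value of the XmlPrimaryKey field."""
    key_value = lookup(record, primary_key)
    if key_value is None:
        return None  # assumption: no URL can be built without the key
    return source_name + str(key_value)


url = build_document_url(
    {"page": {"num": 3454354}},
    "http://www.website.com?pageNum=",
    "page.num",
)
```

With an empty XmlSourceName, as in the tweet sample above, this sketch would return the value of the primary key field itself (there, the "link" field).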

 

...

CSV Files

In the following sample code, the file extractor is configured to act on .csv content.

Code Block
{
    "description": "For cyber demo",
    "isPublic": false,
    "mediaType": "Log",
    "searchCycle_secs": 3600,
    "tags": [
        "cyber",
        "structured"
    ],
    "title": "Cyber Logs Test",
    "processingPipeline": [
        {
            "file": {
                "XmlRootLevelValues": [],
                "domain": "DOMAIN",
                "password": "PASSWORD",
                "type": "csv",
                "username": "USER",
                "url": "smb://FILESHARE:139/cyber_logs/"
            }
        },

 

 

Info

For "*csv" files (where XmlRootLevelValues is set) in which each document within the file references a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (e.g. "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields of the record, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

...