Overview
The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (e.g. files uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.
...
"application/pdf:{'setEnableAutoSpace':false}"
where application/pdf will call PDFParser, and setEnableAutoSpace(false) corresponds to a paramName and paramValue.
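As a rough illustration of how such a string breaks down, the sketch below splits it into a media type and a set of parameter name/value pairs. The function name parse_parser_config is hypothetical, and this is not the extractor's own parsing code; it only assumes the single-quoted, JSON-like form shown above.

```python
import json


def parse_parser_config(entry):
    """Split "mediatype:{'param':value}" into (media_type, params).

    Hypothetical helper: assumes the single-quoted, JSON-like form
    used in the example string above.
    """
    # Everything before the first ":" is the media type.
    media_type, _, raw_params = entry.partition(":")
    # Swap single quotes for double quotes so the json module can read it.
    params = json.loads(raw_params.replace("'", '"'))
    return media_type, params


media_type, params = parse_parser_config(
    "application/pdf:{'setEnableAutoSpace':false}"
)
for param_name, param_value in params.items():
    print(media_type, param_name, param_value)
```

Each key in the resulting dictionary plays the role of a paramName and each value the role of its paramValue.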
...
JSON/XML/CSV
To ingest these file types, the following URL format must be used:
...
Code Block |
---|
{ "description": "wits test", "isPublic": true, "mediaType": "Report", "searchCycle_secs": -1, "tags": [ "incidents", "nctc", "terrorism", "wits", "events", "worldwide" ], "title": "wits test", "processingPipeline": [ { "file": { "XmlIgnoreValues": [ "DefiningCharacteristicList", "TargetedCharacteristicList", "WeaponTypeList", "PerpetratorList", "VictimList", "EventTypeList", "CityStateProvinceList", "FacilityList" ], "XmlPrimaryKey": "icn", "XmlRootLevelValues": [ "Incident" ], "XmlSourceName": "https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0&Ntk=ICN&Ntx=mode%20match&Ntt=", "domain": "XXX", "password": "XXX", "username": "XXX", "url": "smb://modus:139/wits/allfiles/" } }, |
Configuring CSV/SV
You can use XmlRootLevelValues to set the root object for CSV/SV file parsing.
When you do this, CSV parsing occurs automatically and the records are mapped into a metadata object called "csv" with the field names corresponding to the values of this array.
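The positional mapping described above can be pictured with a minimal sketch. It is not the extractor's implementation: the function name, the sample rows, and the field names are invented for illustration, and dropping columns beyond the end of the array is an assumption of this sketch rather than documented behavior.

```python
import csv
import io


def rows_to_csv_metadata(text, root_level_values, separator=","):
    """Map each record to {"csv": {field_name: value}} by column position."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip stops at the shorter sequence, so extra columns are dropped
        # (an assumption of this sketch).
        records.append({"csv": dict(zip(root_level_values, row))})
    return records


# Invented sample data; "fields" stands in for XmlRootLevelValues.
sample = "10.0.0.1,443,allow\n10.0.0.2,22,deny\n"
fields = ["src_ip", "port", "action"]
for record in rows_to_csv_metadata(sample, fields):
    print(record)
```

Here the third column of every row ends up under the name given by the third entry of the array.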
In the following sample code, the file extractor is configured to act on .csv content to set the root object and make additional configurations.
Code Block |
---|
{ "description": "For cyber demo", "isPublic": false, "mediaType": "Log", "searchCycle_secs": 3600, "tags": [ "cyber", "structured" ], "title": "Cyber Logs Test", "processingPipeline": [ { "file": { "XmlRootLevelValues": [], "domain": "DOMAIN", "password": "PASSWORD", "type": "csv", "username": "USER", "url": "smb://FILESHARE:139/cyber_logs/" } }, |
Info |
---|
.SV Files
You can use XmlRootLevelValues to determine the root level field of the XML file at which parsing should begin.
For "*sv" files, setting it causes CSV parsing to occur automatically, and the records are mapped into a metadata object called "csv", with the field names corresponding to the values of this array (e.g. the 3rd value is named after XmlRootLevelValues[2], and so on).
For .sv files you can use the XmlSourceName parameter to build the document URL.
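One plausible reading, sketched below, is that the document URL is formed by appending the value of the record's primary key field to XmlSourceName. This section does not state that rule, so treat it as an assumption; the function name and the record value are hypothetical, while the XmlSourceName and XmlPrimaryKey values are borrowed from the sample configuration earlier on this page.

```python
def build_document_url(file_config, record):
    """Assumed rule: document URL = XmlSourceName + primary key value."""
    key_field = file_config["XmlPrimaryKey"]
    return file_config["XmlSourceName"] + str(record[key_field])


file_config = {
    "XmlPrimaryKey": "icn",
    "XmlSourceName": (
        "https://wits.nctc.gov/FederalDiscoverWITS/index.do"
        "?N=0&Ntk=ICN&Ntx=mode%20match&Ntt="
    ),
}
record = {"icn": "EXAMPLE-ICN"}  # made-up key value for illustration
print(build_document_url(file_config, record))
```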
Using XmlIgnoreValues to Derive Field Names Automatically
The field names can also be derived automatically by setting XmlIgnoreValues. In this case, XmlRootLevelValues need not be set.
For "*sv" files, the start of each line is compared to each of the strings in this array; if any of them matches, the line is ignored. This allows header lines to be ignored.
In addition, the first ignored line that consists of more than one separator-delimited field is used to generate the field names.
Example:
If "XmlIgnoreValues": "#", and the first three lines are "#", "#header", and "#field1,field2,field3", then the processing will assume the 3 fields are field1, field2, and field3.
By default, the matching portion of the line (e.g. "#" in the example above) is removed.
To keep it, simply place the value in quotes (using the specified quote char).
E.g. assuming the quote char is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3".
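The rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the extractor's code: the function name, the default separator, and the single-character quote char are choices made for the sketch.

```python
def derive_field_names(lines, ignore_values, separator=",", quote_char="'"):
    """Return (field_names, data_lines) following the rules described above."""
    # An ignore value wrapped in the quote char keeps its prefix in the names.
    prefixes = []
    for value in ignore_values:
        quoted = (
            len(value) >= 3
            and value[0] == quote_char
            and value[-1] == quote_char
        )
        prefixes.append((value[1:-1] if quoted else value, quoted))

    field_names = None
    data_lines = []
    for line in lines:
        match = next(
            ((prefix, keep) for prefix, keep in prefixes if line.startswith(prefix)),
            None,
        )
        if match is None:
            data_lines.append(line)  # not a header line: keep as data
            continue
        # Ignored line: the first one with more than one field names the columns.
        if field_names is None:
            prefix, keep = match
            header = line if keep else line[len(prefix):]
            parts = header.split(separator)
            if len(parts) > 1:
                field_names = parts
    return field_names, data_lines


sample = ["#", "#header", "#field1,field2,field3", "a,b,c"]
print(derive_field_names(sample, ["#"]))
print(derive_field_names(sample, ["'#'"]))
```

With the unquoted ignore value the derived names are field1, field2 and field3; with the quoted value the first name keeps its "#" prefix.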
TODO insert source example
...
Legacy documentation:
File object