Overview

The File Extractor ingests documents from local files, fileshares, S3 repositories, and Infinit.e shares (eg uploaded via the file uploader). It can also be used to ingest the output of custom analytic plugins.

This page has been broken down into the following sections for ease of localization.

Table of Contents

TODO

Format

...

Connecting to File Locations

The configuration depends on the location of the files you are trying to extract.

...

Note
S3 is not supported by default: the AWS SDK JAR must be copied into the classpath as described here.

...

Username/

...

Password

A username/password is required to connect to your Amazon S3 environment.

...

Note
For security, it is recommended that you create a separate AWS user with no permissions other than S3 read/list on the directories.

...

Domain

This field can be left blank for Amazon S3 environments.

...

The fieldnames can also be derived automatically by setting XmlIgnoreValues. In this case, XmlRootLevelValues need not be set.

XmlIgnoreValues

...

For "*sv" files the start of each line is compared to each of the strings in this array - if they match the line is ignored. This allows header lines to be ignored.

In addition, the first line that matches an ignore value and consists of more than one token-separated field is used to generate the fieldnames.
- eg if "XmlIgnoreValues": "#", and the first three lines are "#", "#header", and "#field1,field2,field3" then the processing will assume the 3 fields are field1, field2, and field3.
- By default, the matching portion of the line (eg "#" in the example above) is removed. To keep it, simply place the value in quotes (using the specified quote char)
  - eg assuming the quote char is ', then "'#'" in the above example would return 3 fields: "#field1", "field2" and "field3"
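
The fieldname rules above can be illustrated with a short Python sketch. It only models the behaviour described in this section and is not the extractor's actual implementation. The function name derive_fieldnames, its parameters, and the naive split on the separator (which does not handle quoted fields) are assumptions made for the example.

```python
def derive_fieldnames(lines, ignore_values, separator=",", quote_char='"'):
    """Illustrative model of the XmlIgnoreValues rules (hypothetical helper).

    Returns (fieldnames, data_lines): fieldnames come from the first ignored
    line that holds more than one separated field; data_lines are the lines
    that matched no ignore value.
    """
    fieldnames = None
    data_lines = []
    for line in lines:
        matched = None
        for value in ignore_values:
            # Assumption: a value wrapped in the quote char means
            # "match on the inner text, but keep it in the header line".
            keep_prefix = (
                len(value) >= 2
                and value[0] == quote_char
                and value[-1] == quote_char
            )
            prefix = value[1:-1] if keep_prefix else value
            if prefix and line.startswith(prefix):
                matched = (prefix, keep_prefix)
                break
        if matched is None:
            data_lines.append(line)
            continue
        if fieldnames is None:
            prefix, keep_prefix = matched
            header = line if keep_prefix else line[len(prefix):]
            fields = header.split(separator)  # naive split, no quoted fields
            if len(fields) > 1:
                fieldnames = fields
    return fieldnames, data_lines


sample = ["#", "#header", "#field1,field2,field3", "a,b,c"]
print(derive_fieldnames(sample, ["#"]))
print(derive_fieldnames(sample, ["'#'"], quote_char="'"))
```

With the sample lines, the first call yields the fieldnames field1, field2, field3; the second call, using the quoted ignore value, keeps the "#" on the first fieldname.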

XmlSourceName

For "*sv" files you can use the XmlSourceName parameter to build the document URL.

Note
XmlRootLevelValues must be set.

XmlPrimaryKey

You can use XmlPrimaryKey to help identify whether a record is new or previously harvested. This requires that the parameter XmlRootLevelValues has been set.

Office Files

You can use the XmlRootLevelValues parameter to configure Apache Tika for parsing of Office-type files.

...

Example: "application/pdf:{'setEnableAutoSpace':false}" ... will call PDFParser.setEnableAutoSpace(false)

JSON/XML

For JSON files the parameter XmlIgnoreValues is not applicable.

You can use XmlSourceName to build the document URL. If specified, the document URL is built as "XmlSourceName" + xml("XmlPrimaryKey").

You can use the parameter XmlPrimaryKey to help identify whether a record is new or previously harvested.

Note

For XML and JSON files where the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. For JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.
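
The "XmlSourceName" + "XmlPrimaryKey" construction described above can be sketched in Python as follows. This is only an illustration of the concatenation and of dot notation for nested JSON fields. The helper names resolve_dotted and build_document_url, the sample record, and the choice to return None when the key is missing are assumptions for the example, not documented extractor behaviour.

```python
def resolve_dotted(record, dotted_key):
    """Follow a dot-notation key (eg "meta.pageNum") through nested dicts."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # assumption: a missing field yields no value
        value = value[part]
    return value


def build_document_url(record, source_name, primary_key):
    """Concatenate the constant URL path with the record's key field."""
    key_value = resolve_dotted(record, primary_key)
    if key_value is None:
        return None  # assumption: no URL can be built without the key
    return source_name + str(key_value)


# Hypothetical JSON record in which the variable ID is a nested field.
record = {"title": "Example", "meta": {"pageNum": 3454354}}
print(build_document_url(record, "http://www.website.com?pageNum=", "meta.pageNum"))
```

Here "http://www.website.com?pageNum=" plays the role of XmlSourceName and "meta.pageNum" the role of XmlPrimaryKey.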

CSV Files

Note

For "*csv" files (where XmlRootLevelValues is set) in which the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (eg "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner (but there is a single - not necessarily unique - URI that is related to the document - eg either a network resource or a file in a sub-directory of the fileshare), it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

IN PROGRESS

Legacy documentation:

...

