...

  • Windows/Samba shares
  • harvester's local filesystem
  • Amazon S3

...

File Types

The File Extractor supports the following file types:

  • Office documents (Word, PowerPoint, etc.)
  • text-based documents (emails)
  • CSV
  • XML and JSON
  • Infinit.e shares
  • The results of Infinit.e plugins

...

Connecting to File Locations

...

"file://<path including leading '/'>"

 

Note

"file://" sources can only be run by administrators if secure mode is enabled (harvest.secure_mode=true in the configuration).

Local filesystem usage is mostly intended for testing, debugging, and "micro installations". The "tomcat" user must have read access to the directories and files on the path.
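
As a rough illustration of the two rules above, the following Python sketch builds a "file://" source URL and applies the secure-mode restriction. The function names and the is_admin flag are hypothetical and not part of the harvester; only the URL format and the harvest.secure_mode rule come from the text.

```python
def local_source_url(path: str) -> str:
    """Build a "file://" URL from a path that includes the leading '/'."""
    if not path.startswith("/"):
        raise ValueError("path must include the leading '/'")
    return "file://" + path


def may_run_source(url: str, is_admin: bool, secure_mode: bool) -> bool:
    """With secure mode on, only administrators may run "file://" sources."""
    if url.startswith("file://") and secure_mode:
        return is_admin
    # Assumption: this particular rule places no restriction on other cases.
    return True
```

For example, local_source_url("/opt/data/docs/") yields "file:///opt/data/docs/", which a non-administrator could not run while secure mode is enabled.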

...

The source must share at least one community with the custom plugin in order to be processed.
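
The community requirement amounts to a non-empty intersection between two sets of community identifiers. A minimal sketch, assuming each side exposes its communities as a list of IDs (the function and argument names are hypothetical):

```python
def shares_community(source_communities, plugin_communities) -> bool:
    """True if the source and the custom plugin have at least one community in common."""
    return not set(source_communities).isdisjoint(plugin_communities)
```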

 

...

Windows/Samba

To connect to a Windows/Samba share, the following URL format must be used:

"smb://server:port/path" 

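To check that a share URL follows the "smb://server:port/path" shape before saving a source, something like the following Python sketch could be used. It relies only on the standard library; the function name and the example server, port, and path are hypothetical.

```python
from urllib.parse import urlsplit


def parse_smb_url(url: str) -> dict:
    """Split an smb:// URL into its server, port, and path components."""
    parts = urlsplit(url)
    if parts.scheme != "smb" or not parts.hostname:
        raise ValueError("expected a URL of the form smb://server:port/path")
    # parts.port is None when the URL carries no explicit port.
    return {"server": parts.hostname, "port": parts.port, "path": parts.path}


info = parse_smb_url("smb://fileserver:445/share/docs/")
```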

...

Amazon S3

To connect to an Amazon S3 location, the following URL format must be used:

...

The files in the S3 bucket should be readable by the account specified by the access key.

Note

S3 is not supported by default; the AWS SDK JAR must be copied into the classpath as described here.

...

For S3, the Access ID should be entered into the "username" field, and the Secret Key into the "password" field.

Note

For security, it is recommended that you create a separate AWS user with no permissions other than S3 read/list on the relevant directories.
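
One way to express "read/list only" is an IAM policy that allows just s3:ListBucket on the bucket and s3:GetObject on the objects under a prefix. The Python sketch below generates such a policy document; the bucket name and prefix are placeholders, and the exact permissions your deployment needs are an assumption to verify against the AWS documentation.

```python
import json


def read_only_s3_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting list and read access only."""
    bucket_arn = "arn:aws:s3:::" + bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing is granted on the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading is granted on objects under the given prefix.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [bucket_arn + "/" + prefix + "*"],
            },
        ],
    }


# Placeholder bucket and prefix, for illustration only.
print(json.dumps(read_only_s3_policy("example-harvest-bucket", "docs/"), indent=2))
```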

...

This field can be left blank for Amazon S3 environments.

 

...

File Types

This section describes the configurations for the various supported file types.

...

For .sv files you can use the XmlSourceName parameter to build the document URL.

 

Note

XmlRootLevelValues must be set.

...

You can use XmlPrimaryKey to help identify whether a record is new or previously harvested. This requires that the parameter XmlRootLevelValues has been set.
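
Conceptually, a primary key lets the harvester tell new records apart from ones it has already seen. The Python sketch below illustrates that idea only; it is not the harvester's implementation, and the record layout, the "id" field, and the in-memory set of seen keys are all assumptions made for the example.

```python
def select_new_records(records, primary_key, seen_keys):
    """Return records whose primary-key value has not been seen, updating seen_keys."""
    new_records = []
    for record in records:
        key = record[primary_key]
        if key not in seen_keys:
            seen_keys.add(key)
            new_records.append(record)
    return new_records


seen = {"1001"}  # keys from a hypothetical earlier harvest
rows = [{"id": "1001", "title": "old"}, {"id": "1002", "title": "new"}]
fresh = select_new_records(rows, "id", seen)
```

Here only the record with id "1002" is treated as new, and its key is added to the set for the next run.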

 

...

Office Files 

You can use the XmlRootLevelValues parameter to configure Apache Tika for parsing of Office-type files.

...

Example: "application/pdf:{'setEnableAutoSpace':false}" ... will call PDFParser.setEnableAutoSpace(false)  
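
To make the structure of such an entry explicit, the Python sketch below splits it into the MIME type and a map of setter names to arguments. This is only an illustration of how the string reads, not how the harvester parses it; the function name is hypothetical, and the quote replacement assumes the values themselves contain no quote characters.

```python
import json


def parse_tika_entry(entry: str):
    """Split "mime/type:{'setter':value}" into (mime_type, {setter: value})."""
    mime_type, _, settings = entry.partition(":")
    # The entry uses single quotes; swap them so the JSON parser accepts it.
    return mime_type, json.loads(settings.replace("'", '"'))


mime_type, setters = parse_tika_entry("application/pdf:{'setEnableAutoSpace':false}")
for setter, argument in setters.items():
    print(mime_type, setter, argument)
```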


...

JSON/XML

 

For JSON files the parameter XmlIgnoreValues is not applicable.

...

You can use the parameter XmlPrimaryKey to help identify whether a record is new or previously harvested.

 

Note

For XML and JSON files where the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (e.g. "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. For JSON, dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner, but there is a single (not necessarily unique) URI related to the document, e.g. a network resource or a file in a sub-directory of the fileshare, it is recommended to use the structured analysis handler to set the "displayUrl" parameter.
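
The "CONSTANT_URL_PATH + VARIABLE_ID" scheme can be pictured as a simple concatenation, with dot notation walking into nested JSON fields. The Python sketch below illustrates that reading; the helper names and the sample record (including its "meta.pageNum" field) are hypothetical, and the harvester's actual behaviour may differ in details.

```python
def lookup(record: dict, dotted_key: str):
    """Follow a dot-notation key such as "meta.pageNum" into a nested object."""
    value = record
    for part in dotted_key.split("."):
        value = value[part]
    return value


def document_url(record: dict, xml_source_name: str, xml_primary_key: str) -> str:
    """Concatenate the constant URL path with the record's variable ID."""
    return xml_source_name + str(lookup(record, xml_primary_key))


record = {"title": "Example page", "meta": {"pageNum": 3454354}}
url = document_url(record, "http://www.website.com?pageNum=", "meta.pageNum")
```

With these inputs, url reproduces the example address given in the note above.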

 

...

CSV Files

 

Note

For "*csv" files (where XmlRootLevelValues is set) in which the document(s) within the file reference a unique network resource of the format "CONSTANT_URL_PATH + VARIABLE_ID" (e.g. "http://www.website.com?pageNum=3454354"), and the "VARIABLE_ID" component is one of the fields in the XML/JSON object, "XmlSourceName" and "XmlPrimaryKey" can be used to specify the two components. Note that for JSON the dot notation can be used in "XmlPrimaryKey" for nested fields.

If it is not possible to specify the URL in this manner, but there is a single (not necessarily unique) URI related to the document, e.g. a network resource or a file in a sub-directory of the fileshare, it is recommended to use the structured analysis handler to set the "displayUrl" parameter.

 

 

 

IN PROGRESS

Legacy documentation:

...