...
The File Extractor ingests files of various types from the configured locations and processes them according to the source configuration.
...
The File Extractor is capable of ingesting files from the following locations:
- Windows/Samba shares
- The harvester's local filesystem
- Amazon S3
The File Extractor supports the following file types:
- Office documents (Word, PowerPoint etc.)
- Text-based documents (emails)
- CSV
- XML and JSON
- Infinit.e shares
- The results of Infinit.e plugins
The following table describes the parameters of the file extractor configuration.
Field | Description | Data Type
---|---|---
username | Username for file share authentication | string
password | Password for file share authentication | string
domain | Domain location of the file share | string
pathInclude | Optional. A regular expression; only files whose complete path matches it are processed further. | string
pathExclude | Optional. A regular expression; files whose complete path matches it are ignored (and matching directories are not traversed). | string
renameAfterParse | Optional. Renames files after they have been ingested. The substitution variables "$name" and "$path" are supported (e.g. "$path/processed/$name"); "" or "." deletes the file instead. | string
type | One of "json", "xml", "tika", "*sv", or null to decide automatically | string
mode | "normal" (the default if mode is not present) or "streaming"; see below. "mode" (from v0.3) is only applied in JSON/XML/*sv modes. | string
XmlRootLevelValues | The root-level value of the XML at which parsing should begin. Also currently used as an optional field for JSON: if present, a document is created each time that field is encountered; if left blank for JSON, the file is assumed to consist of a list of concatenated JSON objects and a document is created from each one. Also reused with a completely different meaning for CSV (see below). In office mode, can be used to configure Tika (see below). | string
XmlIgnoreValues | XML values that are ignored when parsed; their child elements remain part of the document metadata but are promoted to the parent level. Also reused with a completely different meaning for CSV. | string
XmlSourceName | If present, and the primary key specified below is also found, the URL is built as XmlSourceName + xml[XmlPrimaryKey]. Also supported for JSON and CSV. | string
XmlPrimaryKey | Parent to XmlRootLevelValues. This key is used to build the URL as described above. Also supported for JSON and CSV. | string
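
The path filters, the rename template and the URL construction described in the table can be illustrated with a short Python sketch. It is not the harvester's implementation. The helper names (`should_process`, `post_parse_action`, `build_url`), the sample values in `file_config`, and the choice of whole-path matching via `re.fullmatch` are assumptions made for illustration; only the field names come from the table.

```python
import re
from typing import Optional, Tuple

# Hypothetical configuration: field names are from the table, values are made up.
file_config = {
    "username": "harvester",
    "password": "change-me",
    "domain": "WORKGROUP",
    "pathInclude": r".*\.xml",
    "pathExclude": r".*/archive/.*",
    "renameAfterParse": "$path/processed/$name",
    "type": "xml",
    "XmlSourceName": "http://example.com/incidents/",
    "XmlPrimaryKey": "id",
}


def should_process(path: str, config: dict) -> bool:
    """Apply pathExclude, then pathInclude, to a complete file path.

    Assumption: each regex must match the whole path (re.fullmatch).
    """
    exclude = config.get("pathExclude")
    include = config.get("pathInclude")
    if exclude and re.fullmatch(exclude, path):
        return False
    if include and not re.fullmatch(include, path):
        return False
    return True


def post_parse_action(directory: str, name: str, config: dict) -> Tuple[str, Optional[str]]:
    """Decide what happens to a file after ingestion.

    Returns ("keep", None) when renameAfterParse is absent,
    ("delete", None) when it is "" or ".",
    and ("rename", new_path) otherwise.
    """
    template = config.get("renameAfterParse")
    if template is None:
        return ("keep", None)
    if template in ("", "."):
        return ("delete", None)
    # Substitute the two supported variables.
    new_path = template.replace("$path", directory).replace("$name", name)
    return ("rename", new_path)


def build_url(record: dict, config: dict) -> Optional[str]:
    """Build the document URL as XmlSourceName + record[XmlPrimaryKey].

    Returns None when either setting is missing or the key is not in the record.
    """
    source_name = config.get("XmlSourceName")
    primary_key = config.get("XmlPrimaryKey")
    if not source_name or not primary_key or primary_key not in record:
        return None
    return source_name + str(record[primary_key])
```

With this sample configuration, a path such as `smb://host/share/data/a.xml` passes both filters, while anything under an `archive` directory is skipped before the include pattern is even consulted. Whether the real extractor requires a whole-path match or a partial match is not stated here, so test patterns against a small directory first.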
Locations
The File Extractor is capable of ingesting files from the following locations:
...
File Types
The File Extractor supports the following file types:
...
Connecting to File Locations
...
This field can be left blank for Amazon S3 environments.
File Types
This section describes the configurations for the various supported file types.
...