Using the Database Harvester
There is a separate reference page for the Database Harvester configuration object.
Infinit.e supports harvesting data from traditional RDBMS (Relational Database Management Systems) using JDBC (Java Database Connectivity) drivers. The Sample Database Harvester Specification below demonstrates how to connect to and extract data from a database using the harvester:
Sample Database Harvester Specification
source : {
    ...
    "extractType" : "Database",
    "authentication" : {
        "username" : "username",
        "password" : "password"
    },
    "database" : {
        "databaseType" : "mysql",
        "hostname" : "my.databaseserver.com",
        "port" : "3306",
        "databaseName" : "database",
        "query" : "SELECT * FROM IncidentReport",
        "deltaQuery" : "SELECT * FROM IncidentReport WHERE REPORTDATETIME >= (SELECT ADDDATE(CURDATE(),-7))",
        "deleteQuery" : "",
        "primaryKey" : "NID",
        "title" : "CCN",
        "snippet" : "OFFENSE",
        "publishedDate" : "REPORTDATETIME"
    },
    "useExtractor" : "none",
    ...
}
- extractType
The extractType field tells the harvester the type of source to extract from, in this case Database. Other valid values include File, Feed, etc.
- authentication
The Authentication object of the Source document is a subset of the full Authentication object in that it only uses the 'username' and 'password' fields. The Database Harvester uses the username and password from the Authentication object as database credentials (if needed).
  - username
  - password - needs to be encrypted; download the jasypt command line utility to encrypt it (link)
- database
The Database object specifies how to access the data to be extracted and how to extract the individual fields within the source's data records.
  - databaseType
  The type of RDBMS to connect to. Valid values currently include: mysql, db2, oracle, mssqlserver, sybase.
  - hostname
  The hostname of the database server to connect to, e.g. "my.databaseserver.com" in the example above.
  - port
  The port on which the database accepts incoming connections.
  - databaseName
  The name of the database to connect to.
  - query
  The query field specifies the SQL used to perform a full extraction of data for the source. This is generally used the first time the harvester extracts data from a source, with incremental extractions being specified using the deltaQuery below.
  - deltaQuery
  The deltaQuery field specifies the SQL that extracts data from the source RDBMS based on one or more delta values, e.g. the created or modified date of a record.
  - deleteQuery
  Note: The deleteQuery functionality of the Database Harvester is not implemented in the Beta version of Infinit.e.
  - primaryKey
  Primary key field in the data set, used to help identify whether a record is new or previously harvested.
  - title
  Record field used to populate the document's title field.
  - snippet
  Record field used to populate the document's description field.
  - publishedDate
  Record field used to populate the document's published date field.
- useExtractor
Additional extractor (i.e. other than, or in addition to, the Structured Analysis Harvester) to use to extract entity and event data.
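To make the connection fields more concrete, the sketch below shows one way databaseType, hostname, port and databaseName could be combined into a JDBC URL. It is an illustration only: the class JdbcUrlSketch and the method buildUrl are hypothetical names, and the URL shapes are the ones commonly used by each vendor's JDBC driver, which may differ from what the harvester builds internally.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Infinit.e. */
final class JdbcUrlSketch {

    /**
     * Combines the "database" fields of a source into a JDBC URL.
     * Assumption: conventional URL formats for each vendor's driver.
     */
    static String buildUrl(String databaseType, String hostname,
                           String port, String databaseName) {
        switch (databaseType.toLowerCase(Locale.ROOT)) {
            case "mysql":
                return "jdbc:mysql://" + hostname + ":" + port + "/" + databaseName;
            case "db2":
                return "jdbc:db2://" + hostname + ":" + port + "/" + databaseName;
            case "oracle":
                // Thin driver, SID-style URL
                return "jdbc:oracle:thin:@" + hostname + ":" + port + ":" + databaseName;
            case "mssqlserver":
                return "jdbc:sqlserver://" + hostname + ":" + port
                        + ";databaseName=" + databaseName;
            case "sybase":
                // jConnect-style URL
                return "jdbc:sybase:Tds:" + hostname + ":" + port + "/" + databaseName;
            default:
                throw new IllegalArgumentException(
                        "Unsupported databaseType: " + databaseType);
        }
    }

    public static void main(String[] args) {
        // Values taken from the sample specification above
        System.out.println(
                buildUrl("mysql", "my.databaseserver.com", "3306", "database"));
        // prints: jdbc:mysql://my.databaseserver.com:3306/database
    }
}
```

A URL of this form, together with the username and password from the authentication object, is what a JDBC client would typically pass to java.sql.DriverManager.getConnection(url, user, password).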
Note: A complete example of the above source including a sample database document harvested from the source can be found here: Sample Database Source.
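The relationship between query, deltaQuery and the record-mapping fields (primaryKey, title, snippet, publishedDate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the harvester's actual code: HarvestPassSketch, chooseSql and toDocument are hypothetical names, the output keys ("key", "title", "description", "publishedDate") are placeholders, the row values are made up, and falling back to query when deltaQuery is empty is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; these names are not Infinit.e APIs. */
final class HarvestPassSketch {

    /**
     * Full query on the first harvest, deltaQuery on later ones.
     * Assumption: fall back to the full query if no deltaQuery is set.
     */
    static String chooseSql(String query, String deltaQuery,
                            boolean previouslyHarvested) {
        boolean hasDelta = deltaQuery != null && !deltaQuery.trim().isEmpty();
        return (previouslyHarvested && hasDelta) ? deltaQuery : query;
    }

    /** Copies the configured record fields into a simple document map. */
    static Map<String, Object> toDocument(Map<String, Object> row,
                                          String primaryKey, String title,
                                          String snippet, String publishedDate) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("key", row.get(primaryKey));          // identifies new vs. seen records
        doc.put("title", row.get(title));             // document title
        doc.put("description", row.get(snippet));     // document description
        doc.put("publishedDate", row.get(publishedDate));
        return doc;
    }

    public static void main(String[] args) {
        String query = "SELECT * FROM IncidentReport";
        String deltaQuery = "SELECT * FROM IncidentReport WHERE REPORTDATETIME"
                + " >= (SELECT ADDDATE(CURDATE(),-7))";

        // First harvest runs the full query; later harvests run the delta
        System.out.println(chooseSql(query, deltaQuery, false));
        System.out.println(chooseSql(query, deltaQuery, true));

        // Made-up row using the column names from the sample specification
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("NID", 1001);
        row.put("CCN", "Example report");
        row.put("OFFENSE", "Example offense");
        row.put("REPORTDATETIME", "2012-01-01 00:00:00");
        System.out.println(toDocument(row, "NID", "CCN", "OFFENSE", "REPORTDATETIME"));
    }
}
```

Keeping the two steps separate mirrors the configuration itself: query and deltaQuery decide which rows are fetched, while primaryKey, title, snippet and publishedDate decide how each fetched row becomes a document.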