Using the Database Harvester

There is a separate reference page for the Database Harvester configuration object.

Infinit.e supports harvesting data from traditional RDBMS (Relational Database Management Systems) using JDBC (Java Database Connectivity) drivers. The Sample Database Harvester Specification below demonstrates how to connect to and extract data from a database using the harvester:

Sample Database Harvester Specification
source : {
   ... 
   "extractType" : "Database",
   "authentication" : {
       "username" : "username", 
       "password" : "password"}, 
   "database" : {
       "databaseType" : "mysql",
       "hostname" : "my.databaseserver.com",
       "port" : "3306",
       "databaseName" : "database",
       "query" : "SELECT * FROM IncidentReport", 
       "deltaQuery" : "SELECT * FROM IncidentReport WHERE REPORTDATETIME >= (SELECT ADDDATE(CURDATE(),-7))",
       "deleteQuery" : "",
       "primaryKey" : "NID",
       "title" : "CCN",
       "snippet" : "OFFENSE",
       "publishedDate" : "REPORTDATETIME"
   }, 
   "useExtractor" : "none",
   ...
}
  • extractType
    The extractType field tells the harvester which type of source to extract from, in this case Database. Other valid values include File, Feed, etc.
  • authentication
    The Authentication object of the Source document is a subset of the full Authentication object in that it only uses the 'username' and 'password' fields. The Database Harvester uses the username and password from the Authentication object as database credentials (if needed).
  • database
    The Database object specifies how to access the data to be extracted and how to extract the individual fields from the source's data records.
    • databaseType
      The type of RDBMS to connect to. Valid values currently include: mysql, db2, oracle, mssqlserver, sybase.
    • hostname
      The hostname of the database server to connect to, e.g. "my.databaseserver.com" in the example above.
    • port
      The port that the database accepts incoming connections on.
    • databaseName
      The name of the database to connect to.
    • query
      The query field specifies the SQL used to perform a full extraction of data for the source. It is generally used the first time the harvester extracts data from a source; subsequent incremental extractions are specified using the deltaQuery field below.
    • deltaQuery
      The deltaQuery field specifies the SQL that extracts data from the source RDBMS based on one or more delta values, e.g. the created or modified date of a record.
    • deleteQuery
      Note: The deleteQuery functionality of the Database Harvester is not implemented in the Beta version of Infinit.e.
    • primaryKey
      Primary key field in data set, used to help identify whether a record is new or previously harvested.
    • title
      Record field used to populate the document's title field.
    • snippet
      Record field used to populate the document's description field.
    • publishedDate
      Record field used to populate the document's published date field.
  • useExtractor
    Additional extractor (i.e. other than, or in addition to, the Structured Analysis Harvester) to use to extract entity and event data.
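
As a rough illustration of how the connection and query fields described above fit together, the following Python sketch builds a JDBC-style connection URL from the database object and picks between query and deltaQuery. The helper names (build_jdbc_url, select_query) are hypothetical and are not part of Infinit.e. The sketch assumes the conventional MySQL JDBC URL layout and assumes that an empty deltaQuery falls back to query; the harvester's actual behaviour may differ.

```python
# Hypothetical helpers for illustration only; these are not Infinit.e internals.

def build_jdbc_url(database):
    """Build a JDBC-style URL from the "database" object of a source.

    Only the conventional MySQL layout is sketched here. Other databaseType
    values (db2, oracle, mssqlserver, sybase) need their own
    driver-specific URL format.
    """
    if database["databaseType"] != "mysql":
        raise ValueError("URL format not sketched for: " + database["databaseType"])
    # str.format ignores the extra keys (query, deltaQuery, ...) in the dict.
    return "jdbc:mysql://{hostname}:{port}/{databaseName}".format(**database)


def select_query(database, previously_harvested):
    """Return the full query on a first harvest, otherwise the deltaQuery.

    Assumption: if no deltaQuery is configured, fall back to the full query.
    """
    delta = database.get("deltaQuery", "")
    if previously_harvested and delta:
        return delta
    return database["query"]


# Values copied from the sample specification above.
database = {
    "databaseType": "mysql",
    "hostname": "my.databaseserver.com",
    "port": "3306",
    "databaseName": "database",
    "query": "SELECT * FROM IncidentReport",
    "deltaQuery": "SELECT * FROM IncidentReport WHERE REPORTDATETIME >= "
                  "(SELECT ADDDATE(CURDATE(),-7))",
}

print(build_jdbc_url(database))
print(select_query(database, previously_harvested=False))
print(select_query(database, previously_harvested=True))
```

With the sample values, the first harvest would run the full query, and later harvests would run the deltaQuery, which only selects reports from the last seven days.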

Note: A complete example of the above source including a sample database document harvested from the source can be found here: Sample Database Source.
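
To make the roles of primaryKey, title, snippet, and publishedDate more concrete, here is a minimal Python sketch of how a result row might be turned into a document. Everything in it is an assumption made for illustration: the function names, the document key names (title, description, publishedDate), the sample rows, and the choice to simply skip records whose primary key has been seen before. The harvester's real logic is not described on this page.

```python
# Hypothetical sketch; function names, document keys and sample rows are invented.

def record_to_document(record, database):
    """Map one result row onto document fields using the configured column names."""
    return {
        "title": str(record[database["title"]]),
        "description": str(record[database["snippet"]]),
        "publishedDate": record[database["publishedDate"]],
    }


def harvest(records, database, seen_keys):
    """Return documents for rows whose primary key is not yet in seen_keys.

    seen_keys is updated in place so that a later call skips the same rows.
    """
    new_docs = []
    for record in records:
        key = record[database["primaryKey"]]
        if key in seen_keys:
            continue  # previously harvested (simplification: no updates)
        seen_keys.add(key)
        new_docs.append(record_to_document(record, database))
    return new_docs


# Field mappings taken from the sample specification above.
database = {
    "primaryKey": "NID",
    "title": "CCN",
    "snippet": "OFFENSE",
    "publishedDate": "REPORTDATETIME",
}

# Invented rows standing in for the result of the configured query.
rows = [
    {"NID": 1, "CCN": "A-100", "OFFENSE": "THEFT", "REPORTDATETIME": "2011-01-01 10:00:00"},
    {"NID": 2, "CCN": "A-101", "OFFENSE": "BURGLARY", "REPORTDATETIME": "2011-01-02 11:30:00"},
]

seen = set()
first_pass = harvest(rows, database, seen)
second_pass = harvest(rows, database, seen)
print(len(first_pass), len(second_pass))
```

Running the same rows through harvest twice yields documents only on the first pass, which mirrors the idea that the primary key identifies records that were already harvested.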