Record objects

Overview

Records are a new type of object available in Community Edition (CE) versions >= v0.3, provided they are running elasticsearch versions >= 1.0.

The idea is to provide a lighter-weight CE object for smaller structures, where the output of analyses will be term or record volumes or statistics, rather than the higher-level output that is often the case with more document-centric analysis.

CE records are only stored in elasticsearch, our real-time index, and are (by default) only stored for 30 days. Compare this to documents, which are stored until manually deleted, have a defined metadata format into which incoming data is restored, and are synchronized between MongoDB (our persistent data store) and elasticsearch. The result is that records are more limited, but much more efficient in terms of search and aggregation speed for a given hardware footprint.

Their simplicity can be seen in the format below (note this is as retrieved from a query - only the "_source" is actually stored in elasticsearch):

{
	"_index": "string", // The index name in which the record is saved, eg "recs_51a60d9ee4b05fca332279a1" (stashed) or "recs_t_51a60d9ee4b05fca332279a1_2014.04.14" (live)
	"_type": "string", // The type within the above index (normally set manually via logstash), eg "netflow"
	"_id": "string", // the unique key assigned by elasticsearch to each record
	"_source": {
		"@timestamp": "string format:YYYY-MM-DDTHH:mm:SS.sssZ", // Not mandatory but will always be present when harvested from logstash
		"@version": 1,
		"sourceKey": "string", // The Infinit.e source key responsible for ingesting this record
		// Any other fields in the JSON object that is ingested or transformed via logstash. Can be either objects or atomic
	}	
}
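
As an illustration of the format above, the following is a minimal Python sketch that separates the elasticsearch metadata of a retrieved hit from the stored record. The hit values ("example-id", "example.source.key", "bytes") and the helper name parse_hit are made up for this example and are not part of Infinit.e; the only details taken from this page are the field names, the timestamp format and the "recs_t_" prefix used by live indexes.

```python
from datetime import datetime, timezone

# Hypothetical hit shaped like the format above; all values are invented.
hit = {
    "_index": "recs_t_51a60d9ee4b05fca332279a1_2014.04.14",
    "_type": "netflow",
    "_id": "example-id",
    "_source": {
        "@timestamp": "2014-04-14T09:30:15.250Z",
        "@version": 1,
        "sourceKey": "example.source.key",
        "bytes": 1024,
    },
}


def parse_hit(hit):
    """Split a query hit into its metadata and the record that was stored."""
    record = dict(hit["_source"])  # only "_source" is stored in elasticsearch
    raw_ts = record.get("@timestamp")  # not mandatory, so it may be absent
    timestamp = None
    if raw_ts is not None:
        timestamp = datetime.strptime(
            raw_ts, "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
    return {
        "live": hit["_index"].startswith("recs_t_"),  # otherwise stashed
        "type": hit["_type"],
        "id": hit["_id"],
        "timestamp": timestamp,
        "record": record,
    }


parsed = parse_hit(hit)
```

The sketch assumes the timestamp always carries fractional seconds and a trailing "Z", as in the format string above; timestamps in another layout would need a different parser.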

Field Guide

Note that logstash applies the following transforms to certain field names or data types:

  • "@timestamp" - transformed into an elasticsearch date
  • "geoip" - converted to elasticsearch "geo" type
  • each string field, say <fieldname>, generates 2 fields:
    • "<fieldname>" is decomposed ("analyzed" in Lucene parlance) on spaces (" ") and periods (".")
    • "<fieldname>.raw" is left alone
    • Except:
      • "sourceKey" is left alone and has no ".raw" version
  • other field types: left alone
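
One practical consequence of the string handling above is that whole-value matches on a string field should target the ".raw" variant, except for "sourceKey". The following Python sketch builds an elasticsearch term query body along those lines. The helper names exact_match_field and term_query, and the example field "host", are hypothetical; the sketch also assumes the caller already knows the field is a string field, since other field types have no ".raw" version.

```python
def exact_match_field(field_name):
    """Return the field name to use for whole-value matching on a string field."""
    if field_name == "sourceKey":
        # "sourceKey" is left alone and has no ".raw" version
        return field_name
    return field_name + ".raw"


def term_query(field_name, value):
    """Build an elasticsearch term query body for an exact match."""
    return {"query": {"term": {exact_match_field(field_name): value}}}


# Example: match records whose (hypothetical) "host" field is exactly this value
query_body = term_query("host", "web01.example.com")
```

Querying "host" instead of "host.raw" here would match against the individual decomposed tokens rather than the whole value.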

Records are currently stored in one of the following 2 sets of indexes in elasticsearch:

  • "Streaming" records are stored in "recs_t_<community id>_<date>" where date is in the format "YYYY.MM.DD" (and only the most recent 30 days are stored)
    • (on the roadmap: support other time slice periods, eg hours, and support different max storage periods)
  • "Stashed" records are stored in "recs_<community id>" (and are kept forever - note this must be used with care to avoid overfilling a given index)
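
The index naming scheme above can be sketched in Python as follows. The function names are hypothetical, and the 30 day window is taken from the default retention described on this page; the community id is the one used in the example index names earlier.

```python
from datetime import date, timedelta


def stashed_index(community_id):
    """Index holding the stashed records of a community."""
    return "recs_" + community_id


def streaming_index(community_id, day):
    """Daily index holding the streaming records of a community."""
    return "recs_t_{}_{}".format(community_id, day.strftime("%Y.%m.%d"))


def recent_streaming_indexes(community_id, today, days=30):
    """Names of the daily streaming indexes for the last `days` days, newest first."""
    return [
        streaming_index(community_id, today - timedelta(days=offset))
        for offset in range(days)
    ]


community = "51a60d9ee4b05fca332279a1"
indexes = recent_streaming_indexes(community, date(2014, 4, 14))
```

Not every name in the returned list necessarily exists, since a daily index is presumably only present if records were ingested on that day.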

In the future there will be other indexes where certain "built-in" record formats will be stored (eg entity mentions/sentiments).

Currently Infinit.e records can only be ingested via the Logstash harvester, and can only be viewed by the Kibana "Record Analyzer" widget (see info box below). Further integration between records and documents and the source and knowledge APIs is forthcoming.

The elasticsearch proxy used by the Kibana widget ("<ROOT URL>/infinit.e.records/") is currently an open but undocumented interface. It will eventually be productionized and brought into the default Infinit.e API, as "Knowledge - Record", eg "/api/knowledge/record/query"