Overview

The dataset summarizer is an analytic module accessible via the GUI Utilities - Plugin Manager. It is intended to give analysts a quick understanding of large unfamiliar semi-structured datasets.

It generates a JSON object that describes all the different fields in semi-structured data - their types, example values, statistics, etc.

It can be run on JSON/CSV/XML files stored in HDFS directories, ingested "infinit.e" documents, and the output of other custom jobs (records: coming soon!)

Running the dataset summarizer

TODO

1) select, copy, change title

2) set query

3) configure

4) run

Configuring the dataset summarizer

JSON format: to be copied into the "User argument" text area of the plugin manager.

{
	"numTopValues": integer, // For string fields, the number of "top" (highest frequency) values to display (defaults to 10)
	"numTopValuesOverride": {
		string: integer, // Enables the user to specify custom "top values" for different fields (in dot-notation, except with "." replaced with "%2e", eg "field1%2efield2" not "field1.field2")
	},
	"humanReadable": boolean, // If true (the default), then the JSON is formatted for humans to read easily
	"machineReadable": boolean // If true (default: false) then will output the JSON is a machine-readable format for follow-on processing
}

The dataset summarizer output

The data summarizer generates a the following JSON object

{
	"docCount": integer, // total number of docs processed
	"fields": {
		string: { // the string is the field name, except with "." replaced with "%2e" as described above
			"fieldName": string, // the actual field name
 
			// Human readable output:
			"frequencyInfo": string, // Frequency information about the field, see below for format
			"typeInfo": [
				string, // Statistics on the different types of the field, see below for format
			],
			"sampleStringData": [
				string, // For string values, the "N" most common values ("N" taken from the configuration specified above, default 10)
			],
			"numericDataStats": string, // For numeric fields, some simple statistics, see below for format
 
			// Machine readable output:
 
			"count": integer, // (see human readable format, frequencyInfo.field_cnt)
			"objectCount": integer, // (see human readable format, frequencyInfo.object_cnt) 
			"pctInParent": number, // (see human readable format, frequencyInfo.pct_parent)
			"pctInTotal": number, // (see human readable format, frequencyInfo.pct_total)
			"exampleValues": [
				"value": string, // One of the top occurring values (string fields only)
				"count": integer, // The number of times it occurs
			],
			"minValue": number, // For numeric values, the smallest value seen
			"maxValue": number, // For numeric values, the largest value seen
			"avgValue": number, // For numeric values, the mean value seen
			"numberCount": integer, // The total number of samples (in theory - all the times the field has a numeric value) used in the above statistics
 
			"typePcts": {
				string: number, // the string is the type name one of (object, array, string, text, number_float, number_int, bool)
								// the number is the % of the time the field is that type
			},
			"typeCounts": {
				string: integer, // as above, but count instead of %
			}
		}
	}
}

The human readable version has the following formats:

frequencyInfo: "field_cnt=%1 (obj_cnt= $2% parent_cnt=$3%); pct_parent=$4, pct_total=%5"
- field_cnt is the total number of instances of the field
- obj_cnt is the number of types that the field contains an object (or array of objects) instead of a primitive value - discarded if == field_cnt
- parent_cnt is the number of parent instances in which this field occurs (can be different to field_cnt it the parent_cnt is an array)
- pct_parent is the % of parent fields in which this field occurs
- pct_total is the % of objects (ie docCount) in which this field occurs
typeInfo: "<type> (<total_cnt>, <total_pct>%)"
- type is one of: object, array, string, text, number_float, number_int, bool
- total_cnt, total_pct: are the count and percentage of that type for that field
  - Note that types can overlap, eg [ 1, 2, "3" ] would be 3x array, 3x numeric, 1x string
sampleStringData: "<value> (<count>)"
numericDataStats: "min=$1 max=$2 avg=$3"
- min is the smallest value seen
- max is the largest value seen
- avg is the mean value seen

Dataset summarizer

Overview

Running the dataset summarizer

Configuring the dataset summarizer

The dataset summarizer output