Code Block
{
	"docCount": integer, // total number of docs processed
	"fields": {
		string: { // the string is the field name, except with "." replaced with "%2e" as described above
			"fieldName": string, // the actual field name
 
			// Human readable output:
			"frequencyInfo": string, // Frequency information about the field, see below for format
			"typeInfo": [
				string, // Statistics on the different types of the field, see below for format
			],
			"sampleStringData": [
				string, // For string values, the "N" most common values ("N" taken from the configuration specified above, default 10)
			],
			"numericDataStats": string, // For numeric fields, some simple statistics, see below for format
 
			// Machine readable output:
 
			"count": integer, // (see human readable format, frequencyInfo.field_cnt)
			"objectCount": integer, // (see human readable format, frequencyInfo.object_cnt) 
			"pctInParent": number, // (see human readable format, frequencyInfo.pct_parent)
			"pctInTotal": number, // (see human readable format, frequencyInfo.pct_total)
			"exampleValues": [
				{
					"value": string, // One of the top occurring values (string fields only)
					"count": integer, // The number of times it occurs
				}
			],
			"minValue": number, // For numeric values, the smallest value seen
			"maxValue": number, // For numeric values, the largest value seen
			"avgValue": number, // For numeric values, the mean value seen
			"numberCount": integer, // The total number of samples (in theory - all the times the field has a numeric value) used in the above statistics
 
			"typePcts": {
				string: number, // the string is the type name, one of (object, array, string, text, number_float, number_int, bool)
								// the number is the % of the time the field is that type
			},
			"typeCounts": {
				string: integer, // as above, but count instead of %
			}
		}
	}
}
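
As a rough illustration of how the machine readable output might be consumed, the Python sketch below looks up one field by its escaped key. The helper names (field_key, dominant_type), the field name "user.age" and all values in the sample result are hypothetical and exist only for this example. It also assumes that "." is the only character escaped in the key.

```python
def field_key(field_name: str) -> str:
    """Build the key used under "fields": "." is replaced with "%2e".

    Assumption: no other characters are escaped.
    """
    return field_name.replace(".", "%2e")


def dominant_type(field: dict) -> str:
    """Return the type name with the highest percentage in "typePcts"."""
    type_pcts = field["typePcts"]
    return max(type_pcts, key=type_pcts.get)


# Hypothetical result shaped like the format above (illustrative values only).
result = {
    "docCount": 5,
    "fields": {
        "user%2eage": {
            "fieldName": "user.age",
            "count": 5,
            "pctInParent": 100.0,
            "pctInTotal": 100.0,
            "numberCount": 5,
            "typePcts": {"number_int": 80.0, "string": 20.0},
            "typeCounts": {"number_int": 4, "string": 1},
        }
    },
}

stats = result["fields"][field_key("user.age")]
print(stats["fieldName"], dominant_type(stats))
```

Keeping the unescaped name in "fieldName" means a consumer never has to reverse the escaping; the escaped key is only needed for the lookup itself.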

TODO human readable vs machine readable

The human readable version has the following formats:

  • frequencyInfo: "field_cnt=$1 (obj_cnt=$2 parent_cnt=$3); pct_parent=$4%, pct_total=$5%"
    • field_cnt is the total number of instances of the field
    • obj_cnt is the number of times that the field contains an object (or array of objects) instead of a primitive value; it is omitted if equal to field_cnt
    • parent_cnt is the number of parent instances in which this field occurs (can differ from field_cnt if the parent is an array)
    • pct_parent is the % of parent fields in which this field occurs
    • pct_total is the % of objects (i.e. docCount) in which this field occurs
  • typeInfo: "<type> (<total_cnt>, <total_pct>%)"
    • type is one of: object, array, string, text, number_float, number_int, bool
    • total_cnt and total_pct are the count and percentage of that type for that field
      • Note that types can overlap, e.g. [ 1, 2, "3" ] would be 3x array, 3x numeric, 1x string
  • sampleStringData: "<value> (<count>)"
  • numericDataStats: "min=$1 max=$2 avg=$3"
    • min is the smallest value seen
    • max is the largest value seen
    • avg is the mean value seen
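
If only the human readable strings are available, they can be parsed back into numbers. The Python sketch below is one possible approach, not part of the tool itself: the function names and regular expressions are assumptions, it treats obj_cnt as optional, and it accepts the percentages with or without a trailing "%". The sample strings are "field_cnt=5 (parent_cnt=5); pct_parent=100.0%, pct_total=100.0%" and "min=25.0 max=411.0 avg=164.8".

```python
import re

# Assumed layout: "field_cnt=N (obj_cnt=N parent_cnt=N); pct_parent=P%, pct_total=P%"
# obj_cnt may be absent; the "%" after each percentage is treated as optional.
FREQUENCY_RE = re.compile(
    r"field_cnt=(?P<field_cnt>\d+)\s*"
    r"\((?:obj_cnt=(?P<obj_cnt>\d+)\s+)?parent_cnt=(?P<parent_cnt>\d+)\);\s*"
    r"pct_parent=(?P<pct_parent>[\d.]+)%?,\s*"
    r"pct_total=(?P<pct_total>[\d.]+)%?"
)

# Assumed layout: "min=X max=Y avg=Z"
NUMERIC_RE = re.compile(r"min=(?P<min>\S+)\s+max=(?P<max>\S+)\s+avg=(?P<avg>\S+)")


def parse_frequency_info(text: str) -> dict:
    """Parse a frequencyInfo string; obj_cnt is None when it is not present."""
    match = FREQUENCY_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised frequencyInfo: {text!r}")
    groups = match.groupdict()
    return {
        "field_cnt": int(groups["field_cnt"]),
        "obj_cnt": int(groups["obj_cnt"]) if groups["obj_cnt"] is not None else None,
        "parent_cnt": int(groups["parent_cnt"]),
        "pct_parent": float(groups["pct_parent"]),
        "pct_total": float(groups["pct_total"]),
    }


def parse_numeric_stats(text: str) -> dict:
    """Parse a numericDataStats string into min, max and avg floats."""
    match = NUMERIC_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised numericDataStats: {text!r}")
    return {name: float(value) for name, value in match.groupdict().items()}


freq = parse_frequency_info(
    "field_cnt=5 (parent_cnt=5); pct_parent=100.0%, pct_total=100.0%"
)
nums = parse_numeric_stats("min=25.0 max=411.0 avg=164.8")
print(freq)
print(nums)
```

Where possible, the machine readable fields (count, objectCount, pctInParent, pctInTotal, minValue, maxValue, avgValue) are the more robust source for these values, since they avoid string parsing altogether.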