Overview of Documents and Aggregations

Overview

The Community Edition platform distinguishes three sets of documents:

  • Matching documents
  • Top documents
  • Filtered documents

A query often matches a large number of documents, particularly a common query such as "obama". These are called the matching documents. Apart from the top documents (see below), they are not directly available to the widgets.

However, there are normally too many matching documents for a person to analyze directly (see "Aggregations" below). As a result, the matching documents are ranked according to a configurable scoring method, and only the highest-ranked subset is returned directly to the GUI. These are the top documents; by default, 100 are returned.

The platform allows these top documents to be filtered further within the GUI framework, e.g. to those containing a specific set of entities (for example, by clicking on one of the bars in the graph in the "Significance" widget). This subset is called the filtered documents.
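The relationship between the three document sets can be illustrated with a minimal Python sketch. It is not the platform's implementation: the Doc class, the score field, and the helper names top_documents and filtered_documents are hypothetical, and the sketch assumes that "containing a specific set of entities" means a document must contain every selected entity.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    title: str
    score: float                      # hypothetical relevance score
    entities: set = field(default_factory=set)


def top_documents(matching, limit=100):
    """Rank the matching documents by score and keep the best `limit`."""
    return sorted(matching, key=lambda d: d.score, reverse=True)[:limit]


def filtered_documents(top, required_entities):
    """Keep only the top documents that contain every required entity."""
    required = set(required_entities)
    return [d for d in top if required <= d.entities]


# Illustrative data only: three documents matching the query "obama".
matching = [
    Doc("a", 0.9, {"obama", "white house"}),
    Doc("b", 0.4, {"obama"}),
    Doc("c", 0.7, {"obama", "congress"}),
]
top = top_documents(matching, limit=2)            # ranked subset sent to the GUI
filtered = filtered_documents(top, {"congress"})  # narrowed further inside the GUI
```

Each set is a subset of the previous one: filtering operates only on the top documents, never on the full set of matching documents.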

Aggregations

All the matching documents can contribute to the "knowledge" that a query provides, and the documents themselves are not the only objects returned from a query. In addition, information relevant to the analysis is summed, averaged, etc. ("aggregated") across all matching documents; the results are referred to as "aggregations". Examples include:

Note that aggregations are sometimes ranked by frequency and sometimes summed by significance. This behavior will be made more consistent in future versions of the tool.

Finally, note that the idea of an aggregation is valid across all three of the document sets described above (matching, top, filtered).
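As a rough illustration of the two aggregation styles mentioned above, the following Python sketch aggregates entities over an arbitrary list of documents, which could equally be the matching, top, or filtered set. The document layout (a dict whose "entities" key maps entity names to per-document significance values) and the function names are assumptions made for this example, not the platform's actual data model or API.

```python
from collections import Counter, defaultdict


def entity_frequency(docs):
    """Rank entities by the number of documents they appear in."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["entities"].keys())
    return counts.most_common()


def entity_significance(docs):
    """Sum each entity's per-document significance, highest total first."""
    totals = defaultdict(float)
    for doc in docs:
        for name, significance in doc["entities"].items():
            totals[name] += significance
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data only: any of the three document sets can be passed in.
docs = [
    {"entities": {"obama": 0.5, "congress": 0.25}},
    {"entities": {"obama": 0.25}},
    {"entities": {"senate": 1.0}},
]
by_frequency = entity_frequency(docs)
by_significance = entity_significance(docs)
```

With this sample data the two orderings disagree: "obama" appears in the most documents, but "senate" has the highest summed significance, which is why it matters which aggregation style a given widget uses.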