Infinit.e - top level design

Overview

The purpose of this page is to provide a fairly high level view of Infinit.e's logical and data architecture. This is provided for a few reasons:

Curiosity!
A better technical understanding of how Infinit.e works
A better understanding of the different OSS technologies used in Infinit.e
As a navigational aid for the source code.

All of the diagrams below are thumbnails, click on the zoom button to expand.

Starting from the overview of what Infinit.e does:

The following shows the component level design of Infinit.e:

How the internal and external/commercial and OSS components map onto the logical functions:

On the left, the design roadmap; to the right, where the current release is:

It can be seen the current roadmap plans have nearly been achieved:

For system monitoring only, the commercial Splunk application is still used instead of the OSS "equivalent" Logstash. (This is low priority; the free Splunk version works very nicely).
- Logstash is used to ingest the record objects present in the real-time store (as opposed to documents that are in both the real-time store and the document store)
The Hadoop integration is still quite basic, eg with no explict Mahout support.
Nutch has not been integrated into the harvester.
A graph framework has not yet been implemented - we're still deciding between a few options for technology.
The enrichment framework is currently "proprietary" (though very simple) rather than using UIMA. (Again this is low priority, UIMA doesn't actually seem to give us much in terms of ready made extractor)
Not all the visualization frameworks (Drupal, OWL) have been integrated.

There's a more detailed view of the different data storage layers here.

How this typically maps onto nodes in a logical cluster:

Although any of the components can be run on any LAN-connected nodes, normally clusters consist of:

API nodes: which contains most of the logic, serves API requests, and contains the real-time index.
DB nodes: which contain all the stores and serves DB requests via the API.

How the cluster scales:

An important element to modern analysis frameworks is their ability to scale massively to handle whatever volumes of data are necessary, up to "Internet scale". The following diagram, taken from the EC2 installation guide, illustrates this for Infinit.e (click zoom to expand):

This diagram shows 4 types of nodes:

Load balancers, which round robin user requests to the API nodes.
- DNS round robin can be used as an alternative or a complement to load balancers. An example of a good free load balancer application that can be run on low-end hardware is HA Proxy.
API nodes, which handle user requests, perform some per-query processing, perform custom processing via Hadoop, harvest and enrich new sources, and host the real-time Lucene indexes.
DB nodes, which stores and serves all the data and manages statistics updates generated from the source harvesting.
Config server nodes, which manage sharding (write distribution) across the database nodes.

The have the following distribution characteristics:

Load balancers: don't scale themselves but (a) require very little resource because are very simple, (b) can scale linearly using DNS round robin.
API nodes: elasticsearch scales Lucene reads and writes "horizontally" (in the above diagram), Hadoop also scales horizontally, the harvesters scale by partitioning the sources (via the scalable database), and the per-query processing scales by virtue of the load balancer.
DB nodes: MongoDB writes scale horizontally (shards), and reads scale vertically (replicas). Each API node has a complete list of (horizontal) shards and (vertical) replicas (cached from the config server) so there are no bottlenecks.
Config server nodes: These nodes store primary key ranges and regularly update the caches in the API and DB nodes, so the resource requirements remain approximately constant.

Licensing of different components

For reference, the top level data model:

The following diagram shows the different collections (/tables) used in the database and real-time index.

The different JSON objects used in the above data model are described here.

Of note are the 4 different object types that describe the input and output of typical Infinit.e analyses: documents, features, records and custom tables.

The most fundamental type is the document object, where unstructured data of all types are fused into a unified format ("generic data model") consisting of entities and associations (as well as the retaining the source specific metadata).
The entities and associations mentioned above are aggregated into feature tables. For a quick aggregated look at the data in a given community.
For more numeric/simple data types, the record object can be used - it is much more efficient and has more flexible visualization options but is more limited for more document-centric analyses.
Finally, the results of custom plugins are saved as arbitrary JSON in custom tables. These tables can be used in custom plugin pipelines, to generate documents or records, or simply be used directly in subsequent user analysis.

There's a more detailed view of the different data storage layers here.