
The Infinit.e platform has been designed to be easy to install and configure, to run with decent performance on commodity hardware without custom configuration, and to expand by "scaling horizontally" (i.e. adding more compute nodes).

This section describes steps that can be taken to squeeze the most performance out of a cluster (at the expense of a more complex configuration).

Hardware

This page assumes that the user is running more powerful machines - for example 12 cores, 64GB of RAM, and 1 or 2 fast RAID volumes (see below). It is also assumed that the (Elasticsearch) real-time index and the (MongoDB) data store are on different nodes.

The configuration suggested below assumes at least this specification - where more CPU/memory would change the suggested configuration, this is noted.

Disk Configuration

(This section focuses on magnetic disks; SSDs are briefly mentioned at the bottom)

Typically each node has two I/O channels that require high performance:

  • the HDFS directories (if Hadoop is being used for bulk processing)
  • either the Elasticsearch directories or the MongoDB directories (depending on the node type - see above)

By default, Infinit.e uses the presence of specific directory names to decide where to put its data directories:

  • if a directory called "/dbarray" exists, then the MongoDB directories are placed there
  • if a directory called "/raidarray" exists then the Elasticsearch directories are placed there (as well as the MongoDB directories if "/dbarray" does not exist)
    • The Hadoop install steps prompt the user to select the directories to use - if "/raidarray" exists it should be used as the root
  • Otherwise "/mnt" is used
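
The placement rules above can be summarized with a short shell sketch (illustrative only - this is not the actual installer code):

  # Mirror the directory-name fallback logic described above
  if [ -d /dbarray ]; then MONGO_ROOT=/dbarray; fi
  if [ -d /raidarray ]; then ES_ROOT=/raidarray; MONGO_ROOT=${MONGO_ROOT:-/raidarray}; fi
  MONGO_ROOT=${MONGO_ROOT:-/mnt}; ES_ROOT=${ES_ROOT:-/mnt}
  echo "MongoDB data: $MONGO_ROOT, Elasticsearch data: $ES_ROOT"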

Elasticsearch, HDFS, and MongoDB have different recommended settings: 

  • In all cases either RAID-0 or RAID-10 should be used (RAID-10 is safest, although RAID-0 can be used to maximize speed, since Elasticsearch, HDFS, and MongoDB all have redundancy built in at the node level) - see the RAID sketch after this list
  • MongoDB (/dbarray): 
    • From my configuration files (read-ahead size is important to random access performance):
      • blockdev --setra 32 /dev/xvdp; echo 'ACTION=="add", KERNEL=="xvdp", ATTR{bdi/read_ahead_kb}="16"' > /etc/udev/rules.d/85-db.rules;
      • echo '/dev/xvdp  /dbarray      ext4    defaults,noatime        0 2' >> /etc/fstab; mount -a;
  • Elasticsearch (/raidarray):
    • From my RAID configuration script (note that you can't set noatime on the root partition):
      • (the default block size/read-ahead settings are fine)
      • echo "/dev/data_vg/data_vol  $RPM_INSTALL_PREFIX      ${EXT}    defaults,noatime        0 2" >> /etc/fstab
  • (If only one RAID volume is available for both HDFS and MongoDB, go with the MongoDB settings, since MongoDB is likely to be the dominant factor in performance)
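
As a concrete illustration of the RAID advice above, here is a minimal sketch that builds a 4-disk RAID-10 volume for /raidarray. All device names are assumptions - substitute your own; use --level=0 for RAID-0 instead. (The author's own script uses LVM, per the data_vg/data_vol fstab line above; mdadm is shown here as one common alternative.)

  # Create a RAID-10 array from four assumed devices and mount it as /raidarray
  mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
  mkfs.ext4 /dev/md0
  mkdir -p /raidarray
  echo '/dev/md0  /raidarray  ext4  defaults,noatime  0 2' >> /etc/fstab
  mount -a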

We have not tested Infinit.e on SSDs, though both MongoDB and Elasticsearch have been run on them. The general approach to utilizing SSDs is:

  • If you have enough SSD capacity then use it directly as the /raidarray or /dbarray
  • If not then set it up as an additional cache in between memory and disk (see the sketch below)
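
For the second option, one possibility is lvmcache - a sketch only, since Infinit.e does not mandate any particular caching layer. The SSD device name (/dev/nvme0n1) is an assumption, and the data_vg/data_vol names are taken from the RAID script above:

  # Add the SSD to the existing volume group and attach it as a cache
  pvcreate /dev/nvme0n1
  vgextend data_vg /dev/nvme0n1
  lvcreate --type cache-pool -l 100%FREE -n cache_pool data_vg /dev/nvme0n1
  lvconvert --type cache --cachepool data_vg/cache_pool data_vg/data_vol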

Java version

Currently we test against Oracle's JDK6 and JDK7; Oracle JDK8 testing is ongoing. Once tested, JDK8 is expected to be significantly faster, for at least two reasons:

  • It has a new JavaScript engine (Nashorn) that is faster than Rhino (this needs some changes to our harvester, which are ongoing)
  • It has better GC algorithms (these will require some changes to the Elasticsearch configuration)

For now the recommended version is the latest Oracle JDK7.
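
To check which JVM a node is actually running, and to switch between installed JDKs on an RPM-based system (RHEL/CentOS):

  java -version                 # should report an Oracle 1.7.x JVM
  alternatives --config java    # interactively select the Oracle JDK7 entry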

Virtual Memory

It is recommended that there be at least 10GB of swap space - probably 20GB for a node with 64GB of RAM.
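
If the node was provisioned without swap, a 20GB swap file can be added along these lines (the path and size are illustrative):

  # Create, register, and persist a 20GB swap file
  dd if=/dev/zero of=/swapfile bs=1M count=20480
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  echo '/swapfile  none  swap  sw  0 0' >> /etc/fstab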

Configuration file settings

(Relative to the central configuration file described here):

  • TODO

RPM to node distribution

Assuming N API nodes and M DB nodes, the "standard" mapping is:

  • Both:
    • infinit.e-platform.prerequisites* (online or offline)
    • infinit.e-hadoop* (online or offline), infinit.e-config
  • API nodes:
    • infinit.e-index-engine, infinit.e-processing-engine, infinit.e-interface-engine
  • DB nodes:
    • infinit.e-db-instance
    • (from Nov 2014): infinit.e-index-interface
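
Assuming the RPMs are available from a configured yum repository (an assumption - they can also be installed directly with rpm), the mapping above translates to something like:

  # On all nodes (pick the online or offline prerequisites/hadoop RPMs as appropriate):
  yum install 'infinit.e-platform.prerequisites*' 'infinit.e-hadoop*' infinit.e-config
  # On API nodes:
  yum install infinit.e-index-engine infinit.e-processing-engine infinit.e-interface-engine
  # On DB nodes (infinit.e-index-interface from Nov 2014 onwards):
  yum install infinit.e-db-instance infinit.e-index-interface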

To maximize ingest performance, you can also install the infinit.e-processing-engine RPM on the DB nodes. This doubles the number of harvesters. Note that it is then necessary to copy any additional JARs into the DB nodes' plugins/extractors/unbundled directories (see here), just as for the API nodes.
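
For example, to push the extractor JARs from an API node out to each DB node (the host names and install root below are assumptions - adjust to your deployment):

  INFINITE_HOME=/opt/infinit.e.home   # assumption - substitute your actual install root
  for h in db-node-1 db-node-2; do    # hypothetical DB node host names
    scp "$INFINITE_HOME"/plugins/extractors/unbundled/*.jar "$h:$INFINITE_HOME/plugins/extractors/unbundled/"
  done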

Post install configuration

(TODO: note these are RPM %config files, meaning that local edits are preserved rather than silently overwritten when the RPMs are upgraded)

Source JSON configuration

XXX