Cluster configuration to optimize performance

The Infinit.e platform has been designed to be easy to install and configure, run with decent performance on commodity hardware without custom configuration, and expand by "scaling horizontally" (i.e. adding more compute nodes).

This section describes steps that can be taken to squeeze the most performance out of a cluster (at the expense of a more complex configuration).

Hardware

This page assumes that the user is running more powerful machines - for example 12 cores and 64GB of RAM, with 1 or 2 fast RAID volumes (see below). It is also assumed that the (Elasticsearch) real-time index and the (MongoDB) data store are on different nodes.

The configuration suggested below assumes at least this specification; where more CPU or memory would affect the suggested configuration, this is noted.

Disk Configuration

(This section focuses on magnetic disks; SSD is briefly mentioned at the bottom.)

Typically each node has 2 IO channels that require performance:

  • the HDFS directories (if Hadoop is being used for bulk processing)
  • Either:
    • The Elasticsearch directories
    • or
    • The MongoDB directories

Infinit.e has two default ways in which it uses directory names to decide where to put data directories:

  • if a directory called "/dbarray" exists, then the MongoDB directories are placed there
  • if a directory called "/raidarray" exists then the Elasticsearch directories are placed there (as well as the MongoDB directories if "/dbarray" does not exist)
    • The Hadoop install steps prompt the user to select the directories to use - this directory should be used as the root if it exists
  • Otherwise "/mnt" is used
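The directory rules above can be sketched as a small lookup. This is only an illustration of the precedence described in the list, not the installer's actual code; the function name and the returned keys are hypothetical.

```python
def choose_data_roots(existing_dirs):
    """Return the data root for each store, given the directories that exist.

    Hypothetical helper: it only mirrors the precedence described in the text.
    """
    existing = set(existing_dirs)

    # Elasticsearch: /raidarray if present, otherwise /mnt.
    es_root = "/raidarray" if "/raidarray" in existing else "/mnt"

    # MongoDB: /dbarray first, then /raidarray, then /mnt.
    if "/dbarray" in existing:
        db_root = "/dbarray"
    elif "/raidarray" in existing:
        db_root = "/raidarray"
    else:
        db_root = "/mnt"

    return {"mongodb": db_root, "elasticsearch": es_root}


if __name__ == "__main__":
    print(choose_data_roots(["/raidarray"]))
```

With only "/raidarray" present, both stores share that volume, which is the case where the MongoDB disk settings are the ones to apply.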

Elasticsearch, HDFS and MongoDB have different recommended settings:

  • In all cases either RAID-0 or RAID-10 should be used (RAID-10 is safest, although to maximize speed RAID-0 can be used since Elasticsearch, HDFS, and MongoDB all have redundancy built-in at the node level)
  • MongoDB (/dbarray): 
    • From my configuration files (block size is important to random access performance)
      • blockdev --setra 32 /dev/xvdp; echo 'ACTION=="add", KERNEL=="xvdp", ATTR{bdi/read_ahead_kb}="16"' > /etc/udev/rules.d/85-db.rules;
      • echo '/dev/xvdp  /dbarray      ext4    defaults,noatime        0 2' >> /etc/fstab; mount -a;
  • Elasticsearch (/raidarray):
    • From my RAID configuration script (note you can't set noatime on the root partition)
      • (default blocksize/readahead etc is fine)
      • echo "/dev/data_vg/data_vol  $RPM_INSTALL_PREFIX      ${EXT}    defaults,noatime        0 2" >> /etc/fstab
  • (If only one RAID volume is available for both HDFS and MongoDB then go with the MongoDB settings since they are likely to be the dominating factor in performance)
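The two read-ahead values in the MongoDB commands above are the same setting in different units: "blockdev --setra" counts 512-byte sectors, while "read_ahead_kb" is in kilobytes. A minimal sketch of the conversion (the helper names are hypothetical):

```python
SECTOR_BYTES = 512  # blockdev --setra is expressed in 512-byte sectors


def setra_sectors_to_kb(sectors):
    """Convert a 'blockdev --setra' value to the equivalent read_ahead_kb value."""
    return sectors * SECTOR_BYTES // 1024


def kb_to_setra_sectors(kb):
    """Convert a read_ahead_kb value to the equivalent 'blockdev --setra' value."""
    return kb * 1024 // SECTOR_BYTES


if __name__ == "__main__":
    # The MongoDB volume above uses --setra 32 alongside read_ahead_kb=16.
    print(setra_sectors_to_kb(32))
```

If a different read-ahead is chosen, both the blockdev call and the udev rule need to change together so that the setting survives a reboot.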

We have not tested Infinit.e using SSD, though both MongoDB and Elasticsearch have been used with SSD. The general approach to utilizing SSD is:

  • If you have enough SSD then use it as the /raidarray or /dbarray
  • If not then set it up as an additional cache in between memory and disk

Java version

Currently the platform is tested against Oracle's JDK6 and JDK7. Oracle JDK8 testing is ongoing. Once tested, the platform is expected to run significantly faster on JDK8, for at least two reasons:

  • It has a new JS engine that is faster than Rhino (needs some changes to our platform, which are ongoing)
  • Better GC algorithms (will require some changes to the Elasticsearch configuration)

For now the recommended version is the latest Oracle JDK7.

Virtual Memory

It is recommended that there be SWAP space equal to at least 10GB - probably 20GB for 60GB of RAM.

Configuration file settings

(Relative to the central configuration file described here):

  • Increase the performance of the harvester at the expense of memory usage (harvest memory available setting is changed under "Post Install Settings" below):
    • "harvest.threads" (default: "file:5,database:5,feed:20") - set to "file:5,database:1,feed:10" for 8 cores, "file:10" for 16 cores etc.
    • "harvest.maxdocs_persource" (default "1000") - set to 2500 
    • "harvest.aggregation.duty_cycle" - set to 0.25 (slower entity aggregation allowing faster bulk import)
  • Data store limit:
    • "db.capacity" (default "100000000") - set to 500000000 (ie 500M) to be on the safe side
  • API configuration:
    • "api.aggregation.accuracy" - ensure set to "low"
  • Elasticsearch configuration:
    • elastic.search.min_peers - set to 2 (provides better split brain protection)
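As an illustration, the settings above can be collected in one place and the "harvest.threads" value unpacked into per-type thread counts. This sketch assumes the central configuration file uses "key=value" lines (as the "db.replica.sets" example elsewhere on this page suggests); the helper names are hypothetical and the thread values are the 8-core ones.

```python
# Overrides suggested in the list above (8-core harvest.threads value).
RECOMMENDED = {
    "harvest.threads": "file:5,database:1,feed:10",
    "harvest.maxdocs_persource": "2500",
    "harvest.aggregation.duty_cycle": "0.25",
    "db.capacity": "500000000",
    "api.aggregation.accuracy": "low",
    "elastic.search.min_peers": "2",
}


def parse_harvest_threads(value):
    """Split 'file:5,database:1,feed:10' into {'file': 5, 'database': 1, 'feed': 10}."""
    threads = {}
    for item in value.split(","):
        kind, count = item.strip().split(":")
        threads[kind] = int(count)
    return threads


def render_properties(settings):
    """Render the settings as key=value lines (assumed file syntax)."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_properties(RECOMMENDED))
    print(parse_harvest_threads(RECOMMENDED["harvest.threads"]))
```

The number of file threads returned here is the figure that drives the harvester memory and distribution factor calculations later on this page.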

Other configuration notes:

  • It is now possible to configure MongoDB clusters using hostnames instead of IP addresses:
    • if "db.config.servers" is left blank, then the short hostname is used (e.g. "node-api-1-1.ikanow.com" -> "node-api-1-1")
    • if "db.config.servers" is set to "FQDN" then the full hostname is used
  • Where possible the config nodes should be separate machines, and 3 should always be used. If there are no additional nodes available, then they should be put on the API nodes.

Configuration - database configuration

One "trick" MongoDB have suggested in the past to speed up write performance is to run multiple instances of MongoDB shards on a single node (this comes at the price of a slight degradation in read performance for reads that aren't targeted based on their shard key).

We are currently increasing the ways in which we take advantage of this performance-wise - and shards can be added in this way dynamically in any case.

For the moment it is recommended to add 2 shards to each node. This can be done very easily using the "db.replica.sets" parameter, e.g. if you have 4 hosts with hostnames A, B, C and D, then the configuration would look like:

  • db.replica.sets=A,B;A,B;C,D;C,D

(i.e. (A,B) and (C,D) pairwise contain the same data, (A,B) host data from shards 1 and 2, and (C,D) host data from shards 3 and 4)
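To make the layout explicit, the following sketch reads a "db.replica.sets" value the way the explanation above does: semicolons separate shards and commas separate the replica hosts within a shard. The function names are hypothetical, and numbering the shards from 1 in order of appearance is taken from the example rather than from the platform code.

```python
def parse_replica_sets(value):
    """Map shard number (1-based, in order of appearance) to its replica hosts."""
    return {
        shard: group.split(",")
        for shard, group in enumerate(value.split(";"), start=1)
    }


def shards_per_host(value):
    """Invert the mapping: which shards does each host carry?"""
    hosts = {}
    for shard, members in parse_replica_sets(value).items():
        for host in members:
            hosts.setdefault(host, []).append(shard)
    return hosts


if __name__ == "__main__":
    print(shards_per_host("A,B;A,B;C,D;C,D"))
```

Checking that every host carries exactly 2 shards is a quick way to confirm that a longer host list follows the recommendation.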

RPM to node distribution

Assuming Nx API nodes and Mx DB nodes, the "standard" mapping is:

  • Both:
    • infinit.e-platform.prerequisites* (online or offline)
    • infinit.e-hadoop* (online or offline), infinit.e-config
  • API nodes:
    • infinit.e-index-engine, infinit.e-processing-engine, infinit.e-interface-engine
  • DB nodes:
    • infinit.e-db-instance
    • (from Nov 2014): infinit.e-index-interface

To maximize the ingest performance, you can also install the infinit.e-processing-engine RPM on the DB nodes. This doubles the number of harvesters. Note that it is necessary to copy any additional JARs into the DB nodes' plugins/extractors/unbundled directories (see here), just like for the API nodes.

Hadoop configuration

In the Hadoop configuration guide, the following settings are recommended:

  • mappers ("Maximum Number of Simultaneous Map Tasks"): 2
  • reducers ("Maximum Number of Simultaneous Reduce Tasks"): 1

For an 8 core system these should be set to 4 and 2 respectively (for a 16 core system, 8 and 4, etc).
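The 8-core and 16-core figures suggest a simple rule of one map slot per 2 cores and one reduce slot per 4 cores. The sketch below encodes that reading; the formula is an extrapolation from the two data points given, and the floor at the guide's values (2 and 1) is an assumption.

```python
def hadoop_task_slots(cores):
    """Return (max simultaneous map tasks, max simultaneous reduce tasks).

    Extrapolated from the text: 8 cores -> (4, 2), 16 cores -> (8, 4).
    Never goes below the Hadoop configuration guide's values of (2, 1).
    """
    mappers = max(2, cores // 2)
    reducers = max(1, cores // 4)
    return mappers, reducers


if __name__ == "__main__":
    for cores in (4, 8, 12, 16):
        print(cores, hadoop_task_slots(cores))
```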

Post install configuration

This section describes changes that are made to the processes' configuration files after RPMs have been installed.

The files described here are designated as RPM configuration files, meaning that they will only be overwritten if the RPM-side version of the file is updated (in which case the "old" version is saved to "<filename>.rpmsave"). Care must therefore be taken while updating RPMs to note if this happens, in which case the user must merge the files by hand.

  • /etc/sysconfig/tomcat6-interface-engine:
    • Find the line:
      • JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx1024m -Xmn256m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && JAVA_OPTS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m"

    • Change the second JAVA_OPTS clause to:

      • JAVA_OPTS="$JAVA_OPTS -Xms4096m -Xmx4096m -Xmn1024m"
        • (Note: don't scale this with additional memory - 4GB should be sufficient)
    • (After changing restart the corresponding service with: "service tomcat6-interface-engine restart")
  • /etc/sysconfig/infinite-index-engine:
    • Find the line:
      • export JAVA_OPTS="-Xms2048m -Xmx2048m -Xmn512m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && JAVA_OPTS="-Xms7656m -Xmx7656m -Xmn2048m"

    • Change the second JAVA_OPTS clause to:
      • JAVA_OPTS="-Xms25g -Xmx25g -Xmn5g"
        • (this is for 60GB - do scale linearly with memory)
    • (After changing restart the corresponding service with: "service infinite-index-engine restart")
  • /opt/infinite-home/bin/infinite-px-engine.sh
    • Find the line:
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m -Dfile.encoding=UTF-8 [...]
    • Change the start to:
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms10g -Xmx10g -Xmn2048m -Dfile.encoding=UTF-8 [...] 
        • (Don't scale this with memory - ideally you want ~2GB per file thread (assuming 2.5K docs/cycle), but not more than total_memory - 2*elasticsearch_memory!)
    • (After changing restart the corresponding service with: "service infinite-px-engine restart")
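The sizing notes in the list above can be combined into a rough calculator. This is a sketch under the stated rules only (index engine heap scaled linearly from 25GB at 60GB of RAM, interface engine fixed at 4GB, harvester at ~2GB per file thread but capped at total memory minus twice the Elasticsearch heap); the function name and return keys are hypothetical.

```python
def suggested_heaps_gb(total_memory_gb, file_threads=5):
    """Return suggested JVM heap sizes in GB for the three services."""
    # Index engine (Elasticsearch): 25g on a 60GB node, scaled linearly.
    index_heap = 25 * total_memory_gb / 60

    # Interface engine (tomcat): fixed, not scaled with memory.
    interface_heap = 4

    # Harvester (px engine): ~2GB per file thread, but no more than
    # total_memory - 2 * elasticsearch_memory.
    px_wanted = 2 * file_threads
    px_ceiling = total_memory_gb - 2 * index_heap
    px_heap = min(px_wanted, px_ceiling)

    return {
        "index_engine": index_heap,
        "interface_engine": interface_heap,
        "px_engine": px_heap,
    }


if __name__ == "__main__":
    print(suggested_heaps_gb(60, file_threads=5))
```

For a 60GB node with 5 file threads this reproduces the 25g and 10g values shown above. Note that with these rules the harvester ceiling only grows if total memory grows, so adding file threads on a 60GB node does not justify a larger harvester heap.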

Shared filesystem configuration

Currently we do not take advantage of HDFS for file extraction - this is coming soon.

In the meantime to provide a shared filesystem, there are a few options:

  • Set up a Samba share on one of the servers (or ideally a separate fileserver), and use the file extractor NetBIOS interface
  • Set up an NFS share on one of the servers (or ideally a separate fileserver), mount it on each of the harvest nodes, and use the file extractor local file interface
  • Use FUSE on each harvest node to provide a regular filesystem interface to HDFS (this is unproven: the one time I tried, in a slightly uncontrolled environment, FUSE stopped working after a day or so)

It is not known which is best from a performance standpoint - the second (NFS) is recommended for now (Samba would be preferred but see below).

There is currently an issue with multi-threading in the NetBIOS interface. As a result only one thread can perform all the file operations (including slow bulk operations like de-duplication), which makes the Samba method perform very poorly when multi-threaded. For the moment, the Samba method is not recommended.

UPDATE (11 Nov): there is a fix in the trunk that will be pushed out in the Nov 2014 release. With the fix in place, the Samba share method is preferred again.

Source JSON configuration

Extractor type

The file extractor is the most optimized one, so wherever possible that should be used.

Deduplication

The fastest configuration for reading in files is as follows:

  • set "file.mode" to "streaming"
  • set "file.renameAfterParse" to "."

This will delete the files in the input directory as they are processed. If you want to preserve the files, copy (rather than move) them into the input directory.

(One alternative is to have an "archive" sub-directory of each input directory and then set "file.renameAfterParse" to "$path/archive/$name" and then set "file.pathExclude" to ".*/archive/.*")
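As a sketch, the two variants could be written as the following source fragments. It is an assumption here that the dotted names ("file.mode" etc.) correspond to fields of a nested "file" object in the source JSON; the function name is hypothetical, and only the field values are taken from the text.

```python
import json
import re


def file_block(archive=False):
    """Build the 'file' part of a source definition (assumed nesting)."""
    block = {"mode": "streaming"}
    if archive:
        # Move processed files into an archive sub-directory and
        # stop the harvester from re-reading them from there.
        block["renameAfterParse"] = "$path/archive/$name"
        block["pathExclude"] = ".*/archive/.*"
    else:
        # "." means the file is deleted once processed.
        block["renameAfterParse"] = "."
    return {"file": block}


if __name__ == "__main__":
    print(json.dumps(file_block(), indent=2))
    print(json.dumps(file_block(archive=True), indent=2))

    # Quick check of the exclusion pattern against two example paths.
    pattern = file_block(archive=True)["file"]["pathExclude"]
    print(bool(re.fullmatch(pattern, "/data/in/archive/a.txt")))
    print(bool(re.fullmatch(pattern, "/data/in/a.txt")))
```

The example paths are made up; the point is that the exclusion pattern matches anything below an "archive" directory and nothing else.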

Threading

For a single bulk ingest, the "harvest.distributionFactor" should be set to 80. This corresponds to:

  • 8 nodes x 5 file threads x 2 (duty cycle)

If you are expecting to be ingesting multiple sources at the same time then scale down the distributionFactor accordingly.
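The calculation generalizes to other cluster sizes as follows. Dividing the factor evenly between concurrent sources is one possible reading of "scale down accordingly", not a documented rule, and the function name is hypothetical.

```python
def distribution_factor(nodes, file_threads_per_node, concurrent_sources=1,
                        duty_cycle_multiplier=2):
    """Suggest a harvest.distributionFactor value.

    nodes * file threads * 2 (duty cycle) for a single bulk ingest,
    shared evenly between sources when several are ingested at once
    (the even split is an assumption).
    """
    total = nodes * file_threads_per_node * duty_cycle_multiplier
    return max(1, total // concurrent_sources)


if __name__ == "__main__":
    print(distribution_factor(8, 5))                        # single bulk ingest
    print(distribution_factor(8, 5, concurrent_sources=4))  # four sources at once
```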

Feature Extractor performance

Note that the limiting factor on performance will often be the NLP processing that needs to occur. For example Salience will run at ~2 docs per second per thread on "average" sized documents (so at an unrealistic 100% duty cycle on an 8 harvest node cluster with 5 file threads that would give you 80 docs/second, or roughly 288K/hour). Some NLTK-based extractors are even slower.
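The throughput estimate is plain arithmetic, sketched here so it can be rerun for other cluster sizes or extractor speeds. The function name and the 50% duty cycle in the second call are illustrative assumptions.

```python
def extraction_throughput(nodes, file_threads_per_node,
                          docs_per_second_per_thread, duty_cycle=1.0):
    """Return (docs per second, docs per hour) for the whole cluster."""
    per_second = (nodes * file_threads_per_node
                  * docs_per_second_per_thread * duty_cycle)
    return per_second, per_second * 3600


if __name__ == "__main__":
    # 8 harvest nodes, 5 file threads each, ~2 docs/second/thread.
    print(extraction_throughput(8, 5, 2))
    # Same cluster at a hypothetical 50% duty cycle.
    print(extraction_throughput(8, 5, 2, duty_cycle=0.5))
```

At a 100% duty cycle this gives 80 docs/second, i.e. 288,000 docs/hour; any realistic duty cycle scales both numbers down proportionally.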

Copyright © 2012 IKANOW, All Rights Reserved | Licensed under Creative Commons