...

  • It has a new JS engine that is faster than Rhino (this needs some changes to our harvester to support that platform, which are ongoing)
  • Better GC algorithms (will require some changes to the Elasticsearch configuration)

...

(Relative to the central configuration file described here):

  • Increase the performance of the harvester at the expense of memory usage (the memory available to the harvester is changed under "Post install configuration" below):
    • "harvest.threads" (default: "file:5,database:5,feed:20") - set to "file:5,database:1,feed:10" for 8 cores, "file:10" for 16 cores etc.
    • "harvest.maxdocs_persource" (default "1000") - set to 2500 
    • "harvest.aggregation.duty_cycle" - set to 0.25 (slower entity aggregation allowing faster bulk import)
  • Data store limit:
    • "db.capacity" (default "100000000") - set to 500000000 (ie 500M) to be on the safe side
  • API configuration:
    • "api.aggregation.accuracy" - ensure set to "low"
  • Elasticsearch configuration:
    • elastic.search.min_peers - set to 2 (provides better split brain protection)
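
For reference, a sketch of how these recommendations might look in the central configuration file, assuming it uses standard key=value properties syntax (the "file:10" thread count is the 16-core suggestion from above - adjust to your hardware):

  • harvest.threads=file:10
  • harvest.maxdocs_persource=2500
  • harvest.aggregation.duty_cycle=0.25
  • db.capacity=500000000
  • api.aggregation.accuracy=low
  • elastic.search.min_peers=2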

Other configuration notes:

  • It is now possible to configure MongoDB clusters using hostnames instead of IP addresses:

...

  • TODO
    • if "db.config.servers" is left blank, then the short hostname is used (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
    • if "db.config.servers" is set to "FQDN" then the full hostname is used
  • Where possible the config nodes should be separate machines, and 3 should always be used. If there are no additional nodes available, then they should be put on the API nodes.
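
For example (hostnames are illustrative only, following the "node-api-1-1.ikanow.com" pattern above and assuming the recommended 3 config nodes):

  • "db.config.servers" left blank: the config servers are addressed as node-api-1-1, node-api-1-2, node-api-1-3
  • "db.config.servers" set to "FQDN": the config servers are addressed as node-api-1-1.ikanow.com, node-api-1-2.ikanow.com, node-api-1-3.ikanow.com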

Configuration - database configuration

One "trick" MongoDB have suggested in the past to speed up write performance is to run multiple instances of MongoDB shards on a single node (this comes at the price of a slight degradation in read performance for reads that aren't targeted based on their shard key).

We are currently expanding the ways in which we take advantage of this for performance - and in any case shards can be added in this way dynamically.

For the moment it is recommended to add 2 shards to each node. This can be done very easily via the "db.replica.sets" parameter, e.g. if you have 4 hosts with hostnames A, B, C and D, then the configuration would look like:

  • db.replica.sets=A,B;A,B;C,D;C,D

(i.e. A and B hold replicas of the same data and host shards 1 and 2, while C and D hold replicas of the same data and host shards 3 and 4)
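
Spelled out, each semicolon-separated group in the entry above is the replica set for one shard:

  • shard 1: replicated across hosts A and B
  • shard 2: replicated across hosts A and B
  • shard 3: replicated across hosts C and D
  • shard 4: replicated across hosts C and D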

RPM to node distribution

Assuming Nx API nodes and Mx DB nodes, the "standard" mapping is:

...

Hadoop configuration

In the Hadoop configuration guide, the following settings are recommended:

  • mappers ("Maximum Number of Simultaneous Map Tasks"): 2
  • reducers ("Maximum Number of Simultaneous Reduce Tasks"): 1

For an 8 core system this should be set to 4 and 2. (For a 16 core system, 8 and 4, etc).
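
If you are editing the Hadoop configuration files directly rather than through a management UI, the equivalent per-TaskTracker limits are believed to be the classic MRv1 properties below (shown with the 8-core values; the exact names may differ in your distribution):

  • mapred.tasktracker.map.tasks.maximum=4
  • mapred.tasktracker.reduce.tasks.maximum=2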

Post install configuration

...

  • /etc/sysconfig/tomcat6-interface-engine:
    • Find the line:
      • JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx1024m -Xmn256m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && JAVA_OPTS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m"

    • Change the second JAVA_OPTS clause to (changes shown in bold)

      • "$JAVA_OPTS -Xms4096m -Xmx4096m -Xmn1024m"
        • (Note don't scale this with additional memory - 4GB should be sufficient)
    • (After changing, restart the corresponding service with: "service tomcat6-interface-engine restart")
  • /etc/sysconfig/infinite-index-engine:
    • Find the line:
      • export JAVA_OPTS="-Xms2048m -Xmx2048m -Xmn512m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && JAVA_OPTS="-Xms7656m -Xmx7656m -Xmn2048m"

    • Change the second JAVA_OPTS clause to (changes shown in bold):
      • JAVA_OPTS="-Xms25g -Xmx25g -Xmn5g"
        • (this is for 60GB - do scale linearly with memory)
    • (After changing, restart the corresponding service with: "service infinite-index-engine restart")
  • /opt/infinite-home/bin/infinite-px-engine.sh:
    • Find the line:
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m -Dfile.encoding=UTF-8 [...]
    • Change the start to (changes shown in bold):
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms10g -Xmx10g -Xmn2048m -Dfile.encoding=UTF-8 [...] 
        • (Don't scale this with memory - ideally you want ~2GB per file thread (assuming 2.5K docs/cycle), but not more than total_memory - 2*elasticsearch_memory!)
    • (After changing, restart the corresponding service with: "service infinite-px-engine restart")
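
As a quick sanity check after editing all three files, the new heap sizes can be confirmed with something like the following (paths as listed above):

  • grep "Xmx" /etc/sysconfig/tomcat6-interface-engine /etc/sysconfig/infinite-index-engine /opt/infinite-home/bin/infinite-px-engine.sh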

Shared filesystem configuration

Currently we do not take advantage of HDFS for file extraction - this is coming soon.

In the meantime to provide a shared filesystem, there are a few options:

  • Set up a Samba share on one of the servers (or ideally a separate fileserver), use the file extractor NetBIOS interface
  • Set up an NFS share on one of the servers (or ideally a separate fileserver), mount on each of the harvest nodes, use the file extractor local file interface
  • Use FUSE on each harvest node to provide a regular filesystem interface to HDFS (this is unproven; the one time I tried it - in a slightly uncontrolled environment - FUSE stopped working after a day or so)

It is not known which is best from a performance standpoint - the second (NFS) is recommended for now (Samba would be preferred but see below).
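
For the NFS option, a minimal sketch (the fileserver hostname and paths are purely illustrative):

  • On the fileserver, export the ingest directory, e.g. add "/data/ingest *(rw,sync,no_subtree_check)" to /etc/exports and run "exportfs -ra"
  • On each harvest node, mount it: mount -t nfs fileserver:/data/ingest /mnt/ingest
  • Point the file extractor's local file interface at the mounted path (/mnt/ingest in this example)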

Warning

There is currently an issue with multi-threading in the NetBIOS interface - as a result only one thread can perform all of the file operations (including slow bulk operations like de-duplication), which makes the Samba method perform very poorly when multi-threaded. For the moment, the Samba method is not recommended.

UPDATE (11 Nov): there is a fix in the trunk that will be pushed out in the Nov 2014 release. With the fix implemented, the Samba share method is preferred again.

Source JSON configuration

Extractor type

The file extractor is the most highly optimized, so it should be used wherever possible.

Deduplication

The fastest configuration for reading in files is as follows:

  • set "file.

...

XXX

XXX roadmap

 

 

 

...

  • mode" to "streaming"
  • set "file.renameAfterParse" to "."

Warning

This will delete the files in the input directory as they are processed. If you want to preserve the files, they should therefore be copied (not moved) into the input directory.

(One alternative is to have an "archive" sub-directory of each input directory and then set "file.renameAfterParse" to "$path/archive/$name" and then set "file.pathExclude" to ".*/archive/.*")
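
Putting the archive variant together, a sketch of how these settings might sit in the source JSON, assuming the dotted names above ("file.mode" etc.) correspond to fields of the source's "file" object:

  {
    "file": {
      "mode": "streaming",
      "renameAfterParse": "$path/archive/$name",
      "pathExclude": ".*/archive/.*"
    }
  }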

Threading

For a single bulk ingest, "harvest.distributionFactor" should be set to 80; this corresponds to:

  • 8 nodes x 5 file threads x 2 (duty cycle)

If you are expecting to be ingesting multiple sources at the same time then scale down the distributionFactor accordingly.
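
For example, if you expect 4 bulk sources to be running at the same time on the same 8-node cluster, then a distributionFactor of roughly 80 / 4 = 20 per source would be appropriate.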

Feature Extractor performance

Note that the limiting factor on performance will often be the NLP processing that needs to occur. For example, Salience will run at ~2 docs per second per thread on "average" sized documents (so at an unrealistic 100% duty cycle, an 8 harvest node cluster with 5 file threads per node would give you 80 docs/second, or roughly 290K/hour). Some NLTK-based extractors are even slower.