...

  • It has a new JS engine that is faster than Rhino (this requires some changes to our harvester on that platform, which are ongoing)
  • Better GC algorithms (will require some changes to the Elasticsearch configuration)

...

(Relative to the central configuration file described here; a combined sketch of these settings follows the list):

  • Increase the performance of the harvester at the expense of memory usage (the harvester's available memory setting is changed under "Post Install Settings" below):
    • "harvest.threads" (default: "file:5,database:5,feed:20") - set to "file:5,database:1,feed:10" for 8 cores, "file:10" for 16 cores etc.
    • "harvest.maxdocs_persource" (default "1000") - set to 2500 
    • "harvest.aggregation.duty_cycle" - set to 0.25 (slower entity aggregation allowing faster bulk import)
  • Data store limit:
    • "db.capacity" (default "100000000") - set to 500000000 (ie 500M) to be on the safe side
  • API configuration:
    • "api.aggregation.accuracy" - ensure set to "low"
  • Elasticsearch configuration:
    • "elastic.search.min_peers" - set to 2 (provides better split-brain protection)
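
Taken together, the recommended edits above would look something like this in the central configuration file (a sketch only: the property names and values are those listed above, but the key=value syntax is an assumption about the file format):

    # Harvester settings (8-core example; memory is set under "Post Install Settings" below)
    harvest.threads=file:5,database:1,feed:10
    harvest.maxdocs_persource=2500
    harvest.aggregation.duty_cycle=0.25
    # Data store limit (500M, to be on the safe side)
    db.capacity=500000000
    # API
    api.aggregation.accuracy=low
    # Elasticsearch split-brain protection
    elastic.search.min_peers=2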

Other configuration notes:

  • It is now possible to configure MongoDB clusters using hostnames instead of IP addresses:
    • if "db.config.servers" is left blank, then the short hostname is used (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
    • if "db.config.servers" is set to "FQDN" then the full hostname is used
  • Where possible the config nodes should be separate machines, and 3 should always be used. If there are no additional nodes available, then they should be put on the API nodes.
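
For example, the two "db.config.servers" behaviours above would be selected as follows (a sketch, again assuming key=value syntax; the hostname is illustrative):

    # Left blank: config servers are registered by short hostname
    # (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
    db.config.servers=

    # Set to the literal string "FQDN": the full hostname is used
    db.config.servers=FQDN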

Configuration - database configuration

One "trick" MongoDB have suggested in the past to speed up write performance is to run multiple instances of MongoDB shards on a single node (this comes at the price of a slight degradation in read performance for reads that aren't targeted based on their shard key):

...


We are currently expanding the ways in which we take advantage of this for performance, and in any case shards can be added in this way dynamically.

For the moment it is recommended to add 2 shards to each node. This can be done very easily via the "db.replica.sets" parameter. For example, if you have 4 hosts with hostnames A, B, C and D, then the configuration would look like:

  • db.replica.sets=A,B;A,B;C,D;C,D

(ie the (A,B) and (C,D) pairs each contain the same data, with (A,B) hosting shards 1 and 2 and (C,D) hosting shards 3 and 4)
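
To make the format concrete, here is a minimal sketch (Python, purely illustrative) of how such a value decomposes, on the assumption that semicolons separate shards and commas separate the hosts in each shard's replica set:

    value = "A,B;A,B;C,D;C,D"
    for shard, group in enumerate(value.split(";"), start=1):
        hosts = group.split(",")
        print("shard %d -> replica set on %s" % (shard, ",".join(hosts)))

    # shard 1 -> replica set on A,B
    # shard 2 -> replica set on A,B
    # shard 3 -> replica set on C,D
    # shard 4 -> replica set on C,D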

RPM to node distribution

Assuming Nx API nodes and Mx DB nodes, the "standard" mapping is:

...

Hadoop configuration

In the Hadoop configuration guide, the following settings are recommended:

  • mappers ("Maximum Number of Simultaneous Map Tasks"): 2
  • reducers ("Maximum Number of Simultaneous Reduce Tasks"): 1

For an 8-core system these should be set to 4 and 2 respectively (for a 16-core system, 8 and 4, etc; ie roughly cores/2 mappers and cores/4 reducers).
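
If the cluster is not managed through a UI, the equivalent stock MRv1 properties can be set directly in mapred-site.xml (an assumption about the Hadoop distribution in use; values shown for an 8-core node):

    <!-- cores/2 concurrent map tasks -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- cores/4 concurrent reduce tasks -->
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>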

Post install configuration

...

  • /etc/sysconfig/tomcat6-interface-engine:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
  • /etc/sysconfig/infinite-index-engine:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
  • /opt/infinite-home/bin/infinite-px-engine.sh:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
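
The restart commands themselves are not reproduced above; assuming standard SysV init scripts (an assumption, not confirmed by this page), the first two would typically take the form:

    # assumed SysV pattern; substitute each service name from the list above
    service tomcat6-interface-engine restart
    service infinite-index-engine restart

How the infinite-px-engine script is restarted is not specified here.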

Shared filesystem configuration

TODO

Source JSON configuration

...