...
- It has a new JS engine that is faster than Rhino (this needs some changes to our harvester on that platform, which are ongoing)
- Better GC algorithms (will require some changes to the Elasticsearch configuration)
...
(Relative to the central configuration file described here):
- Increase the performance of the harvester at the expense of memory usage (the harvester's available memory setting is changed under "Post Install Settings" below):
- "harvest.threads" (default: "file:5,database:5,feed:20") - set to "file:5,database:1,feed:10" for 8 cores, "file:10" for 16 cores etc.
- "harvest.maxdocs_persource" (default "1000") - set to 2500
- "harvest.aggregation.duty_cycle" - set to 0.25 (slower entity aggregation allowing faster bulk import)
- Data store limit:
- "db.capacity" (default "100000000") - set to 500000000 (ie 500M) to be on the safe side
- API configuration:
- "api.aggregation.accuracy" - ensure set to "low"
- Elasticsearch configuration:
- "elastic.search.min_peers" - set to 2 (provides better split-brain protection)
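Assuming the central configuration file uses standard Java-properties syntax (one key=value per line - an assumption, check your install), the recommendations above would look like this for an 8 core system:

```properties
# Harvester performance, at the expense of memory (8 core values from above)
harvest.threads=file:5,database:1,feed:10
harvest.maxdocs_persource=2500
harvest.aggregation.duty_cycle=0.25
# Data store limit (ie 500M, to be on the safe side)
db.capacity=500000000
# API aggregation accuracy
api.aggregation.accuracy=low
# Elasticsearch split-brain protection
elastic.search.min_peers=2
```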
Other configuration notes:
- It is now possible to configure MongoDB clusters using hostnames instead of IP addresses:
- if "db.config.servers" is left blank, then the short hostname is used (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
- if "db.config.servers" is set to "FQDN" then the full hostname is used
- Where possible the config nodes should be separate machines, and 3 should always be used. If there are no additional nodes available, then they should be put on the API nodes.
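For example (again assuming Java-properties syntax), the two hostname modes for "db.config.servers" described above would be configured as:

```properties
# Left blank: the short hostname is used (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
db.config.servers=
# Or set to "FQDN" to use the full hostname instead:
#db.config.servers=FQDN
```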
Configuration - database configuration
One "trick" MongoDB has suggested in the past to speed up write performance is to run multiple MongoDB shard instances on a single node (this comes at the price of slightly degraded read performance for reads that are not targeted by their shard key):
...
.
We are currently expanding the ways in which we take advantage of this for performance - and in any case shards can be added in this way dynamically.
For the moment it is recommended to add 2 shards to each node. This can be done very easily from the "db.replica.sets" parameter - eg if you have 4 hosts with hostnames A, B, C and D, then the configuration would look like:
- db.replica.sets=A,B;A,B;C,D;C,D
(ie (A,B) and (C,D) pairwise contain the same data, and (A,B) hosts data from shards 1 and 2 and (C,D) hosts data from shards 3 and 4)
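As an illustrative sketch (not project code), the "db.replica.sets" syntax can be read as "semicolons separate shards, commas separate the replica hosts within a shard":

```python
def parse_replica_sets(value: str) -> list[list[str]]:
    """Illustrative only: split a db.replica.sets value into one host list per shard."""
    return [shard.split(",") for shard in value.split(";") if shard]

# 4 hosts, 2 shards per node: (A,B) hold shards 1 and 2, (C,D) hold shards 3 and 4
for n, hosts in enumerate(parse_replica_sets("A,B;A,B;C,D;C,D"), start=1):
    print(f"shard {n} is replicated across: {hosts}")
```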
RPM to node distribution
Assuming Nx API nodes and Mx DB nodes, the "standard" mapping is:
...
Hadoop configuration
In the Hadoop configuration guide, the following settings are recommended:
- mappers ("Maximum Number of Simultaneous Map Tasks"): (number of cores)/2
- reducers ("Maximum Number of Simultaneous Reduce Tasks"): (number of cores)/4
For an 8 core system this should be set to 4 and 2. (For a 16 core system, 8 and 4, etc).
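If you are editing the Hadoop configuration files directly rather than a management GUI, the equivalent Hadoop 1.x (MRv1) properties would be set along these lines - a sketch for an 8 core system, assuming mapred-site.xml is where your install keeps them:

```xml
<!-- mapred-site.xml sketch: 8 cores -> 4 mappers (cores/2), 2 reducers (cores/4) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```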
Post install configuration
...
- /etc/sysconfig/tomcat6-interface-engine:
- XXX
- (After changing, restart the corresponding service with: XXX)
- /etc/sysconfig/infinite-index-engine:
- XXX
- (After changing, restart the corresponding service with: XXX)
- /opt/infinite-home/bin/infinite-px-engine.sh:
- XXX
- (After changing, restart the corresponding service with: XXX)
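The exact restart commands are elided (XXX) above. As a hedged sketch only: on a RHEL/CentOS-style install the service names suggested by the /etc/sysconfig paths would typically be restarted as below - verify the actual names and commands on your system first:

```shell
# Sketch only - service/script names inferred from the file paths above
service tomcat6-interface-engine restart
service infinite-index-engine restart
# infinite-px-engine.sh lives in /opt/infinite-home/bin rather than /etc/sysconfig,
# so it is presumably restarted via the script itself:
/opt/infinite-home/bin/infinite-px-engine.sh restart
```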
Shared filesystem configuration
TODO
Source JSON configuration
...