...

  • It has a new JS engine that is faster than Rhino (this requires some changes to our harvester on that platform, which are ongoing)
  • Better GC algorithms (will require some changes to the Elasticsearch configuration)

...

(Relative to the central configuration file described here; a combined sketch of these settings follows the list):

  • Increase the performance of the harvester at the expense of memory usage (the harvester's available memory setting is changed under "Post Install Settings" below):
    • "harvest.threads" (default: "file:5,database:5,feed:20") - set to "file:5,database:1,feed:10" for 8 cores, "file:10" for 16 cores etc.
    • "harvest.maxdocs_persource" (default "1000") - set to 2500 
    • "harvest.aggregation.duty_cycle" - set to 0.25 (slower entity aggregation allowing faster bulk import)
  • Data store limit:
    • "db.capacity" (default "100000000") - set to 500000000 (ie 500M) to be on the safe side
  • API configuration:
    • "api.aggregation.accuracy" - ensure set to "low"
  • Elasticsearch configuration:
    • "elastic.search.min_peers" - set to 2 (provides better split-brain protection)
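
Taken together, the recommended edits above would look something like this in the central configuration file (a sketch only: the property names and values are those listed above, but the key=value syntax is an assumption about the file format):

    # Harvester settings (8-core example; memory is set under "Post Install Settings" below)
    harvest.threads=file:5,database:1,feed:10
    harvest.maxdocs_persource=2500
    harvest.aggregation.duty_cycle=0.25
    # Data store limit (500M, to be on the safe side)
    db.capacity=500000000
    # API
    api.aggregation.accuracy=low
    # Elasticsearch split-brain protection
    elastic.search.min_peers=2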

Other configuration notes:

  • It is now possible to configure MongoDB clusters using hostnames instead of IP addresses:
    • if "db.config.servers" is left blank, then the short hostname is used (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
    • if "db.config.servers" is set to "FQDN" then the full hostname is used
  • Where possible the config nodes should be separate machines, and 3 should always be used. If there are no additional nodes available, then they should be put on the API nodes.
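
For example, the two "db.config.servers" behaviours above would be selected as follows (a sketch, again assuming key=value syntax; the hostname is illustrative):

    # Left blank: config servers are registered by short hostname
    # (eg "node-api-1-1.ikanow.com" -> "node-api-1-1")
    db.config.servers=

    # Set to the literal string "FQDN": the full hostname is used
    db.config.servers=FQDN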

Configuration - database configuration

One "trick" MongoDB have suggested in the past to speed up write performance is to run multiple instances of MongoDB shards on a single node (this comes at the price of a slight degradation in read performance for reads that aren't targeted based on their shard key):

...


We are currently expanding the ways in which we take advantage of this for performance, and in any case shards can be added in this way dynamically.

For the moment it is recommended to add 2 shards to each node. This can be done very easily via the "db.replica.sets" parameter. For example, if you have 4 hosts with hostnames A, B, C and D, then the configuration would look like:

  • db.replica.sets=A,B;A,B;C,D;C,D

(ie the (A,B) and (C,D) pairs each contain the same data, with (A,B) hosting shards 1 and 2 and (C,D) hosting shards 3 and 4)
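
To make the format concrete, here is a minimal sketch (Python, purely illustrative) of how such a value decomposes, on the assumption that semicolons separate shards and commas separate the hosts in each shard's replica set:

    value = "A,B;A,B;C,D;C,D"
    for shard, group in enumerate(value.split(";"), start=1):
        hosts = group.split(",")
        print("shard %d -> replica set on %s" % (shard, ",".join(hosts)))

    # shard 1 -> replica set on A,B
    # shard 2 -> replica set on A,B
    # shard 3 -> replica set on C,D
    # shard 4 -> replica set on C,D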

RPM to node distribution

Assuming Nx API nodes and Mx DB nodes, the "standard" mapping is:

...

Hadoop configuration

In the Hadoop configuration guide, the following settings are recommended:

  • mappers ("Maximum Number of Simultaneous Map Tasks"): 2
  • reducers ("Maximum Number of Simultaneous Reduce Tasks"): 1

For an 8-core system these should be set to 4 and 2 respectively (for a 16-core system, 8 and 4, etc; ie roughly cores/2 mappers and cores/4 reducers).
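
If the cluster is not managed through a UI, the equivalent stock MRv1 properties can be set directly in mapred-site.xml (an assumption about the Hadoop distribution in use; values shown for an 8-core node):

    <!-- cores/2 concurrent map tasks -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- cores/4 concurrent reduce tasks -->
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>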

Post install configuration

...

  • /etc/sysconfig/tomcat6-interface-engine:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
  • /etc/sysconfig/infinite-index-engine:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
  • /opt/infinite-home/bin/infinite-px-engine.sh:
    • XXX
    • (After changing, restart the corresponding service with: XXX)
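
The restart commands themselves are not reproduced above; assuming standard SysV init scripts (an assumption, not confirmed by this page), the first two would typically take the form:

    # assumed SysV pattern; substitute each service name from the list above
    service tomcat6-interface-engine restart
    service infinite-index-engine restart

How the infinite-px-engine script is restarted is not specified here.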

Shared filesystem configuration

TODO

Source JSON configuration

...