...

  • /etc/sysconfig/tomcat6-interface-engine:
    • Find the line:
      • JAVA_OPTS="$JAVA_OPTS -Xms1024m -Xmx1024m -Xmn256m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && JAVA_OPTS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m"

    • Change the second JAVA_OPTS clause to:

      • "$JAVA_OPTS -Xms4096m -Xmx4096m -Xmn1024m"
        • (Note: don't scale this with additional memory - 4GB should be sufficient)
    • (After changing, restart the corresponding service with: "service tomcat6-interface-engine restart")
  • /etc/sysconfig/infinite-index-engine:
    • Find the line:
      • export JAVA_OPTS="-Xms2048m -Xmx2048m -Xmn512m" && [[ `cat /proc/meminfo | grep MemTotal | gawk '{ print $2 }' | grep -P "[0-9]{8,}"` ]] && \
        JAVA_OPTS="-Xms7656m -Xmx7656m -Xmn2048m"

    • Change the second JAVA_OPTS clause to:
      • JAVA_OPTS="-Xms25g -Xmx25g -Xmn5g"
        • (this is for a 60GB node - do scale this linearly with memory)
    • (After changing, restart the corresponding service with: "service infinite-index-engine restart")
  • /opt/infinite-home/bin/infinite-px-engine.sh:
    • Find the line:
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn512m -Dfile.encoding=UTF-8 [...]
    • Change the start of the line to:
      • EXTRA_JAVA_ARGS="$JAVA_OPTS -Xms10g -Xmx10g -Xmn2048m -Dfile.encoding=UTF-8 [...] 
        • (Don't scale this with memory - ideally you want ~2GB per file thread (assuming 2.5K docs/cycle), but not more than total_memory - 2*elasticsearch_memory!)
    • (After changing, restart the corresponding service with: "service infinite-px-engine restart")
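
As a quick sanity check after making the three edits above, something like the following can be run on each affected node (a minimal sketch, assuming the default file locations listed above):

    # Confirm the new heap settings are in place
    grep "JAVA_OPTS" /etc/sysconfig/tomcat6-interface-engine
    grep "JAVA_OPTS" /etc/sysconfig/infinite-index-engine
    grep "EXTRA_JAVA_ARGS" /opt/infinite-home/bin/infinite-px-engine.sh

    # Restart the affected services so the new settings take effect
    service tomcat6-interface-engine restart
    service infinite-index-engine restart
    service infinite-px-engine restart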

Shared filesystem configuration

...

Currently we do not take advantage of HDFS for file extraction - this is coming soon.

In the meantime, to provide a shared filesystem, there are a few options:

  • Set up a Samba share on one of the servers (or ideally a separate fileserver), and use the file extractor's NetBIOS interface
  • Set up an NFS share on one of the servers (or ideally a separate fileserver), mount it on each of the harvest nodes, and use the file extractor's local file interface
  • Use FUSE on each harvest node to provide a regular filesystem interface to HDFS (this is unproven; the one time I tried it - in a slightly uncontrolled environment - FUSE stopped working after a day or so)

It is not known which is best from a performance standpoint; the second option (NFS) is recommended for now (Samba would otherwise be preferred, but see below).
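
For reference, here is a minimal sketch of the NFS option. The export path (/data/ingest), client subnet, and mount point (/mnt/ingest) are illustrative placeholders only:

    # On the fileserver: export the shared ingest directory (add this line to /etc/exports)
    /data/ingest    10.0.0.0/24(rw,sync,no_root_squash)

    # Apply the export, then mount it on each harvest node
    exportfs -ra
    mkdir -p /mnt/ingest
    mount -t nfs fileserver:/data/ingest /mnt/ingest

Each harvest source would then use the file extractor's local file interface against a path under the mount point (/mnt/ingest in this sketch).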

Warning

There is currently an issue with multi-threading in the NetBIOS interface: as a result, a single thread has to perform all of the file operations (including slow bulk operations like de-duplication), which makes the Samba method perform very poorly when multi-threaded. For the moment, the Samba method is not recommended.

UPDATE (11 Nov): there is a fix in the trunk that will be pushed out in the Nov 2014 release. With the fix in place, the Samba share method is preferred again.

Source JSON configuration

Extractor type

The file extractor is the most heavily optimized, so it should be used wherever possible.
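
For illustration, selecting the file extractor in the source JSON looks something like the fragment below. Note this is only a sketch: the "extractType" value and the "url" field are assumptions on my part (the nesting of the "file" object is inferred from the dot notation used elsewhere on this page), so verify them against an existing file source before use:

    {
      "extractType": "File",
      "file": {
        "url": "file:///mnt/ingest/mydata/"
      }
    }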

Deduplication

The fastest configuration for reading in files is as follows:

  • set "file.

...

XXX

XXX roadmap

 

 

 

...

  • mode" to "streaming"
  • set "file.renameAfterParse" to "."

Warning

This will delete the files in the input directory as they are processed. If you want to preserve the originals, copy them (rather than moving them) into the input directory.

(One alternative is to create an "archive" sub-directory under each input directory, set "file.renameAfterParse" to "$path/archive/$name", and set "file.pathExclude" to ".*/archive/.*")
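
In the same (assumed) notation, the archive variant would look something like:

    {
      "file": {
        "mode": "streaming",
        "renameAfterParse": "$path/archive/$name",
        "pathExclude": ".*/archive/.*"
      }
    }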

Threading

For a single bulk ingest, the "harvest.distributionFactor" should be set to 80; this corresponds to:

  • 8 nodes x 5 file threads x 2 (duty cycle)

If you expect to be ingesting multiple sources at the same time, then scale the distributionFactor down accordingly.
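
For example, with two bulk sources ingesting concurrently on the same 8-node cluster, a reasonable starting point would be 80 / 2 = 40 per source; similarly, a 4-node cluster with 5 file threads and a single bulk source would use 4 x 5 x 2 = 40.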

Feature Extractor performance

Note that the limiting factor on performance will often be the NLP processing that needs to occur. For example Salience will run at ~2 docs per second per thread on "average" sized documents (so at unrealistic 100% duty cycle on an 8 harvest node cluster with 5 files threads that would give you 80 docs/second, or about 250K/hour). Some NLTK-based extractors are even slower.