...

#-------------------------------------------------------------------------------
# 2.11] Harvester Properties
#-------------------------------------------------------------------------------
# Comma-separated-list from File,Database,Feed (note Database and Feed need jars not bundled with the RPM)
harvester.types=File,Database,Feed
# Web crawling etiquette: the time to wait (in ms) between consecutive accesses to the same site (10s is standard)
harvest.feed.wait=10000
# The minimum time between consecutive harvests (avoids thrashing FS/DB/RSS when there's nothing to get)
harvest.mintime.ms=300000
# The minimum time between consecutive source harvests (set if needs to be longer than harvest.mintime.ms,
# eg if you want to pick up a source quickly the first time but then not update so frequently)
harvest.source.mintime.ms=
# Restricts the number of docs that can be harvested per cycle for memory reasons:
harvest.maxdocs_persource=5000
# Threading configuration type:num_threads (type from above):
# (eg for RSS heavy increase the "feed", for DB heavy increase the "database" etc. Beyond 20 there is limited benefit).
harvest.threads=file:5,database:5,feed:20
# This controls the batch size of sources picked up by a thread; it does not normally need to be changed (the default is shown)
# (It can be reduced in cases where a small number of very long-running sources need to be harvested).
#harvest.distribution.batch.harvest=20
# This disables entity and association aggregation. For almost all applications you will not want to set this.
#harvest.disable_aggregation=false
# This controls what % of 1 CPU is used to update entity and association counts and synchronization, shouldn't need to change it
# (Reducing it will speed up raw harvest speeds, increasing it will keep entity etc freqs more up-to-date)
#harvest.aggregation.duty_cycle=0.5
# This parameter uses the Java Security Manager to prevent scripts accessing local network services (at the expense of some performance)
# It can be turned off for uses of the platform where sources must be approved before being added (etc)
harvest.security=false
# This is a comma-separated list of hosts in the following format "http://<HOST>[:<PORT>]" or "socks://<HOST>:<PORT>"
# When specified, all requests for external content from the harvester are proxied (round-robin) through the specified hosts
harvest.proxy=
# Content controls:
# This is the maximum size of content (before gzip) that will be stored (truncated above this)
#store.maxcontent=16000000
# If true (default false), stores the raw content of a document (as well as the post-processed text)
#store.rawcontent=false
# If true (default false), then the doc_content.gzip_content also contains the JSON metadata, stored as a string
#store.metadata_as_content=false
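
The value formats of harvest.threads ("type:num_threads" pairs) and harvest.proxy (a comma-separated host list used round-robin) can be illustrated with a short Python sketch. This is not the platform's own parsing code; it only mirrors the formats described in the comments above. The function names and the example.com proxy hosts are hypothetical, chosen for illustration.

```python
from itertools import cycle
from urllib.parse import urlsplit


def parse_thread_config(value):
    """Parse a harvest.threads value such as 'file:5,database:5,feed:20'
    into a dict mapping harvester type to thread count."""
    threads = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing comma
        harvester_type, _, count = entry.partition(":")
        # int() raises ValueError if the count is missing or not a number
        threads[harvester_type.strip().lower()] = int(count)
    return threads


def parse_proxies(value):
    """Parse a harvest.proxy value into (scheme, host, port) tuples.
    The port is None when an http entry omits it."""
    proxies = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # an empty property is treated here as "no proxies"
        parts = urlsplit(entry)
        if parts.scheme not in ("http", "socks"):
            raise ValueError("unsupported proxy scheme: " + entry)
        proxies.append((parts.scheme, parts.hostname, parts.port))
    return proxies


def round_robin(proxies):
    """Return an endless iterator over the proxies, or None if there are none."""
    return cycle(proxies) if proxies else None


# Example values: the thread string is the default shown above,
# the proxy hosts are made up for this sketch.
threads = parse_thread_config("file:5,database:5,feed:20")
rotation = round_robin(
    parse_proxies("http://proxy-a.example.com:8080,socks://proxy-b.example.com:1080")
)
first_three = [next(rotation) for _ in range(3)]
```

With two proxies configured, three consecutive requests in this sketch would go to the first proxy, the second, and then the first again. An empty harvest.proxy yields no rotation at all, which the sketch takes to mean that requests are not proxied.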
2.12 Hadoop Properties

The Hadoop config path is a local folder where Infinit.e stores MapReduce jobs when Hadoop is used.

...