...
Code Block |
---|
#-------------------------------------------------------------------------------
# 2.11] Harvester Properties
#-------------------------------------------------------------------------------
# Comma-separated list from File,Database,Feed (note Database and Feed need jars not bundled with the RPM)
harvester.types=File,Database,Feed
# Web crawling etiquette: the time to wait between consecutive accesses to the same site (10s is standard)
harvest.feed.wait=10000
# The minimum time between consecutive harvests (avoids thrashing FS/DB/RSS when there's nothing to get)
harvest.mintime.ms=300000
# The minimum time between consecutive source harvests (set if needs to be longer than harvest.mintime.ms,
# eg if you want to pick up a source quickly the first time but then not update so frequently)
harvest.source.mintime.ms=
# Restricts the number of docs that can be harvested per cycle for memory reasons:
harvest.maxdocs_persource=5000
# Threading configuration type:num_threads (type from above):
# (eg for RSS heavy increase the "feed", for DB heavy increase the "file" etc. Beyond 20 there is limited benefit).
harvest.threads=file:5,database:5,feed:20
# This controls the batch size of sources picked up by a thread, this does not normally need to be changed (its default is shown)
# (It can be reduced in cases where a small number of very long-running sources need to be harvested).
#harvest.distribution.batch.harvest=20
# This disables entity and association aggregation. For almost all applications you will not want to set this.
#harvest.disable_aggregation=false
# This parameter controls what % of 1 CPU is used to update entity and association counts and synchronization, shouldn't need to change it
# (Reducing it will speed up raw harvest speeds, increasing it will keep entity etc freqs more up-to-date)
#harvest.aggregation.duty_cycle=0.5
# This parameter uses the Java Security Manager to prevent scripts accessing local network services (at the expense of some performance)
# It can be turned off for uses of the platform where sources must be approved before being added (etc)
harvest.security=false
# This is a comma-separated list of hosts in the following format "http://<HOST>[:<PORT>]" or "socks://<HOST>:<PORT>"
# When specified, all requests for external content from the harvester are proxied (round-robin) through the specified hosts
harvest.proxy=
# Content controls:
#This is the maximum size of content (before gzip) that will be stored (truncated above this)
#store.maxcontent=16000000
#If true (default false), stores the raw content of a document (as well as the post-processed text)
#store.rawcontent=false
#If true (default false), then the doc_content.gzip_content also contains the JSON metadata, stored as a string
#store.metadata_as_content=false |
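To make the value formats in the harvester block concrete, here is a minimal Python sketch that reads a few of those properties and interprets them. It is not Infinit.e's own parser: the helper names `parse_properties` and `parse_threads` are hypothetical, and the sample text simply copies values from the listing above.

```python
# Hypothetical helpers for illustration only; not part of Infinit.e.
SAMPLE = """\
# Values copied from the harvester properties listing
harvester.types=File,Database,Feed
harvest.feed.wait=10000
harvest.mintime.ms=300000
harvest.source.mintime.ms=
harvest.threads=file:5,database:5,feed:20
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_threads(value):
    """Turn 'file:5,database:5,feed:20' into {'file': 5, 'database': 5, 'feed': 20}."""
    threads = {}
    for item in value.split(","):
        kind, _, count = item.partition(":")
        threads[kind.strip()] = int(count)
    return threads


props = parse_properties(SAMPLE)
threads = parse_threads(props["harvest.threads"])

# The timing properties are in milliseconds.
feed_wait_s = int(props["harvest.feed.wait"]) / 1000
min_harvest_min = int(props["harvest.mintime.ms"]) / 60000

print(threads)
print(feed_wait_s, "seconds between accesses")
print(min_harvest_min, "minutes between harvests")
```

With the values shown, `harvest.feed.wait=10000` works out to 10 seconds and `harvest.mintime.ms=300000` to 5 minutes. `harvest.source.mintime.ms` is left empty in the listing, so the sketch reads it as an empty string.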
2.12 Hadoop Properties
The Hadoop config path is a local folder where Infinit.e stores map reduce jobs if Hadoop is used.
...