Hadoop configuration files

Hadoop configuration is taken from a number of places. This page is intended to provide a quick guide.

The default configuration is taken from the hadoop-core JAR, i.e. it cannot be changed.
When a job is created from the custom API/GUI, the following parameters are overridden:
- pre November 2014:
  - Where to look for the jobtracker and FS ("mapred.job.tracker", "fs.default.name")
    - These are taken from the "*-site.xml" files found in "hadoop.configpath" (in the "/hadoop" subdir)
  - All the per-job parameters (Infinit.e configuration, MongoDB configuration, mapper classes, etc.)
- November 2014 onwards:
  - All settings from the "*-site.xml" files found in "hadoop.configpath" (in the "/hadoop" subdir) override the defaults
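As a rough illustration of the November 2014 onwards behaviour, the following Python sketch reads each "*-site.xml" file and applies its properties over a dictionary of defaults. It is not the Infinit.e or Hadoop implementation. The function names, the "defaults" dictionary (standing in for the values built into the hadoop-core JAR), the "config_path" argument (standing in for the value of "hadoop.configpath") and the alphabetical order in which the files are applied are all assumptions made for the example.

```python
import glob
import os
import xml.etree.ElementTree as ET


def read_site_file(path):
    """Return the name -> value pairs found in one Hadoop-style XML file."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_config(defaults, config_path):
    """Layer every *-site.xml under <config_path>/hadoop over the defaults.

    Assumption: files are applied in alphabetical order, so a later file
    wins if two files set the same property.
    """
    merged = dict(defaults)
    pattern = os.path.join(config_path, "hadoop", "*-site.xml")
    for site_file in sorted(glob.glob(pattern)):
        merged.update(read_site_file(site_file))
    return merged
```

Calling "effective_config" with a small dictionary of defaults and the directory that "hadoop.configpath" points to returns the merged view, which can be handy for checking which site file supplies a given value.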
Many of these configuration parameters are overridden by the settings maintained in Cloudera Manager:
- It is not currently clear which ones. It is probably safest to assume that any parameter controlling the environment in which the job runs, rather than the job itself, will be overridden by Cloudera Manager.
- (Note that the Cloudera Manager configuration for a given service on each node lives in a subdirectory of "/var/run/cloudera-scm-agent/process/". To find which subdirectory, get the process id ("ps -ef"), then the working directory of that process ("pwdx $HADOOP_PID"). The current configurations can also be viewed from the Cloudera Manager GUI.)
Note that, as a result of the above, it is not necessary to redistribute the client configuration when an "environmental" setting is changed.
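The process-directory lookup mentioned above can also be scripted. The sketch below assumes a Linux host, where reading the "/proc/<pid>/cwd" link gives the same answer as "pwdx". The function names and the "proc_root" parameter are made up for the example, and the process id still has to be found first (for example with "ps -ef").

```python
import os

# Root under which Cloudera Manager keeps per-process configuration,
# as described in the text above.
CM_PROCESS_ROOT = "/var/run/cloudera-scm-agent/process/"


def process_working_dir(pid, proc_root="/proc"):
    """Return the working directory of a process, like `pwdx <pid>` on Linux."""
    return os.readlink(os.path.join(proc_root, str(pid), "cwd"))


def is_under(path, root=CM_PROCESS_ROOT):
    """True if path is root itself or lies somewhere below it.

    Both sides are resolved first, since /var/run is a symlink on many systems.
    """
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root
```

If "is_under(process_working_dir(pid))" is true, the directory returned should be the one holding the configuration that Cloudera Manager generated for that process. Reading another user's "/proc" entry will usually need root privileges.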

The configuration files in "/usr/lib/hadoop/conf" are not used at all by Cloudera or Infinit.e. However, they are used by the command line (for example "hadoop fs -ls"). Before using any command-line utilities, it is therefore recommended to copy the client configuration into "/usr/lib/hadoop/conf".
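A minimal sketch of that copy step in Python follows. This page does not say where the client configuration is kept, so the source directory is left as a parameter. The function name is hypothetical, and copying every regular file (rather than a specific subset) is an assumption.

```python
import os
import shutil


def install_client_config(client_conf_dir, target_dir="/usr/lib/hadoop/conf"):
    """Copy each regular file from client_conf_dir into target_dir.

    Returns the sorted list of file names that were copied.
    Existing files of the same name in target_dir are overwritten.
    """
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(client_conf_dir)):
        source = os.path.join(client_conf_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(target_dir, name))
            copied.append(name)
    return copied
```

Writing into "/usr/lib/hadoop/conf" will normally require elevated privileges. It may be worth backing up the existing directory first, since files with the same name are replaced.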