Configure Hadoop Machines
1 Create a hadoop user. Although you don’t have to create a dedicated hadoop user, it is good practice.
# sudo addgroup hadoop
# sudo adduser --ingroup hadoop hadoop
# sudo adduser hadoop admin
Now you can log in with
# su - hadoop
You need to create the hadoop user on each machine in your cluster.
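As a quick sanity check (an optional verification, using the group and user names created above), confirm on each machine that the account exists and belongs to the hadoop and admin groups:
# id hadoop
# groups hadoop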
2 To make it easier to reference the remote machines, edit the file /etc/hosts so that they can be addressed by alias. For example, if you have two machines with IPs 192.168.0.1 and 192.168.0.2 and you want to name them master and worker1:
192.168.0.1 {tab} master
192.168.0.2 {tab} worker1
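To confirm that the aliases resolve (an optional check, assuming the edited /etc/hosts has been saved on the machine you are testing from):
# ping -c 1 master
# ping -c 1 worker1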
3 Configure ssh to log in without a password. First, you need to generate an SSH key.
# ssh-keygen -t rsa -P ""
Enable passwordless login to the current machine:
# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Enable passwordless login to the worker1 machine:
# ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@worker1
Note that by default the identification is saved in /home/hadoop/.ssh/id_rsa and the public key is saved in /home/hadoop/.ssh/id_rsa.pub.
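To verify that passwordless login works (a small check using the hostnames from step 2; the very first connection may still ask you to accept the host key):
# ssh master hostname
# ssh worker1 hostname
Each command should print the remote hostname without prompting for a password.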
4 If you are using the Ubuntu desktop version and it does not allow you to log in over SSH, simply do
# sudo apt-get install ssh
This installs the SSH server on your machine; now you are able to log in as ssh hadoop@master without a password.
Now that your machines are configured, you need to set up the Hadoop configuration.
Configure Hadoop
1 Install hadoop to /usr/local (the commands below assume hadoop-0.20.2.tar.gz has already been downloaded there)
# cd /usr/local
# sudo tar xzf hadoop-0.20.2.tar.gz
# sudo mv hadoop-0.20.2 hadoop
# sudo chown -R hadoop:hadoop hadoop
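To confirm the archive unpacked correctly and the scripts are usable (an optional check, assuming the paths above), print the Hadoop version:
# /usr/local/hadoop/bin/hadoop version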
2 All the files that need to be configured are stored in conf, so
# cd /usr/local/hadoop/conf
3 Edit the file hadoop-env.sh and specify JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
4 You need to create a directory for hdfs (Hadoop Distributed File System).
# sudo mkdir -p $HOME/hdfs/local
# sudo chown -R hadoop:hadoop $HOME/hdfs/local
# sudo chmod 750 $HOME/hdfs/local
5 Edit core-site.xml, add in the following properties (here hadoop.tmp.dir points at the HDFS directory created in step 4):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hdfs/local</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
There are many variables that need to be specified; however, once you set the “hadoop.tmp.dir” variable, it is unnecessary to set the following four variables, as their defaults are all based on “hadoop.tmp.dir”.
dfs.name.dir = ${hadoop.tmp.dir}/dfs/name
dfs.data.dir = ${hadoop.tmp.dir}/dfs/data
mapred.local.dir= ${hadoop.tmp.dir}/mapred/local
mapred.system.dir= ${hadoop.tmp.dir}/mapred/system
Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
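For example, a hypothetical hdfs-site.xml entry that spreads block storage over two disks might look like the following (/disk1 and /disk2 are only placeholders for real mount points owned by the hadoop user):
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>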
6 Edit mapred-site.xml, add in:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
7 Edit hdfs-site.xml, add in:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
8 Edit the file named masters and add the master machine. (Despite its name, this file controls where the secondary name node runs; the name node and job tracker run on the machine where you start the daemons.)
master
9 Edit the file named slaves, add in the machines that will run the map reduce jobs. Note that the master machine can also run map reduce jobs.
master
worker1
worker2
…
Now that all the required files are configured, the following section shows how to run the hadoop software.
Run a Hadoop Program
1 Log in to the master machine and go into the hadoop directory:
# ssh hadoop@master
# cd /usr/local/hadoop
2 If this is your first time using hadoop, you need to format the name node.
# ./bin/hadoop namenode -format
3 Start the hadoop daemons,
# ./bin/start-all.sh
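If you prefer to bring up HDFS and MapReduce separately (these scripts ship in the same bin directory as start-all.sh in 0.20.x), you can run:
# ./bin/start-dfs.sh
# ./bin/start-mapred.sh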
4 To stop, do
# ./bin/stop-all.sh
5 Check if the hadoop daemons are up and running, type
# jps
There should be at least four JVMs running as daemons, i.e. JobTracker, TaskTracker, NameNode, and DataNode.
6 Check that the ports are being listened on.
# sudo netstat -plten | grep java
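To narrow the output to the ports configured above (54310 and 54311, plus the web UI ports listed in step 12), you can filter the same command; this is only a convenience:
# sudo netstat -plten | grep java | grep -E '54310|54311|50030|50070'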
7 Prepare the dataset. You need to copy a dataset into HDFS (the Hadoop Distributed File System) so that you can process it.
# ./bin/hadoop dfs -copyFromLocal /file/mydata mydata
8 See if you have the dataset in hdfs.
# ./bin/hadoop dfs -ls
9 Test hadoop with the example jar bundled with it,
# ./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount mydata mydata-output
The hadoop-<version>-examples.jar is a sample program bundled with hadoop; the wordcount example counts the words in the input files.
10 Retrieve the hadoop results
# ./bin/hadoop dfs -cat mydata-output/part-r-00000
11 Copy the results to the local file system
# ./bin/hadoop dfs -getmerge mydata-output /path/targetDir
12 Check the hadoop web interfaces. Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
* http://localhost:50030/ – web UI for MapReduce job tracker(s)
* http://localhost:50060/ – web UI for task tracker(s)
* http://localhost:50070/ – web UI for HDFS name node(s)
Known Bugs
1 IPv6 problem. To disable IPv6, add the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
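Alternatively, IPv6 can be disabled system-wide on Ubuntu via sysctl (shown here only as a sketch; this affects the whole machine, not just Hadoop). Add the following lines to /etc/sysctl.conf and reload with sudo sysctl -p:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1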
2 I was getting an error when putting data into the dfs. The solution is strange and probably inconsistent: I erased all temporary data along with the namenode, reformatted the namenode, started everything up, and visited my “cluster’s” dfs health page (http://your_host:50070/dfshealth.jsp). The last step, visiting the health page, is the only way I could get around the error. Once I’ve visited the page, putting and getting files in and out of the dfs works great!