Configure Hadoop

Configure Hadoop Machines
1 Create a hadoop user. You don't have to create a dedicated hadoop user, but it is good practice.
# sudo addgroup hadoop
# sudo adduser --ingroup hadoop hadoop
# sudo adduser hadoop admin
Now you can log in as the hadoop user:
# su - hadoop
You need to create the hadoop user on each machine in your cluster.
2 To make it easier to reference the remote machines, edit the file /etc/hosts so that they can be recognised by alias. For example, if you have two machines with IPs 192.168.0.1 and 192.168.0.2, you can name them master and worker1:
192.168.0.1	master
192.168.0.2	worker1
3 Configure ssh so you can log in without a password. First, generate an SSH key:
# ssh-keygen -t rsa -P ""
Enable passwordless login to the current machine:
# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Enable passwordless login to the worker1 machine:
# ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@worker1
Note that by default the private key is saved in /home/hadoop/.ssh/id_rsa and the public key in /home/hadoop/.ssh/id_rsa.pub.
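To verify that passwordless login works, connect to each machine once (this also stores the host key; if you are still prompted for a password, check the permissions on $HOME/.ssh and authorized_keys):
# ssh hadoop@master
# exit
# ssh hadoop@worker1
# exit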
4 If you are using the Ubuntu desktop edition and cannot log in over SSH, you probably need to install the SSH server:
# sudo apt-get install ssh
This installs the SSH server on your machine; you should now be able to log in with ssh hadoop@master without a password.
Your machines are now configured; next you need to set up the Hadoop configuration itself.
Configure Hadoop
1 Install Hadoop to /usr/local:
# cd /usr/local
# sudo tar xzf hadoop-0.20.2.tar.gz
# sudo mv hadoop-0.20.2 hadoop
# sudo chown -R hadoop:hadoop hadoop
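The commands above assume that hadoop-0.20.2.tar.gz is already sitting in /usr/local. If you have not downloaded it yet, you can fetch it first, for example from the Apache archive (the mirror URL may differ):
# cd /usr/local
# sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz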
2 All the configuration files live in the conf directory:
# cd /usr/local/hadoop/conf
3 Edit the file hadoop-env.sh and set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
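The path above is for the Sun JDK on Ubuntu. If you are unsure where Java is installed on your machine, the following commands can help you locate it:
# readlink -f $(which java)
# ls /usr/lib/jvm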
4 Create a directory for HDFS (the Hadoop Distributed File System):
# sudo mkdir -p $HOME/hdfs/local
# sudo chown -R hadoop:hadoop $HOME/hdfs/local
# sudo chmod 750 $HOME/hdfs/local
5 Edit core-site.xml and add the following properties (the hadoop.tmp.dir value here points at the directory created in step 4; adjust it if you chose a different location):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hdfs/local</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
Many other variables can be specified, but once you set hadoop.tmp.dir it is unnecessary to set the following four variables, as they all default to locations under hadoop.tmp.dir:
dfs.name.dir = ${hadoop.tmp.dir}/dfs/name
dfs.data.dir = ${hadoop.tmp.dir}/dfs/data
mapred.local.dir= ${hadoop.tmp.dir}/mapred/local
mapred.system.dir= ${hadoop.tmp.dir}/mapred/system
Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
6 Edit mapred-site.xml, add in:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
7 Edit hdfs-site.xml, add in:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
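Note that in each of these XML files the property elements must be placed inside the top-level <configuration> element, as in the standard Hadoop configuration templates. For example, hdfs-site.xml with the property from step 7 would look like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>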
8 Edit the file named masters, adding the machine (master) that runs the JobTracker and the NameNode:
master
9 Edit the file named slaves, adding the machines that will run the MapReduce jobs. Note that the master machine can also process MapReduce jobs:
master
worker1
worker2

All the required files are now configured (remember to repeat the installation and configuration on every machine in the cluster). The following shows how to run the Hadoop software.
Run a Hadoop Program
1 Log in to the master machine and go into the hadoop directory:
# ssh hadoop@master
# cd /usr/local/hadoop
2 If this is your first time using Hadoop, you need to format the NameNode (note that formatting erases any data already in HDFS):
# ./bin/hadoop namenode -format
3 Start the Hadoop daemons:
# ./bin/start-all.sh
4 To stop them, do
# ./bin/stop-all.sh
5 To check that the Hadoop daemons are up and running, type:
# jps
There should be at least four daemon JVMs running, i.e. JobTracker, TaskTracker, NameNode, and DataNode.
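On the master the output should look something like this (the PIDs are illustrative and will differ; the exact set of daemons depends on which roles the host runs):
# jps
4825 NameNode
4934 DataNode
5058 SecondaryNameNode
5132 JobTracker
5245 TaskTracker
5317 Jps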
6 Check which ports are being listened on:
# sudo netstat -plten | grep java
7 Prepare the dataset. You need to copy a dataset into HDFS (the Hadoop Distributed File System) so that Hadoop can process it:
# ./bin/hadoop dfs -copyFromLocal /file/mydata mydata
8 Check that the dataset is now in HDFS:
# ./bin/hadoop dfs -ls
9 Test Hadoop with the bundled example program:
# ./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount mydata mydata-output
hadoop-<version>-examples.jar is a sample program shipped with Hadoop; the wordcount example counts the words in the input files.
10 Retrieve the Hadoop results:
# ./bin/hadoop dfs -cat mydata-output/part-r-00000
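Each line of the wordcount output is a word followed by a tab and its count, for example (the words and counts here are illustrative):
hadoop	42
hello	17
world	9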
11 Copy the results to the local file system
# ./bin/hadoop dfs -getmerge mydata-output /path/targetDir
12 Check the Hadoop web interfaces. Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
* http://localhost:50030/ – web UI for MapReduce job tracker(s)
* http://localhost:50060/ – web UI for task tracker(s)
* http://localhost:50070/ – web UI for HDFS name node(s)
Known Bugs
1 IPv6 problem. To work around it by making Java prefer IPv4, add the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
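To check whether IPv6 is currently enabled on a Linux machine, you can inspect the kernel setting (0 means enabled, 1 means disabled; the exact path may vary with the kernel version):
# cat /proc/sys/net/ipv6/conf/all/disable_ipv6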
2 I was getting this error when putting data into the dfs. The solution is strange and probably inconsistent: I erased all temporary data along with the namenode, reformatted the namenode, started everything up, and visited my “cluster’s” dfs health page (http://your_host:50070/dfshealth.jsp). The last step, visiting the health page, is the only way I can get around the error. Once I’ve visited the page, putting and getting files in and out of the dfs works great!

Using c4.5

C4.5 is a widely used supervised learning algorithm, famous as one of the classic decision tree induction algorithms, invented by Ross Quinlan. It builds decision trees for classification tasks. C4.5 is of particular interest to some data analysis systems because of its ability to generate rules, and its variants can be found in most existing data mining systems, e.g. J48 in Weka.
You can download c4.5 release 8 from this page: rulequest. However, that version doesn't contain a batch test (analysis) mode.
So if you need to perform a procedure such as:
1 Build a classifier on one dataset
2 Run analysis on another dataset using the built classifier
you may be interested in the following version of c4.5. The version offered by Ross can only build and test a classifier in a single run, or else analyse a given data set interactively, one sample at a time.
The TSSG-KDD version of c4.5: Download file
How to use it?
Assuming you are working in a Linux terminal:
$ tar -xf c45.tar
$ cd c45/src
$ chmod 755 cit
$ ./cit
$ cd ../run
Now you are ready to run the c4.5 software. To build a classifier:
$ c4.5 -f german #german.data is the dataset
then a classifier called german.tree is produced. To use it to perform analysis on a data set called corea.data:
$ mclassify german corea #mclassify classifier-name dataset-name
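For reference, c4.5 expects each dataset stem (e.g. german) to come with a .names file defining the class values and attributes, and a .data file with one comma-separated case per line, attributes first and the class label last. A minimal sketch, with made-up attribute names and values for illustration:
german.names:
good, bad.                          | the class values
duration: continuous.               | a numeric attribute
purpose: car, furniture, education. | a discrete attribute
german.data:
24, car, good
12, education, bad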

Incremental learning

In industrial settings, data usually become available gradually, so data analysis systems need the capability to learn incrementally. Learning from new data without forgetting prior knowledge is known as incremental learning. The requirement is challenging because most fundamental supervised learning algorithms lack the ability to learn incrementally; in most cases the data analysis system simply rebuilds a new classifier on the new data set. Unfortunately, this procedure normally leads to the phenomenon known as "catastrophic forgetting": the previously learned information is lost, and the result can be even worse if the old data are no longer available.

Members of the TSSG-KDDG Group

Dr Willie Donnelly (wdonnelly :AT: tssg.org)
Micheal O Foghlu (mofoghlu :AT: tssg.org)
Barry Downes (bdownes :AT: tssg.org)
Dr Huaiguo Fu (hfu :AT: tssg.org)
Eric Robson (erobson :AT: tssg.org)
Bernard Butler (bbutler :AT: tssg.org)
Annie Ibrahim (aibrahim :AT: tssg.org)
