1. Networking setup
Make sure all machines can reach each other over the network, and update /etc/hosts on every machine. For example, our setup uses:
# /etc/hosts on all nodes
192.168.1.96 master
192.168.1.97 slave1
192.168.1.98 slave2
192.168.1.99 slave3
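To confirm that a node's /etc/hosts really carries these entries, the file can be parsed and compared against the expected mapping. The following is a minimal sketch, not part of the original setup: `parse_hosts` and `missing_entries` are hypothetical helper names, and `EXPECTED` simply repeats the addresses listed above.

```python
def parse_hosts(text):
    """Return {hostname: ip} for every entry in /etc/hosts-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping[name] = ip
    return mapping


def missing_entries(text, expected):
    """Return the expected hostnames that are absent or mapped to another IP."""
    found = parse_hosts(text)
    return sorted(h for h, ip in expected.items() if found.get(h) != ip)


# Expected mapping, copied from the example entries in this post.
EXPECTED = {
    "master": "192.168.1.96",
    "slave1": "192.168.1.97",
    "slave2": "192.168.1.98",
    "slave3": "192.168.1.99",
}
```

Calling `missing_entries(open("/etc/hosts").read(), EXPECTED)` on a node should return an empty list when that node's file is complete.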
2. Java
Install Java 6 if it is not already installed.
$ sudo add-apt-repository "deb http://archive.canonical.com/ubuntu maverick partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jre
3. Hadoop installation
Add the Cloudera CDH3 apt repository and its key, then refresh the package index.
$ sudo add-apt-repository "deb http://archive.cloudera.com/debian maverick-cdh3 contrib"
$ wget -O - http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update
On master:
$ sudo apt-get install hadoop-0.20-{namenode,datanode,jobtracker,tasktracker}
On slaves:
$ sudo apt-get install hadoop-0.20-{datanode,tasktracker}
You can find Cloudera's patches for Hadoop in /usr/lib/hadoop-0.20/cloudera/patches
4. Hadoop configuration
Configure Hadoop on all machines.
Edit core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://master:8020</value>
</property>
Edit hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Edit mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:8021</value>
</property>
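The property snippets above omit the enclosing `<configuration>` root element that Hadoop configuration files need. As a rough sketch, the files could be generated like this, assuming the conventional split of these properties across core-site.xml, hdfs-site.xml and mapred-site.xml; `render_site_xml` is a hypothetical helper, not something shipped with Hadoop or CDH3.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values taken from this post; one document per configuration file.
core_site = render_site_xml({"fs.default.name": "hdfs://master:8020"})
hdfs_site = render_site_xml({"dfs.replication": 2})
mapred_site = render_site_xml({"mapred.job.tracker": "master:8021"})
```

Each returned string can then be written to the matching file in the Hadoop configuration directory on every machine.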
Step1: Starting HDFS
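The namenode daemon must be started on master:
$ sudo service hadoop-0.20-namenode start
The datanode daemon must be started on every machine where it is installed (master and all slaves in this setup):
$ sudo service hadoop-0.20-datanode start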
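Check the namenode log on master and the datanode log on each datanode. The last part of the file name is the machine's hostname (demo in this example):
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-namenode-demo.log
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-datanode-demo.log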
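Step2: Starting MapReduce
The jobtracker daemon must be started on master:
$ sudo service hadoop-0.20-jobtracker start
The tasktracker daemon must be started on every machine where it is installed (master and all slaves in this setup):
$ sudo service hadoop-0.20-tasktracker start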
Check the jobtracker log on master and the tasktracker log on each tasktracker:
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-jobtracker-demo.log
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-tasktracker-demo.log
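Besides reading the logs, one can check from any node that the namenode and jobtracker are accepting connections on the ports configured earlier (8020 and 8021 on master). This is only a sketch of such a check; `port_open` is a hypothetical helper, and a successful TCP connection shows that something is listening, not that the daemon is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hostnames.
        return False
```

For example, `port_open("master", 8020)` probes the namenode and `port_open("master", 8021)` probes the jobtracker.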
5. Install Cassandra
You must do this on all Cassandra nodes. A common practice is to install Cassandra on every Hadoop datanode, so that every datanode is also a Cassandra node.
Import Cassandra's apt repository key:
$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -
Add Cassandra's apt repository entry to /etc/apt/sources.list and install Cassandra.
$ sudo add-apt-repository "deb http://www.apache.org/dist/cassandra/debian unstable main"
$ sudo apt-get update
$ sudo apt-get install cassandra
6. Configure Cassandra
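Edit cassandra.yaml on all nodes, replacing hostname_or_ip with the hostname or IP address of the node.
listen_address: hostname_or_ip
rpc_address: hostname_or_ip
Then check that the nodes have joined the ring:
$ /usr/bin/nodetool -host hostname_or_ip ring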
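When many nodes need the same edit, the two address settings can be rewritten with a small script instead of by hand. The sketch below is one possible approach, assuming both keys already appear at the start of a line in cassandra.yaml; `set_addresses` is a hypothetical helper name.

```python
import re


def set_addresses(yaml_text, address):
    """Point listen_address and rpc_address at the given hostname or IP."""
    for key in ("listen_address", "rpc_address"):
        # Replace the whole "key: value" line, leaving other settings alone.
        pattern = re.compile(r"^%s:.*$" % key, re.MULTILINE)
        yaml_text = pattern.sub("%s: %s" % (key, address), yaml_text)
    return yaml_text
```

On each node, read cassandra.yaml, pass its contents through `set_addresses` with that node's own hostname or IP, and write the result back.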
7. Balance Cassandra Cluster
If you add nodes to your cluster, the ring becomes unbalanced. The only way to get a perfect balance is to compute new tokens for every node and assign them manually with the nodetool move command.
Here is a Python program that calculates the new tokens for the nodes:
def tokens(nodes):
    # Print one evenly spaced token per node over the range 0 .. 2**127.
    for x in range(nodes):
        print(2 ** 127 // nodes * x)
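Assign each token to its node with nodetool move, replacing hostname_or_ip with the node and new_token with its computed token:
$ /usr/bin/nodetool -host hostname_or_ip move new_token
The status of move and balancing operations can be monitored using nodetool with the streams argument.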
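To avoid copying tokens by hand, the token calculation can be paired with the list of nodes to produce the move commands directly. This is a minimal sketch under stated assumptions: `balanced_tokens` and `move_commands` are hypothetical names, the host list is only an illustration, and the generated lines follow the `nodetool -host ... move` form used in this post.

```python
def balanced_tokens(nodes):
    """Return evenly spaced tokens over the range 0 .. 2**127."""
    return [2 ** 127 // nodes * x for x in range(nodes)]


def move_commands(hosts):
    """Pair each host with its token as a nodetool move command line."""
    tokens = balanced_tokens(len(hosts))
    return ["/usr/bin/nodetool -host %s move %d" % (h, t)
            for h, t in zip(hosts, tokens)]


# Illustrative host list; substitute the real Cassandra nodes.
for line in move_commands(["slave1", "slave2", "slave3"]):
    print(line)
```

Running the moves one node at a time and waiting for each to finish keeps the streaming load on the cluster predictable.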