Friday 13 July 2012

Add the following line to your hbase-env.sh file:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
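If you're not sure that path is correct on your nodes, a quick sanity check (assuming the java-6-sun package layout used above) is:

ls /usr/lib/jvm/
ls /usr/lib/jvm/java-6-sun/bin/java

If the second command complains that the file doesn't exist, point JAVA_HOME at whichever JVM directory the first command shows instead.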
In the same file, change this line:

export HBASE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
to look like:

export HBASE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true"
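If you'd rather script that edit than open the file by hand on every node, something roughly like this sed one-liner should do it, assuming your hbase-env.sh still has the stock HBASE_OPTS line ending in -XX:+CMSIncrementalMode":

sed -i 's/-XX:+CMSIncrementalMode"/-XX:+CMSIncrementalMode -Djava.net.preferIPv4Stack=true"/' /hadoop/hbase/conf/hbase-env.sh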
Now, modify your “regionservers” file to list all of the machines you want to host regions. Think of an HBase region as a small chunk of the data in your database: the more regionservers you have, the more data you can reliably serve. In my cluster, the regionservers are the same nodes as my datanodes and my tasktrackers, so the “regionservers” file is essentially identical to the “slaves” file from the hadoop tutorial, one hostname per line, as sketched below.
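For example, using the same $slaveX$ placeholders as the config below, my “regionservers” file would simply look like this (again, replace the placeholders with your real host names):

$slave1$
$slave2$
$slave3$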
Next, modify the hbase-site.xml file. The settings in this file override those in hbase-default.xml, so if you want to see the list of available settings, study that file, but only make changes in your hbase-site.xml. Add the following settings to hbase-site.xml:


<property>
    <name>hbase.rootdir</name>
    <value>hdfs://$master$/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>$slave1$,$slave2$,$slave3$</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/hadoop/zookeeper/data</value>
</property>
Please remember to replace $master$ and $slaveX$ with your master and slave host names respectively. You may have read that HBase 0.20 now requires ZooKeeper, but fear not: the above configuration directives let HBase manage ZooKeeper completely on its own, so you never have to mess with it.

Now, it is typically recommended to always run ZooKeeper on dedicated, ZooKeeper-only servers. If you are running a small cluster, this is hardly efficient, because you want as many nodes “working” as possible. While I can’t give you the maximum cluster size you can have before dedicated zk nodes become necessary, I can tell you that my 6 slave nodes run datanode, tasktracker, regionserver, and zookeeper without too much of a problem. I would imagine that if you have over 10 nodes in your cluster, you shouldn’t have a problem dedicating a few to ZooKeeper. It is also recommended (maybe even required) that ZooKeeper run on an odd number of machines. I don’t completely understand how ZooKeeper works, but basically, as long as more than half of your “quorum” is still intact, your cluster won’t fail. In essence, if your zk quorum has 7 nodes, you can lose 3 nodes without any adverse effects; a 35-node quorum could theoretically lose 17 nodes and still operate. ZooKeeper is basically used to keep track of the locations of regions, so your quorum tells clients and fellow regionservers where to find the data they are looking for. If zk becomes overloaded, your regionservers can time out and crash, and potentially lose data if they haven’t flushed to disk yet, so make sure you have enough horsepower for your application.

In my cluster, the hbase.zookeeper.quorum directive is simply a comma-separated list of all of my slave nodes plus my master. If you have an odd number of slaves (an even number counting your master), then just leave the master out of the list. If you have more than ten slaves, consider dedicating 3 of them to ZooKeeper if you have problems with regionservers timing out; the logs will tell you if that is the case.

Next, on each node in your zk quorum, create the ZooKeeper data directory and its myid file:

mkdir -p /hadoop/zookeeper/data && echo 'X' > /hadoop/zookeeper/data/myid

It is imperative that you replace the 'X' with '0' on the first node in your quorum, '1' on the second, '2' on the third, and so on. This file allows the node to identify itself in the zk quorum.
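For example, with the quorum configured above, you would run it on $slave1$ with '0', on $slave2$ with '1', and on $slave3$ with '2':

mkdir -p /hadoop/zookeeper/data && echo '0' > /hadoop/zookeeper/data/myid   # on $slave1$
mkdir -p /hadoop/zookeeper/data && echo '1' > /hadoop/zookeeper/data/myid   # on $slave2$
mkdir -p /hadoop/zookeeper/data && echo '2' > /hadoop/zookeeper/data/myid   # on $slave3$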
Once all that per-node work is done, you can finally start your HBase instance. From the /hadoop/hbase directory on the master, run:

bin/start-hbase.sh
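Once it is up, a rough sanity check is to run jps on the master and on one of your slaves. With the managed ZooKeeper setup above you should see, roughly, an HMaster process on the master, and HRegionServer plus HQuorumPeer processes on the quorum slaves, alongside the hadoop daemons you already had running:

jps

You can also pull up the master’s web UI (on port 60010 by default in this version of HBase) to confirm that all of your regionservers have checked in.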
