Thursday 17 January 2013

Executing Eclipse project

To open Eclipse, go to root and cd to the location of Eclipse.
Type eclipse.
Eclipse should open up; create the classes there.

Create a MapReduce project and a Java class, export it as a jar file, and save it to some location.
Let this location be /usr/mr.

From the Hadoop bin directory, type:
hadoop jar /usr/mr/wc.jar  [src folder]  [target folder]
hadoop jar /usr/mr/wc.jar  [i/p file]    // when the output is saved to HBase tables or HDFS files specified in the code
hadoop jar /usr/mr/wc.jar                // when both the input and output locations are specified beforehand in the code
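For reference, here is roughly what the main class inside wc.jar might look like. This is just the standard WordCount example (the actual classes in your own project will differ); it shows how the [src folder] and [target folder] arguments end up in args[0] and args[1] of the driver.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // emits (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // sums up the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the [src folder] and args[1] is the [target folder] from the hadoop jar command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}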

Wednesday 16 January 2013

Thrift installation

Apache Thrift Installation Tutorial:
The official documentation for installing and using Apache Thrift is currently somewhat lacking. The following are step-by-step instructions on installing Apache Thrift and getting the sample project code running on a fresh Ubuntu 10.10 installation.
1. Install the necessary dependencies.
# sudo apt-get install libssl-dev libboost-dev flex bison g++
2. Download Apache Thrift 0.7 at:
# wget http://archive.apache.org/dist/thrift/0.7.0/thrift-0.7.0.tar.gz
3. Untar the tarball to your project directory:
# cd ~/project
# tar -xzvf ~/thrift-0.7.0.tar.gz
4. Run configure (turning off support for other unused languages)
# cd thrift-0.7.0
# chmod u+x configure install-sh
# ./configure --prefix=${HOME}/project --exec-prefix=${HOME}/project --with-python=no --with-erlang=no --with-java=no --with-php=no --with-csharp=no --with-ruby=no
# make
# make install
5. Download the sample code from the course website:
# cd
# wget http://www.cs.uwaterloo.ca/~bernard/courses/cs454/sample-0.1.1.tar.gz
# cd project
# tar -xzvf ~/sample-0.1.1.tar.gz
6. Compile the WatDHT.thrift file:
# cd sample
# ~/project/bin/thrift --strict --gen cpp WatDHT.thrift
7. Replace the first 4 lines of the Makefile in the sample code with the following:
CXX = g++
CPPFLAGS = -g -fpermissive -Wall -I. -I${HOME}/project/include -I${HOME}/project/include/thrift -Igen-cpp
LDFLAGS = -L${HOME}/project/lib -lthrift -lpthread -lcrypto
LD = g++
8. Compile the sample code:
# make
9. Add to LD_LIBRARY_PATH (assuming you are using bash as your shell):
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HOME}/project/lib

Tuesday 15 January 2013

HBase and Hive Thrift PHP Client

Start the HBase and Hive Thrift servers from the shell. The HBase Thrift server can be started with $HBASE_HOME/bin/hbase thrift start (see the Python tutorial below); the Hive Thrift service is typically started with $HIVE_HOME/bin/hive --service hiveserver.

Download the Thrift PHP client packages for HBase and Hive and write your own client.

Thrift installation

In this tutorial, I am going to explain how to use python and thrift to access HBase. Here is the summary of steps you will need to follow:
1) Download thrift
2) Install thrift dependencies
3) Compile and install thrift
4) Generate HBase thrift python module
5) Add HBase thrift python module to pythonpath
6) Start HBase thrift server
7) Use the client!
The following is a detailed explanation of these steps. I am assuming that you will be using Ubuntu as your development environment; that's what I use. I am also assuming that HBase is installed and that you have HBASE_HOME defined in the environment.
1) Download thrift
Download Thrift (this tutorial uses thrift-0.3.0) from the Apache Thrift download page.
Extract the tar.gz file using tar -xvzf thrift-0.3.0.tar.gz. Let's say you extracted it into /home/horcrux/Software/thrift-0.3.0/
2) Install thrift dependencies
Thrift requires many packages for compilation: the Boost C++ libraries, flex, mkmf and other build essentials. You can install all the dependencies by executing the following commands (ruby1.8-dev is there to get mkmf installed).
sudo apt-get install build-essential
sudo apt-get install libboost1.40-dev
sudo apt-get install flex
sudo apt-get install ruby1.8-dev

3) Compile and install thrift
Execute the following commands to compile and install thrift
cd /home/horcrux/Software/thrift-0.3.0/
./configure
make
sudo make install

Now let's install the Thrift Python library. The following commands will make sure that the thrift module is on your PYTHONPATH.
cd /home/horcrux/Software/thrift-0.3.0/lib/py
sudo python setup.py install

4) Generate HBase thrift python module
Once this is done, you should have thrift in your path and be able to execute the thrift command from anywhere. Now let's generate the HBase thrift module from the Hbase.thrift definition file.
thrift --gen py $HBASE_HOME/src/java/org/apache/hadoop/hbase/thrift/Hbase.thrift
This command will create a gen-py folder in the directory from which you run it (here, the thrift folder /home/horcrux/Software/thrift-0.3.0).
5) Add HBase thrift python module to pythonpath
We need to add the gen-py folder to the python path. You can do this in any of the following ways:
a) You can add it directly at the top of your python file
import sys
sys.path.append('/home/horcrux/Software/thrift-0.3.0/gen-py')

or
b) If you are using an IDE like pydev, add it as a pythonpath source folder.
or
c) Add it to the PYTHONPATH environment variable in your .bashrc:
export PYTHONPATH=$PYTHONPATH:/home/horcrux/Software/thrift-0.3.0/gen-py
6) Start HBase thrift server
You can simply start the thrift server by executing the following command:
$HBASE_HOME/bin/hbase thrift start
This will start HBase thrift server on port 9090 (default port).
7) Use the client!
Here is a sample code that will print all the table names on your HBase server:
from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase  # the generated module from the gen-py folder


# open a buffered socket transport to the Thrift server (default port 9090)
transport = TBufferedTransport(TSocket('localhost', 9090))
transport.open()

# speak the binary protocol over the transport and create the HBase client
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
print(client.getTableNames())

transport.close()

That’s it.

Wednesday 9 January 2013

Fastest HBase Write using HBase Bulk Load

When you are trying to put millions or even billions of key-values into HBase from your MR job, you will find that even TableOutputFormat is not efficient enough.
In such cases you can use HBase's bulk load feature, which is tremendously faster than TableOutputFormat.

The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster.
The process consists of two main steps.

  • Preparing data via a MapReduce job
Data here refers to the HBase data files (StoreFiles).
To achieve this we need to change the OutputFormat class of our MR job to HFileOutputFormat, which writes out data in HBase's internal storage format.

The following are the main changes that you have to make in your MR job:
.....
        // emit the row key as an ImmutableBytesWritable and the Put as the value
        mapRedJob.setMapOutputKeyClass(ImmutableBytesWritable.class);
        mapRedJob.setMapOutputValueClass(Put.class);

        mapRedJob.setInputFormatClass(TextInputFormat.class);
        // HFileOutputFormat writes the job output as HBase StoreFiles
        mapRedJob.setOutputFormatClass(HFileOutputFormat.class);
.....
        // HBase configuration
        Configuration hConf = HBaseConfiguration.create(hadoopConf);
        hConf.set("hbase.zookeeper.quorum", zookeeper);
        hConf.set("hbase.zookeeper.property.clientPort", port);
        HTable hTable = new HTable(hConf, tableName);
        // configures the reducer and partitioner needed to produce properly ordered StoreFiles
        HFileOutputFormat.configureIncrementalLoad(mapRedJob, hTable);
.....

A test map method would look like the following,
.....
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the input line as the row key and store it under column family "CF", qualifier "C"
        Put row = new Put(Bytes.toBytes(value.toString()));
        row.add(Bytes.toBytes("CF"), Bytes.toBytes("C"), Bytes.toBytes(value.toString()));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(value.toString())), row);
    }
.....

  • Loading the Data into the HBase Table
Data can be loaded into the cluster using the command line tool 'completebulkload'.
The format is as follows,
$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/myoutput mytable
You can also load these files from your MR job programmatically by using the following code,


// after the MR job completes, load the generated StoreFiles into the running table
LoadIncrementalHFiles lihf = new LoadIncrementalHFiles(hConf);
lihf.doBulkLoad(new Path(hfileOutPutPath), hTable);
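
Putting the pieces together, a complete driver might look like the following. This is only a sketch: the class names (BulkLoadDriver, BulkLoadMapper) and the command-line argument layout are assumptions made for illustration, while the variable names (hfileOutPutPath, zookeeper, port, tableName) are carried over from the snippets above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // same idea as the test map method above: the input line becomes the row key
    // and is stored under column family "CF", qualifier "C"
    public static class BulkLoadMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            byte[] rowKey = Bytes.toBytes(value.toString());
            Put row = new Put(rowKey);
            row.add(Bytes.toBytes("CF"), Bytes.toBytes("C"), rowKey);
            context.write(new ImmutableBytesWritable(rowKey), row);
        }
    }

    public static void main(String[] args) throws Exception {
        // assumed argument layout: <input path> <hfile output path> <zookeeper quorum> <client port> <table name>
        String inputPath = args[0], hfileOutPutPath = args[1];
        String zookeeper = args[2], port = args[3], tableName = args[4];

        Configuration hadoopConf = new Configuration();
        Job mapRedJob = new Job(hadoopConf, "hbase bulk load");
        mapRedJob.setJarByClass(BulkLoadDriver.class);
        mapRedJob.setMapperClass(BulkLoadMapper.class);

        mapRedJob.setMapOutputKeyClass(ImmutableBytesWritable.class);
        mapRedJob.setMapOutputValueClass(Put.class);
        mapRedJob.setInputFormatClass(TextInputFormat.class);
        mapRedJob.setOutputFormatClass(HFileOutputFormat.class);

        FileInputFormat.addInputPath(mapRedJob, new Path(inputPath));
        FileOutputFormat.setOutputPath(mapRedJob, new Path(hfileOutPutPath));

        // HBase configuration
        Configuration hConf = HBaseConfiguration.create(hadoopConf);
        hConf.set("hbase.zookeeper.quorum", zookeeper);
        hConf.set("hbase.zookeeper.property.clientPort", port);
        HTable hTable = new HTable(hConf, tableName);
        HFileOutputFormat.configureIncrementalLoad(mapRedJob, hTable);

        if (mapRedJob.waitForCompletion(true)) {
            // once the StoreFiles are written, load them into the running table
            LoadIncrementalHFiles lihf = new LoadIncrementalHFiles(hConf);
            lihf.doBulkLoad(new Path(hfileOutPutPath), hTable);
        }
    }
}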


Try it and feel the performance improvement.