Thursday, 31 May 2012

Cassandra with hadoop

Cluster Configuration


To configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes.

You'll want to have a separate server for your Hadoop NameNode/JobTracker

Then install a Hadoop TaskTracker and hadoop datanode on each of your Cassandra nodes.
 That will allow the JobTracker to assign tasks to the Cassandra nodes that contain data for those tasks.

 Hadoop requires a distributed filesystem for copying dependency jars, static data, and intermediate results to be stored.
The nice thing about having a TaskTracker on every node is that you get data locality and your analytics engine scales with your data.

You also never need to shuttle around your data once you've performed analytics on it - you simply output to Cassandra and you are able to access that data with high random-read performance. Note that Cassandra implements the same interface as HDFS to achieve data locality.
A note on speculative execution: you may want to disable speculative execution for your hadoop jobs that either read or write to Cassandra. This isn't required, but may be helpful to reduce unnecessary load.


One configuration note on getting the task trackers to be able to perform queries over Cassandra: you'll want to update your HADOOP_CLASSPATH in your <hadoop>/conf/hadoop-env.sh to include the Cassandra lib libraries. For example you'll want to do something like this in the hadoop-env.sh on each of your task trackers:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH

Virtual Datacenter


One thing that many have asked about is whether Cassandra with Hadoop will be usable from a random access perspective. For example, you may need to use Cassandra for serving web latency requests. You may also need to run analytics over your data.

In Cassandra 0.7+ there is the NetworkTopologyStrategy which allows you to customize your cluster's replication strategy by datacenter. What you can do with this is create a 'virtual datacenter' to separate nodes that serve data with high random-read performance from nodes that are meant to be used for analytics.

You need to have a snitch configured with your topology and then according to the datacenters defined there (either explicitly or implicitly), you can indicate how many replicas you would like in each datacenter.

You would install task trackers on nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your NetworkTopologyStrategy configuration. The practical upshot of this is your analytics nodes always have current data and your high random-read performance nodes always serve data with predictable performance. 

For using PIG
  • Set the HADOOP_HOME environment variable to <hadoop_dir>, e.g. /opt/hadoop or /etc/hadoop
  • Set the PIG_CONF environment variable to <hadoop_dir>/conf
  • Set the JAVA_HOME

Hadoop/Cassandra Cluster Configuration

The recommended cluster configuration essentially overlays Hadoop over Cassandra. This involves installing a Hadoop TaskTracker on each Cassandra node. Also – and this is important – one server in the cluster should be dedicated to the following Hadoop components:
  • JobTracker
  • datanode
  • namenode
This dedicated server is required because Hadoop uses HDFS to store JAR dependencies for your job, static data, and other required information. In the overall context of your cluster, this is a very small amount of data, but it is critical to running a MapReduce job.
Apache Cassandra Hadoop MapReduce
Hadoop TaskTrackers and Cassandra Nodes
Running a Hadoop TaskTracker on a Cassandra node requires you to update the HADOOP_CLASSPATH in <hadoop>/conf/hadoop-env.sh to include the Cassandra libraries. For example, add an entry like the following in the hadoop-env.sh on each of the task tracker nodes: 
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
When a Hadoop TaskTracker runs on the same servers as the Cassandra nodes, each TaskTracker is sent tasks only for data belonging to the token range in the local Cassandra node. 

This allows tremendous gains in efficiency and processing times, as the Cassandra nodes receive only queries for which they are the primary replica, avoiding the overhead of the Gossip protocol.

Handling Input and Output from Cassandra

The class org.apache.cassandra.hadoop.ColumnFamilyInputFormat allows you to read data stored in Cassandra from a Hadoop MapReduce job, and its companion class org.apache.cassandra.hadoop.ColumnFamilyOutputFormat allows you to write the results back into Cassandra. These two classes should be properly set in your code as the format class for input and/or output:

job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
In the MapReduce job, Cassandra rows or row fragments (pairs of key + SortedMap of columns) can be input to Map tasks for processing, as specified by a SlicePredicate that describes which columns to fetch from each row. For example:
ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);



6 comments:

  1. Thank you so much for sharing this great information. Today I stand as a successful hadoop certified professional. Thanks to Big Data Training Chennai

    ReplyDelete
  2. Hi admin thanks for sharing informative article on hadoop technology. In coming years, hadoop and big data handling is going to be future of computing world. This field offer huge career prospects for talented professionals. Thus, taking Hadoop Training in Chennai will help you to enter big data technology.

    ReplyDelete
  3. Truely a very good article on how to handle the future technology. This content creates a new hope and inspiration within me. Thanks for sharing article like this. The way you have stated everything above is quite awesome. Keep blogging like this. Thanks :)

    Software testing training in chennai | Testing training in chennai | Software testing course in chennai

    ReplyDelete


  4. I havent any word to appreciate this post.....Really i am impressed from this post....the person who create this post it was a great human..thanks for shared this with us.


    Cassandra Training in Chennai

    ReplyDelete
  5. I simply wanted to write down a quick word to say thanks to you for those wonderful tips and hints you are showing on this site.
    fire and safety course in chennai


    ReplyDelete