Cluster Configuration
To configure a Cassandra cluster so that Hadoop can operate over its data, the best approach is to overlay a Hadoop cluster on top of your Cassandra nodes.
You'll want a separate server for your Hadoop NameNode/JobTracker.
Then install a Hadoop TaskTracker and DataNode on each of your Cassandra nodes.
That allows the JobTracker to assign tasks to the Cassandra nodes that hold the data for those tasks.
Hadoop still requires a distributed filesystem for storing dependency jars, static data, and intermediate results.
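As an illustration of that wiring (the hostname hadoop-master and the ports below are placeholders for your dedicated NameNode/JobTracker server), the Hadoop 0.20-era configuration on each node would look something like:
<!-- conf/core-site.xml: where the NameNode (HDFS) lives -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master:9000</value>
</property>
<!-- conf/mapred-site.xml: where the JobTracker lives -->
<property>
  <name>mapred.job.tracker</name>
  <value>hadoop-master:9001</value>
</property>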
The nice thing about having a TaskTracker
on every node is that you get data locality and your analytics engine
scales with your data.
You also never need to shuttle your data around once you've performed analytics on it: you simply write the output back to Cassandra, and you can access that data with high random-read performance. Note that Cassandra implements the same interface as HDFS to achieve data locality.
A note on speculative execution: you may want to disable speculative execution for Hadoop jobs that either read from or write to Cassandra. This isn't required, but it can help reduce unnecessary load.
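With the pre-YARN MapReduce API, speculative execution can be turned off per job through standard Hadoop properties. A minimal sketch, assuming a Job named job as in the input/output example later in this post:
// Turn off speculative execution for jobs that touch Cassandra
// (Hadoop 0.20-era property names; newer releases call these
// mapreduce.map.speculative and mapreduce.reduce.speculative).
org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);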
One configuration note on enabling the task trackers to run queries over Cassandra: you'll want to update the HADOOP_CLASSPATH in <hadoop>/conf/hadoop-env.sh to include the jars in Cassandra's lib directory. For example, add something like this to hadoop-env.sh on each of your task trackers:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
Virtual Datacenter
One thing
that many have asked about is whether Cassandra with Hadoop is
usable from a random-access perspective. For example, you may need
Cassandra to serve low-latency web requests while also needing to run
analytics over the same data.
In Cassandra 0.7+ there is the NetworkTopologyStrategy, which allows you to customize your cluster's replication strategy by datacenter. What you can do with this is create a 'virtual datacenter' that separates the nodes serving data with high random-read performance from the nodes meant for analytics.
You need a snitch configured with your topology; then, for each datacenter defined there (either explicitly or implicitly), you can indicate how many replicas you would like.
You would install task trackers on the nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your NetworkTopologyStrategy configuration. The practical upshot is that your analytics nodes always have current data and your high random-read performance nodes always serve data with predictable performance.
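A minimal sketch of this setup, assuming the PropertyFileSnitch with made-up IPs and datacenter names (the keyspace definition uses the cassandra-cli form, whose syntax varies slightly across 0.7/0.8 releases):
# conf/cassandra-topology.properties: Cassandra node IP = datacenter:rack
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC1
10.0.1.1=Analytics:RAC1
create keyspace Demo
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC1:2, Analytics:1}];
With this configuration, each row gets two replicas in DC1 for serving live reads and one replica in Analytics, and the task trackers go only on the Analytics nodes (10.0.1.1 here).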
For using PIG
- Set the HADOOP_HOME environment variable to <hadoop_dir>, e.g. /opt/hadoop or /etc/hadoop
- Set the PIG_CONF environment variable to <hadoop_dir>/conf
- Set the JAVA_HOME environment variable to your Java installation directory
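Put together, and assuming Hadoop is installed in /opt/hadoop and a Sun JDK 6 (both placeholder paths), that amounts to something like:
export HADOOP_HOME=/opt/hadoop
export PIG_CONF=$HADOOP_HOME/conf
export JAVA_HOME=/usr/lib/jvm/java-6-sun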
Hadoop/Cassandra Cluster Configuration
The recommended cluster configuration essentially overlays Hadoop over Cassandra. This involves installing a Hadoop TaskTracker on each Cassandra node. Also, and this is important, one server in the cluster should be dedicated to the following Hadoop components:
- JobTracker
- DataNode
- NameNode
Hadoop TaskTrackers and Cassandra Nodes
Running a Hadoop TaskTracker on a Cassandra node requires you to update the HADOOP_CLASSPATH in <hadoop>/conf/hadoop-env.sh to include the Cassandra libraries. For example, add an entry like the following in the hadoop-env.sh on each of the task tracker nodes:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
When a Hadoop TaskTracker runs on the same servers as the Cassandra nodes, each TaskTracker is sent tasks only for data belonging to the token range of the local Cassandra node. This allows tremendous gains in efficiency and processing times, as each Cassandra node receives only queries for data it holds locally, avoiding the overhead of shipping data between nodes.
Handling Input and Output from Cassandra
The class org.apache.cassandra.hadoop.ColumnFamilyInputFormat allows you to read data stored in Cassandra from a Hadoop MapReduce job, and its companion class org.apache.cassandra.hadoop.ColumnFamilyOutputFormat allows you to write the results back into Cassandra. These two classes should be set in your code as the format classes for input and/or output (note that the 0.7 Thrift API takes column names as ByteBuffers, hence the ByteBuffer.wrap):
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
SlicePredicate predicate = new SlicePredicate()
    .setColumn_names(Arrays.asList(ByteBuffer.wrap(columnName.getBytes())));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
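On the output side, ColumnFamilyOutputFormat expects the reducer to emit Thrift Mutation objects grouped under a row key. A hedged sketch built from the 0.7-era Thrift classes, loosely modeled on the word_count example that ships with Cassandra (the column name "count" and the value are placeholders):
// Builds one Thrift Mutation for ColumnFamilyOutputFormat to apply.
// Uses org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
// and org.apache.cassandra.utils.ByteBufferUtil.
private static Mutation getMutation(int sum) {
    Column c = new Column();
    c.setName(ByteBufferUtil.bytes("count"));               // column name (placeholder)
    c.setValue(ByteBufferUtil.bytes(String.valueOf(sum)));  // column value
    c.setTimestamp(System.currentTimeMillis());
    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
    cosc.setColumn(c);
    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(cosc);
    return m;
}
The reducer then calls context.write(rowKey, Collections.singletonList(getMutation(sum))) with the row key as a ByteBuffer.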