Cluster Configuration
To configure a Cassandra cluster so that Hadoop can operate over its data, the best approach is to overlay a Hadoop cluster on top of your Cassandra nodes.
You'll want a separate server for your Hadoop NameNode/JobTracker.
Then install a Hadoop TaskTracker and DataNode on each of your Cassandra nodes.
That allows the JobTracker to assign tasks to the Cassandra nodes that hold the data for those tasks.
Hadoop still requires a distributed filesystem for storing dependency jars, static data, and intermediate results.
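As an illustration of that wiring (the hostname hadoop-master and the ports below are placeholders for your dedicated NameNode/JobTracker server), the Hadoop 0.20-era configuration on each node would look something like:
<!-- conf/core-site.xml: where the NameNode (HDFS) lives -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master:9000</value>
</property>
<!-- conf/mapred-site.xml: where the JobTracker lives -->
<property>
  <name>mapred.job.tracker</name>
  <value>hadoop-master:9001</value>
</property>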
The nice thing about having a TaskTracker
on every node is that you get data locality and your analytics engine
scales with your data.
You also never need to shuttle your data around once you've performed analytics on it: you simply write the output back to Cassandra, and you can access that data with high random-read performance. Note that Cassandra implements the same interface as HDFS to achieve data locality.
A note on speculative execution: you may want to disable speculative execution for Hadoop jobs that either read from or write to Cassandra. This isn't required, but it can help reduce unnecessary load.
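With the pre-YARN MapReduce API, speculative execution can be turned off per job through standard Hadoop properties. A minimal sketch, assuming a Job named job as in the input/output example later in this post:
// Turn off speculative execution for jobs that touch Cassandra
// (Hadoop 0.20-era property names; newer releases call these
// mapreduce.map.speculative and mapreduce.reduce.speculative).
org.apache.hadoop.conf.Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);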
One configuration note on enabling the task trackers to run queries over Cassandra: you'll want to update the HADOOP_CLASSPATH in <hadoop>/conf/hadoop-env.sh to include the jars in Cassandra's lib directory. For example, add something like this to hadoop-env.sh on each of your task trackers:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
Virtual Datacenter
One thing
that many have asked about is whether Cassandra with Hadoop is
usable from a random-access perspective. For example, you may need
Cassandra to serve low-latency web requests while also needing to run
analytics over the same data.
In Cassandra 0.7+ there is the NetworkTopologyStrategy, which allows you to customize your cluster's replication strategy by datacenter. What you can do with this is create a 'virtual datacenter' that separates the nodes serving data with high random-read performance from the nodes meant for analytics.
You need a snitch configured with your topology; then, for each datacenter defined there (either explicitly or implicitly), you can indicate how many replicas you would like.
You would install task trackers on the nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your NetworkTopologyStrategy configuration. The practical upshot is that your analytics nodes always have current data and your high random-read performance nodes always serve data with predictable performance.
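A minimal sketch of this setup, assuming the PropertyFileSnitch with made-up IPs and datacenter names (the keyspace definition uses the cassandra-cli form, whose syntax varies slightly across 0.7/0.8 releases):
# conf/cassandra-topology.properties: Cassandra node IP = datacenter:rack
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC1
10.0.1.1=Analytics:RAC1
create keyspace Demo
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC1:2, Analytics:1}];
With this configuration, each row gets two replicas in DC1 for serving live reads and one replica in Analytics, and the task trackers go only on the Analytics nodes (10.0.1.1 here).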
For using PIG
- Set the HADOOP_HOME environment variable to <hadoop_dir>, e.g. /opt/hadoop or /etc/hadoop
- Set the PIG_CONF environment variable to <hadoop_dir>/conf
- Set the JAVA_HOME environment variable to your Java installation directory
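Put together, and assuming Hadoop is installed in /opt/hadoop and a Sun JDK 6 (both placeholder paths), that amounts to something like:
export HADOOP_HOME=/opt/hadoop
export PIG_CONF=$HADOOP_HOME/conf
export JAVA_HOME=/usr/lib/jvm/java-6-sun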
Hadoop/Cassandra Cluster Configuration
The recommended cluster configuration essentially overlays Hadoop over Cassandra. This involves installing a Hadoop TaskTracker on each Cassandra node. Also, and this is important, one server in the cluster should be dedicated to the following Hadoop components:
- JobTracker
- DataNode
- NameNode
Hadoop TaskTrackers and Cassandra Nodes
Running a Hadoop TaskTracker on a Cassandra node requires you to update the HADOOP_CLASSPATH in <hadoop>/conf/hadoop-env.sh to include the Cassandra libraries. For example, add an entry like the following in the hadoop-env.sh on each of the task tracker nodes:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
When a Hadoop TaskTracker runs on the same servers as the Cassandra nodes, each TaskTracker is sent tasks only for data belonging to the token range of the local Cassandra node. This allows tremendous gains in efficiency and processing times, as each Cassandra node receives only queries for data it holds locally, avoiding the overhead of shipping data between nodes.
Handling Input and Output from Cassandra
The class org.apache.cassandra.hadoop.ColumnFamilyInputFormat allows you to read data stored in Cassandra from a Hadoop MapReduce job, and its companion class org.apache.cassandra.hadoop.ColumnFamilyOutputFormat allows you to write the results back into Cassandra. These two classes should be set in your code as the format classes for input and/or output (note that the 0.7 Thrift API takes column names as ByteBuffers, hence the ByteBuffer.wrap):
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
SlicePredicate predicate = new SlicePredicate()
    .setColumn_names(Arrays.asList(ByteBuffer.wrap(columnName.getBytes())));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
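On the output side, ColumnFamilyOutputFormat expects the reducer to emit Thrift Mutation objects grouped under a row key. A hedged sketch built from the 0.7-era Thrift classes, loosely modeled on the word_count example that ships with Cassandra (the column name "count" and the value are placeholders):
// Builds one Thrift Mutation for ColumnFamilyOutputFormat to apply.
// Uses org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
// and org.apache.cassandra.utils.ByteBufferUtil.
private static Mutation getMutation(int sum) {
    Column c = new Column();
    c.setName(ByteBufferUtil.bytes("count"));               // column name (placeholder)
    c.setValue(ByteBufferUtil.bytes(String.valueOf(sum)));  // column value
    c.setTimestamp(System.currentTimeMillis());
    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
    cosc.setColumn(c);
    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(cosc);
    return m;
}
The reducer then calls context.write(rowKey, Collections.singletonList(getMutation(sum))) with the row key as a ByteBuffer.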