Techie Talks: Cassandra and Hadoop: which one is best?

Blog after blog argues over which NoSQL store is better, HBase or Cassandra. I think it depends on the use case.

Data Model

Cassandra

You have a token ring. Each node takes a section of the ring.
Cassandra lets you choose between a RandomPartitioner (keys placed by hash) and an OrderPreservingPartitioner (keys placed in key order).

If you use RandomPartitioner you cannot range scan on keys (only on the columns of a key), but you can use OPP and hash the keys yourself.
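To make the partitioner trade-off concrete, here is a minimal sketch of a token ring in Python. It is a toy model and not Cassandra's code: the `Ring` class, the `md5_token` helper, the node names, and the token values are all made up for illustration, and the 128-bit hash space is an assumption chosen for simplicity.

```python
import bisect
import hashlib


def md5_token(key):
    """Hash a key to an integer ring position (RandomPartitioner-style)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns the range ending at its own token."""

    def __init__(self, node_tokens):
        # node_tokens maps a (hypothetical) node name to its assigned token.
        self._entries = sorted((token, name) for name, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner(self, token):
        # First node whose token is >= the key's token; wrap past the end.
        i = bisect.bisect_left(self._tokens, token)
        return self._entries[i % len(self._entries)][1]


# Order-preserving placement: the key itself is the token, so neighbouring
# keys land on the same node and a key range scan touches few nodes.
opp_ring = Ring({"node1": "h", "node2": "p", "node3": "z"})
print(opp_ring.owner("apple"), opp_ring.owner("kiwi"), opp_ring.owner("user1"))

# Hashed placement: load spreads evenly, but key order is lost, so
# neighbouring keys may end up on unrelated nodes.
space = 2 ** 128
hash_ring = Ring({f"node{i + 1}": (i + 1) * space // 3 for i in range(3)})
print(hash_ring.owner(md5_token("user1")), hash_ring.owner(md5_token("user2")))
```

With the order-preserving ring, "user1" and "user2" always share a node, which is what makes key range scans cheap and also what makes hot spots possible.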

HBase

You have regions. When a region grows large it splits into smaller regions.
Keys are stored only in byte order, so if you insert by timestamp one region gets overloaded. Randomize your row keys if they are timestamps.
Practical people know you can hash the keys yourself.
HBase has an advantage if you want to range scan on keys.
People know you will most likely build your own secondary indexes, since a range scan on the key is not going to give you every SQL feature you could dream of.
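One common way to "hash yourself" for timestamp keys is to prefix each row key with a small salt bucket. The sketch below only builds the key bytes; it does not talk to HBase. The names `NUM_BUCKETS`, `salted_row_key`, and `scan_prefixes`, and the key layout itself, are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count; pick it to match your cluster


def salted_row_key(timestamp_ms, event_id):
    """Build a row key whose leading bytes spread writes across buckets."""
    natural = f"{timestamp_ms:013d}:{event_id}".encode("ascii")
    # Derive the bucket from the event id so the same row always maps
    # to the same key and can be read back without a lookup table.
    bucket = int(hashlib.md5(event_id.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}|".encode("ascii") + natural


def scan_prefixes():
    """Key prefixes a time-range read has to scan, one per bucket."""
    return [f"{b:02d}|".encode("ascii") for b in range(NUM_BUCKETS)]


print(salted_row_key(1338768000000, "evt-42"))
print(len(scan_prefixes()), "prefixes to scan for a time range")
```

The cost is visible in `scan_prefixes`: inserts no longer pile onto one region, but a time-range read becomes one scan per bucket that the client has to merge.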

Setup

Cassandra

Cassandra is very easy to set up and get working out of the box.
People know you are going to have to understand and tune it anyway, so being able to boot up in seconds does not mean much. Just bringing up a node and letting it auto-bootstrap will likely not be what you want.

HBase

HBase lives on top of Hadoop, which is both a blessing and a curse.
People know setting up Hadoop is not easy.
People know HBase has many moving parts (NameNode, Secondary NameNode, JobTracker (for counts), DataNodes, TaskTrackers, ZooKeeper, HBase masters).
People know it's a bigger stack.

Cluster size

Cassandra

Cassandra is actually effective at small sizes: 1, 3, or 5 nodes.
You do not get 'decent' scale-out until ~7 nodes.
People say "cool, I can grow into this even at a small shop".

HBase

Just too much setup and too many components to make sense at fewer than ~7 nodes.
Many people are looking to solve big data problems, so needing more than 7 nodes is not an issue for them. They have some iron to throw at the problem.

Scale out

Cassandra

Cassandra allows you to move nodes and have new nodes take over sections of the ring.
People have noticed that you have to keep the cluster balanced, which DOES take administrative work.
People know growing from 11 to 15 nodes is intensive (compaction, streaming, cleanup).
People see joins sometimes fail and have to do them again (annoyed).

HBase

Regions split and move frequently and automatically.
People see that they do not always move where you want.
People see that automatic moving and splitting sometimes fails. It might happen 1 time in 1,000,000, and you may have to do surgery.

Release cycle

Cassandra

The Cassandra code base moves pretty quickly and is pretty agile.
People have been waiting a while for critical features such as 'efficient moves'.
There have been some recent releases that were followed very quickly by a bug-fix release.

HBase

The HBase code base moves pretty quickly and is pretty agile.
People get worried by the blur of JIRAs per release.
People need a map to figure out the Hadoop/HBase version matrix and what works together. Hadoop 0.20 append?

Consistency

Cassandra

Cassandra allows the user to choose among consistency models.
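As a rough illustration of what choosing a consistency level means, the sketch below works out how many replicas must respond at the ONE, QUORUM, and ALL levels, and checks the usual overlap rule: a read is guaranteed to see the latest write when read replicas plus write replicas exceed the replication factor. The function names are made up for this example.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when every read set must overlap every write set (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(read_level, write_level, read_sees_latest_write(read_level, write_level, 3))
```

With a replication factor of 3, QUORUM reads with QUORUM writes overlap, as do ONE reads with ALL writes, while ONE with ONE trades that guarantee for lower latency.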

HBase

HBase offers one consistency model.

API & RPC

Cassandra

Thrift and only Thrift

HBase

Pure Java client

Administration

Cassandra

Admins are happy with one log
Log is under control in terms of events
Administration is more than just a log file and the wiki is skimpy
Best practices are skimpy too.
Schema design info is skimpy too.

HBase & Hadoop & ZK & ...

People see A LOT of log files and a lot of cryptic hadoop messages
People see this actually takes a team of people!

Facebook messaging & U

HBase

Makes sense. Many nodes!
Makes sense. Many Hadoop/HBase committers!
Makes sense. Really really really big scale!
Haters point out 'still uses memcache db'

Cassandra

At the time did NOT have online schema updates (does now in 0.7)
See the section on scale out!
Try to get them next go-around!

Transactions

Cassandra

Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in Cassandra for a write operation to report a failure to the client but still actually persist the write to a replica.

However, this does not mean that Cassandra cannot be used as an operational or real-time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
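The commit-log idea can be sketched in a few lines of Python. This is a toy single-node store, not Cassandra's implementation: the `TinyStore` class, the JSON log format, and the file name are assumptions, and it syncs the log on every write purely to keep the example simple.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy store: append to a commit log, then update the in-memory table."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # recover writes that never reached a data file
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def write(self, key, value):
        # Log first, memory second; only then is the write acknowledged.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.memtable[key] = value
        return True

    def close(self):
        self._log.close()


log_file = os.path.join(tempfile.mkdtemp(), "commit.log")
store = TinyStore(log_file)
store.write("user1", "alice")
store.write("user1", "alice-v2")
store.close()  # simulate a crash: the memtable is simply thrown away

recovered = TinyStore(log_file)  # a restart replays the log
print(recovered.memtable)
recovered.close()
```

The second `TinyStore` rebuilds its in-memory table purely from the log, which is the recovery path the paragraph above describes.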

HBase

Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster.
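A usual way around the single-region start is to create the table with split points chosen up front. The sketch below only computes evenly spaced split keys; it assumes row keys are uniformly distributed fixed-width hex strings (for example hashed keys), and `hex_split_points` is a made-up helper, not an HBase call. Passing the result to table creation is left to whatever HBase admin tooling you use.

```python
def hex_split_points(num_regions, width=8):
    """Evenly spaced split keys for row keys that are fixed-width hex strings."""
    if num_regions < 2:
        return []  # a single region needs no split points
    key_space = 16 ** width
    return [
        f"{(i * key_space) // num_regions:0{width}x}".encode("ascii")
        for i in range(1, num_regions)
    ]


# Four regions over a two-character hex key space.
print(hex_split_points(4, width=2))
```

With N regions you get N - 1 split keys, so a bulk import can fan out across region servers from the first write.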

In Cassandra, writing with a consistency level of ALL means that the data will be written to all N nodes responsible for the particular piece of data, where N is the replication factor, before the client gets a response. In a standard Cassandra configuration, the write goes into an in-memory table and an in-memory log for each node. The log is periodically batch flushed to disk; there is also an option to flush per commit, but this option severely impacts performance. Subsequent reads from any node are strongly consistent and get the most recent update.

In contrast, HBase has only one region server responsible for serving a given piece of data at any one time, and replication is handled on the HDFS layer. A client sends an update to the region server currently responsible for the update key, and the region server responds with an ack as soon as it updates its in-memory data structure and flushes the update to its write-ahead commit log. In older versions of HBase, the log was configured in a similar manner to Cassandra to flush periodically. As a few commenters have pointed out, the default configuration of more recent versions of HBase flushes the commit log before acknowledging writes to the client, using group commit to batch flushes across writes for performance. Replication to the N HDFS nodes responsible for the written data still happens asynchronously, however. HBase ensures strong consistency by routing subsequent reads through the same region server and, if a region server goes down, by using a system of locks based on ZooKeeper so that reads take into account the latest update.

Because Cassandra writes data synchronously to all N nodes in this scheme whereas HBase writes data synchronously to only one node, Cassandra is necessarily slower. In this scheme, write latency in Cassandra is essentially bottlenecked by the slowest machine and subject to variance in network speeds, IO speeds, and CPU loads across machines. HBase pays a disk cost for its forced log sync, but in high throughput environments, group commit amortizes the disk cost across concurrent requests.
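The "bottlenecked by the slowest machine" point can be shown with a little arithmetic. The sketch assumes the coordinator sends the write to all replicas in parallel and returns once the required number have acknowledged; the function name and the latency figures are invented for illustration.

```python
def write_latency_ms(replica_latencies_ms, acks_required):
    """Latency seen by the client: the time of the k-th fastest replica ack."""
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the replica count")
    return sorted(replica_latencies_ms)[acks_required - 1]


latencies = [4, 6, 35]  # hypothetical per-replica write times; one slow node

print("ONE   :", write_latency_ms(latencies, 1))
print("QUORUM:", write_latency_ms(latencies, 2))
print("ALL   :", write_latency_ms(latencies, 3))
```

At ALL, the single slow replica sets the latency for the whole write, while lower levels let the coordinator ignore it.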

The tradeoff comes in availability. Because only the write-ahead log has been replicated to the other HDFS nodes, if the region server that accepted the write fails, the ranges of data it was serving will be temporarily unavailable until a new server is assigned and the log is replayed. On the other hand, Cassandra will still have and serve the data (given the read level of ONE) even if N-1 nodes responsible for the data go down.

Let me cap this by saying both products are awesome and have many awesome people heading them up. THERE ARE MANY CASES WHERE I WOULD CHOOSE ONE OR THE OTHER!


Monday, 4 June 2012
