Techie Talks

Monday, 4 June 2012

What’s the Difference Between a SuperColumn and a SubColumn in Cassandra?

First, remember that in Cassandra terminology, “subcolumn” = “supercolumn” = “sub column” = “super column”.
With that in mind, a “super column family” is really just a “column family…that contains super columns under its rows”. (As opposed to a regular “column family” that merely contains rows without supercolumns.)

The confusion comes about because “super column family” entries look like this:

<ColumnFamily Name="Super1"

              ColumnType="Super"

              CompareWith="BytesType"

              CompareSubcolumnsWith="BytesType" />

…and plain old “column family” entries look like this:

<ColumnFamily Name="Regular1"

              CompareWith="BytesType" />

…both use a tag named “ColumnFamily” in Cassandra’s “storage-conf.xml” definition file.
Personally, I prefer using the term “Column Family” to cover both column families with rows that contain supercolumns as well as column families with rows that don’t contain supercolumns. But if someone uses the term “super column family” they always mean “a column family that contains rows that contain supercolumns.”

This article covers the difference between a supercolumn and a subcolumn in Cassandra.
Let me cut to the chase: there is no difference. They are two terms for exactly the same thing.
If you are familiar with a typical keyspace->column family->row->super column->column structure, such as the one pictured below, then you could safely replace all instances of the phrase “super column” with “subcolumn” without changing the meaning.

The confusion around “super column” vs. “sub column” is fueled largely by the Cassandra configuration file. In your “storage-conf.xml” file you will see XML “ColumnFamily” configuration elements like this:

<ColumnFamily Name="Super1"

              ColumnType="Super"

              CompareWith="BytesType"

              CompareSubcolumnsWith="BytesType" />

If this were a plain old “ColumnFamily” entry, you would only see this:

<ColumnFamily Name="Regular1"

              CompareWith="BytesType" />

…but this is a “Super Column Family”, so there are two extra attributes:

ColumnType="Super" to tell Cassandra that this column family will contain super columns.
CompareSubcolumnsWith="BytesType" to tell Cassandra that our sub columns will be sorted through byte-by-byte comparison.

Confused? If so, go back and read the last two bullets again while telling yourself:
“super column = sub column = supercolumn = subcolumn…”
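To make the two extra attributes concrete, here is a minimal Python sketch (not part of Cassandra) that reads “ColumnFamily” elements like the two above and reports whether each one describes a super column family. The helper name `describe_column_family` is made up for this example, and the sketch assumes that a missing “ColumnType” attribute means a regular column family.

```python
import xml.etree.ElementTree as ET

# The two ColumnFamily entries from this article, each on a single line.
SUPER_CF = ('<ColumnFamily Name="Super1" ColumnType="Super" '
            'CompareWith="BytesType" CompareSubcolumnsWith="BytesType" />')
REGULAR_CF = '<ColumnFamily Name="Regular1" CompareWith="BytesType" />'


def describe_column_family(xml_text):
    """Return (name, is_super, subcolumn_comparator) for one ColumnFamily element."""
    element = ET.fromstring(xml_text)
    # Assumption: no ColumnType attribute means a regular column family.
    is_super = element.get("ColumnType", "Standard") == "Super"
    return element.get("Name"), is_super, element.get("CompareSubcolumnsWith")


print(describe_column_family(SUPER_CF))
print(describe_column_family(REGULAR_CF))
```

The only things that separate the two entries are the two extra attributes; the tag name is the same in both cases.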

Introduction to Cassandra Columns, Super Columns and Rows

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.
Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra.
Remember that a Cassandra column is basically a “name=value” pair (e.g., “color=red”). You can use multiple columns to represent data such as:

"Price" : "29.99",

"Section" : "Action Figures" 

When several such groups of columns are collected into rows, the JSON representation is:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.

In JSON, this keyspace->column family->row->column data structure would be represented like this:

{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null
}
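If it helps to see the nesting as something you can run, here is a small Python sketch in which plain dictionaries stand in for Cassandra. Nothing here talks to a real server, and the helper name `get_column` is invented for this example; it only shows that reaching a value means walking keyspace, column family, row key and column name in that order.

```python
# Plain dictionaries standing in for keyspace -> column family -> row -> column.
data = {
    "ToyStore": {
        "Toys": {
            "GumDrop": {"Price": "0.25", "Section": "Candy"},
            "Transformer": {"Price": "29.99", "Section": "Action Figures"},
            "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
        }
    },
    "Keyspace1": None,
    "system": None,
}


def get_column(store, keyspace, column_family, row_key, column_name):
    """Walk the four levels in order and return the column value."""
    return store[keyspace][column_family][row_key][column_name]


print(get_column(data, "ToyStore", "Toys", "Transformer", "Price"))
```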

If you only wanted to add unrelated collections of information (e.g., “BugCollection” or “PaintColors”), you could keep adding a new keyspace for each new collection. However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store) you’d need to turn to a different kind of Cassandra element: the “super column”.
To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:

{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}

This data could also be visualized like this:

Given its late appearance, you might expect that “Ohio Store” and “New York Store” would represent super columns that span multiple rows. However, the opposite is true: “Ohio Store” and “New York Store” are now the row keys and entries like “Transformer”, “GumDrop” and “MatchboxCar” have become super column keys.
Like column keys, super column keys are indexed and sorted by a specific type (e.g., “UTF8Type”, “AsciiType”, “LongType”, “BytesType”, etc.). However, like row keys, super column entries have no values of their own; they are simply used to collect other columns.
Notice that the keys of the two groups of super columns do not match. ({“Transformer”, “GumDrop”, “MatchboxCar”} does not match {“JawBreaker”, “MatchboxCar”}. ) This is not an error: super column keys in different rows do not have to match and often will not.
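The same point can be checked with a short Python sketch. Dictionaries again stand in for the “ToyStores” column family, the helper name `super_column_keys` is invented for this example, and Python’s ordinary string sort is only a rough stand-in for a text comparator such as “UTF8Type”.

```python
# Row key -> super column key -> column name -> column value.
toy_stores = {
    "Ohio Store": {
        "Transformer": {"Price": "29.99", "Section": "Action Figures"},
        "GumDrop": {"Price": "0.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
    },
    "New York Store": {
        "JawBreaker": {"Price": "4.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
    },
}


def super_column_keys(column_family, row_key):
    """Return the super column keys stored under one row, sorted as text."""
    return sorted(column_family[row_key])


ohio = super_column_keys(toy_stores, "Ohio Store")
new_york = super_column_keys(toy_stores, "New York Store")
shared = sorted(set(ohio) & set(new_york))

print(ohio)
print(new_york)
print(shared)
```

Only “MatchboxCar” appears in both rows, and even then each row holds its own copy with its own price.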

Migrate a Relational Database Structure into a JSON

JSON stands for “JavaScript Object Notation” and is an efficient way to transfer complex information about specific entities between two separate programs.
As the “JavaScript” name implies, JSON is often used to transfer information between JavaScript-interpreting web browsers and JSON-aware web applications. In fact, native understanding of JSON is now built into most web browsers’ JavaScript interpreters.

The Original Relational Database Structure
We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”. Here is our data in tabular format:

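forests:

ID         name            trees          bushes
forest003  Black Forest    two million    three million
forest045  100 Acre Woods  four thousand  five thousand
forest127  Lonely Grove    none           one hundred

famoustrees:

ID         forestID   name              species
tree12345  forest003  Der Tree          Red Oak
tree12399  forest045  Happy Hunny Tree  Willow
tree32345  forest003  Das Ubertree      Blue Spruce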

“famoustrees” is linked to “forests” using the “forestID” foreign key. Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods” and two famous trees in the “Black Forest”.
If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:

{

  "biologicalfeatures":

    {

    "forests" :

      {

      "forest003" :

        {

          "name" : "Black Forest",

          "trees" : "two million",

          "bushes" : "three million"

        },

      "forest045" :

        {

          "name" : "100 Acre Woods",

          "trees" : "four thousand",

          "bushes" : "five thousand"

        },

      "forest127" :

        {

          "name" : "Lonely Grove",

          "trees" : "none",

          "bushes" : "one hundred"

        }

      },

    "famoustrees" :

      {

      "tree12345" :

        {

          "forestID" : "forest003",

          "name" : "Der Tree",

          "species" : "Red Oak"

        },

      "tree12399" :

        {

          "forestID" : "forest045",

          "name" : "Happy Hunny Tree",

          "species" : "Willow"

        },

      "tree32345" :

        {

          "forestID" : "forest003",

          "name" : "Das Ubertree",

          "species" : "Blue Spruce"

        }

      }

    }

}

Denormalizing the Tables
To collapse the famoustrees table into our forests table, we need to move each famoustree entry underneath its forest entry. We can also remove the foreign “forestID” key from each famoustree entry – we don’t need that anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry. We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too. We’ll tag each of these sections with a new ID of “generalinfo”. (This is a Cassandra-friendly convention – we’ll get into this more below.)

Represented in JSON, our data now looks like this:

{

  "biologicalfeatures":

    {

    "forests" :

      {

      "forest003" :

        {

        "generalinfo" :

          {

          "name" : "Black Forest",

          "trees" : "two million",

          "bushes" : "three million"

          },

        "tree12345" :

          {

            "type" : "famoustree",

            "name" : "Der Tree",

            "species" : "Red Oak"

          },

        "tree32345" :

          {

            "type" : "famoustree",

            "name" : "Das Ubertree",

            "species" : "Blue Spruce"

          }

        },

      "forest045" :

        {

        "generalinfo" :

          {

          "name" : "100 Acre Woods",

          "trees" : "four thousand",

          "bushes" : "five thousand"

          },

        "tree12399" :

          {

            "type" : "famoustree",

            "name" : "Happy Hunny Tree",

            "species" : "Willow"

          }

        },

      "forest127" :

        {

        "generalinfo" :

          {

          "name" : "Lonely Grove",

          "trees" : "none",

          "bushes" : "one hundred"

          }

        }

      }

    }

}
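The three denormalizing steps described above can be sketched in a few lines of Python. This is only an illustration using the article’s data held in dictionaries; the function name `denormalize` is invented here, and a real migration would read from your relational database instead.

```python
import json

forests = {
    "forest003": {"name": "Black Forest", "trees": "two million", "bushes": "three million"},
    "forest045": {"name": "100 Acre Woods", "trees": "four thousand", "bushes": "five thousand"},
    "forest127": {"name": "Lonely Grove", "trees": "none", "bushes": "one hundred"},
}
famoustrees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree", "species": "Red Oak"},
    "tree12399": {"forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow"},
    "tree32345": {"forestID": "forest003", "name": "Das Ubertree", "species": "Blue Spruce"},
}


def denormalize(forests, famoustrees):
    """Fold each famous tree into its parent forest entry."""
    # Wrap each forest's own fields in a "generalinfo" section.
    result = {forest_id: {"generalinfo": dict(info)} for forest_id, info in forests.items()}
    for tree_id, tree in famoustrees.items():
        # Tag the entry with its type and drop the foreign key.
        entry = {"type": "famoustree"}
        entry.update({k: v for k, v in tree.items() if k != "forestID"})
        # File the tree underneath the forest it belongs to.
        result[tree["forestID"]][tree_id] = entry
    return result


denormalized = denormalize(forests, famoustrees)
print(json.dumps({"biologicalfeatures": {"forests": denormalized}}, indent=2))
```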

Ready for Cassandra?
There are really only two types of JSON data structures that can be imported directly into Cassandra. One is the
keyspace->columnfamily->rowkey->column
data structure shown below:

{

  "keyspace":

    {

    "columnfamily" :

      {

      "rowkey" :

        {

          "column name" : "column value"

        }

      }

    }

}

Add another layer and you get the other supported data structure
keyspace->columnfamily (a.k.a. “supercolumnfamily”)->rowkey->supercolumn (a.k.a. “subcolumn”)->column
shown below:

{

  "keyspace":

    {

    "columnfamily" :

      {

      "rowkey" :

        {

        "supercolumn" :

          {

          "column name" : "column value"

          }

        }

      }

    }

}

That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be input into Cassandra.
You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.
Let’s break it down:
biologicalfeatures -> forests
…matches the keystore->columnfamily structure used by both supported JSON structures.
As for the rest:
forest003 -> generalinfo -> (name="Black Forest")
…matches the rowkey->supercolumn->column structure used by the “supercolumn” supported JSON structure.
So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
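If you would rather let a script do the scrolling, the check can be sketched in Python by measuring how deeply every value is nested. The function names `leaf_depths` and `classify` are invented for this example, and the rule is deliberately strict: every value must sit at exactly the same depth.

```python
def leaf_depths(value, depth=0):
    """Yield the nesting depth of every non-dictionary value."""
    if isinstance(value, dict):
        for child in value.values():
            yield from leaf_depths(child, depth + 1)
    else:
        yield depth


def classify(document):
    """Report which of the two supported shapes a nested dictionary matches."""
    depths = set(leaf_depths(document))
    if depths == {4}:
        return "standard"  # keyspace -> column family -> row key -> column
    if depths == {5}:
        return "super"     # one extra super column layer
    return "unsupported"


sample = {
    "biologicalfeatures": {
        "forests": {
            "forest003": {
                "generalinfo": {"name": "Black Forest"},
                "tree12345": {"type": "famoustree", "name": "Der Tree"},
            }
        }
    }
}
print(classify(sample))
```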

How to Add and Retrieve Data from a Cassandra Database

This article describes how to create a new keyspace on a Cassandra database server, how to add data to that keyspace and how to run some simple queries against that data.

Create a New Keyspace on Cassandra
First, sign on to your Cassandra server using the “cassandra-cli” client. Use the “show keyspaces” command to ensure you have a live connection to the server and to make sure the keyspace you are about to add doesn’t already exist.

cassandra> show keyspaces
Keyspace1
system

These two keyspaces were created automatically when you installed Cassandra and are completely independent of one another – like separate databases on a relational database system would be. A diagram of two separate keyspaces in our Cassandra database would look like this:

We want to add a new keyspace called “ToyStore”. Once we’re done, we’d expect our diagram to look like this:

To create a new, empty keyspace called “ToyStore”, type:
CREATE KEYSPACE ToyStore with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}];

Then switch to the new keyspace:
use ToyStore;

Next, create a “column family” (if you’re a relational database user think “table” for now) called “Toys” using the statement below.
The “comparator” attribute in that statement controls how column names are sorted within each row, and “key_validation_class” sets the type of the row keys. The “UTF8Type” value indicates UTF-8 text. Other possible values include “AsciiType”, “LongType” (64-bit long integers) and “BytesType” (straight byte-by-byte comparison – the default value).

CREATE COLUMN FAMILY Toys
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: name, validation_class: UTF8Type},
{column_name: price, validation_class: LongType}
];
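To see why the choice of comparator type matters, here is a rough Python sketch that sorts the same three names the way a text type and a numeric type would. Python’s built-in sort is only a stand-in for Cassandra’s comparators, and the decimal strings are used purely for readability (a real “LongType” value is a 64-bit integer, not a string).

```python
names = ["9", "10", "100"]

# UTF8Type / AsciiType style: compare the names as text.
as_text = sorted(names)

# LongType style: compare the names as integers.
as_longs = sorted(names, key=int)

# BytesType style: compare the raw bytes of each name.
as_bytes = sorted(names, key=lambda name: name.encode("utf-8"))

print(as_text)
print(as_longs)
print(as_bytes)
```

For plain ASCII names the text and byte orderings agree, but the numeric ordering is different, so it pays to pick the type that matches how you want to read the data back.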

Add Data To An Existing Keyspace on Cassandra
Now that we have a new “ToyStore” keyspace it’s time to add some data. If you were watching closely you’ll notice that we did more than add a keyspace in the previous step: we added our first “column family” too. (Think “table” if you’re coming from a relational database background.)

To get started adding data, restart your Cassandra client and use the following syntax to add six name/value pairs to the “Toys” column family of your new “ToyStore” keyspace.

cassandra> set ToyStore.Toys['Transformer']['Price'] = '29.99'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Price'] = '0.25'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Price'] = '1.49'
Value inserted.
cassandra> set ToyStore.Toys['Transformer']['Section'] = 'Action Figures'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Section'] = 'Candy'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Section'] = 'Vehicles'
Value inserted.

If you run a “help” command from the Cassandra client you will see the following syntax for the kind of “set” command we just used:

set <ksp>.<cf>['<key>']['<col>'] = '<value>'

Let’s break this command syntax down using one of the commands we just typed.

set ToyStore.Toys['Transformer']['Price'] = '29.99'

According to our command syntax, the command we typed meant this:

ksp = KeySpace = “ToyStore”
cf = Column Family = “Toys”
key = Row Key (an indexed key which links multiple columns) = “Transformer”
col = Single Column Name (the name in a single name/value pair) = “Price”
value = Single Column Value (the value in a single name/value pair) = “29.99”
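The same breakdown can be done mechanically. The Python sketch below uses a regular expression to split a “set” command of this exact shape into its five parts; it is an illustration only, not how the Cassandra client parses its input, and the names `SET_PATTERN` and `parse_set` are invented for this example.

```python
import re

# Matches: set <ksp>.<cf>['<key>']['<col>'] = '<value>'
SET_PATTERN = re.compile(
    r"^set\s+(?P<ksp>\w+)\.(?P<cf>\w+)"
    r"\['(?P<key>[^']*)'\]\['(?P<col>[^']*)'\]"
    r"\s*=\s*'(?P<value>[^']*)'$"
)


def parse_set(command):
    """Split a set command into keyspace, column family, row key, column and value."""
    match = SET_PATTERN.match(command.strip())
    if match is None:
        raise ValueError("not a recognised set command: " + command)
    return match.groupdict()


parts = parse_set("set ToyStore.Toys['Transformer']['Price'] = '29.99'")
for name, text in parts.items():
    print(name, "=", text)
```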

These six commands created a total of three rows in the “Toys” column family: “Transformer”, “GumDrop” and “MatchboxCar”. Within each row you created two columns: “Section” and “Price”. Sketched out in a diagram the data you inserted would look something like this:

Within the “Toys” column family, you could also represent this data in JSON like this:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

Starting to make sense? Now, let’s try to pull this data back.

Retrieve Data From An Existing Keyspace on Cassandra
Let’s start by counting the number of name/value pairs (i.e., “columns”) stored under one of the row keys we just inserted.

cassandra> count ToyStore.Toys['GumDrop']
2 columns

If you followed directions, the answer will be “2 columns”, whether you use “GumDrop”, “Transformer” or “MatchboxCar” as your row key.
Now try spelling out the row key in all lowercase.

cassandra> count ToyStore.Toys['gumdrop']
0 columns

Yes, Cassandra row keys are case-sensitive. Consider yourself warned, especially if you’re coming from a database environment where cases are insensitive.

Now try a row key that doesn’t exist.

cassandra> count ToyStore.Toys['RedMatterBall']
0 columns

Notice that you didn’t get a “no column exists” error on your count statement; instead you were simply told that zero name/value pairs exist for your non-existent row key.
Now that you know to be careful with the exact name and case of your row keys, let’s pull back the data in a particular row instead of just counting how many columns it contains. To do this, use the “get” command as shown below.

cassandra> get ToyStore.Toys['GumDrop']
=> (column=Section, value=Candy, timestamp=1278132493790000)
=> (column=Price, value=0.25, timestamp=1278132306875000)
Returned 2 results.

The two “column” and “value” entries look familiar but there’s a third item in each of our columns: “timestamp”. That value represents the time when you made each column entry. Timestamp may not mean much to us yet (we will safely ignore it for another article or two), but timestamp will mean a great deal to us when we start merging column inserts/updates from two or more Cassandra database nodes.
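As a rough preview, the Python sketch below reads the timestamps shown above as microseconds since the Unix epoch (an assumption based on their size) and picks between two copies of the same column by keeping the one with the larger timestamp. The function names and the second copy of the “Price” column are invented for this example.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(microseconds):
    """Interpret a column timestamp as microseconds since the Unix epoch (assumed)."""
    seconds, micros = divmod(microseconds, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


def newest(copy_a, copy_b):
    """Keep whichever copy of a column carries the larger timestamp."""
    return copy_a if copy_a["timestamp"] >= copy_b["timestamp"] else copy_b


# The first copy comes from the output above; the second is hypothetical.
from_node_a = {"column": "Price", "value": "0.25", "timestamp": 1278132306875000}
from_node_b = {"column": "Price", "value": "0.30", "timestamp": 1278132306875001}

print(timestamp_to_datetime(1278132493790000).isoformat())
print(newest(from_node_a, from_node_b)["value"])
```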
By the way, here’s how you could represent the timestamp on each column in your diagram:

But back to our data retrieval task. Before we move on, try at least one row key that doesn’t exist.

cassandra> get ToyStore.Toys['RedMatterBall']
Returned 0 results.

Again, note that Cassandra reports that there are “0 results” for this row key, not that this row key doesn’t exist.
The last thing we’re going to do in this article is drill down into an existing row and only pick out one column (i.e., one name/value pair).

cassandra> get ToyStore.Toys['GumDrop']['Price']
=> (column=Price, value=0.25, timestamp=1278132306875000)

Now try this with a valid row key and an invalid column.

cassandra> get ToyStore.Toys['GumDrop']['Taste']
Exception null

This time we got an error rather than a “count of zero” message!
Relational database folks, are you starting to see the pattern? (Hint: Using non-existent row keys is like executing a “SELECT COUNT(*) FROM DB” with a WHERE clause that matches nothing, but using non-existent column names is like executing a query with invalid fields.)
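The pattern is easy to imitate with Python dictionaries. In this sketch the functions `count` and `get` are invented stand-ins that mimic the behavior seen in this session: a missing row key quietly yields nothing, while a missing column name raises an error (a `KeyError` here, in place of the client’s “Exception null”).

```python
toys = {
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "GumDrop": {"Price": "0.25", "Section": "Candy"},
    "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
}


def count(column_family, row_key):
    """Number of columns under a row key; zero if the row key is unknown."""
    return len(column_family.get(row_key, {}))


def get(column_family, row_key, column_name=None):
    """Return a whole row, or one column value when a column name is given."""
    row = column_family.get(row_key, {})
    if column_name is None:
        return row
    if column_name not in row:
        # Stand-in for the error the client reported for an unknown column.
        raise KeyError(column_name)
    return row[column_name]


print(count(toys, "GumDrop"))
print(count(toys, "gumdrop"))
print(get(toys, "RedMatterBall"))
print(get(toys, "GumDrop", "Price"))
```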

Hadoop installation in Pseudo distributed mode tutorial

This document covers the steps to:

1) Configure SSH

2) Install JDK

3) Install Hadoop

Update your repository

#sudo apt-get update

Hadoop uses SSH to authenticate the connections it opens to its nodes.

Let's download and configure SSH

#sudo apt-get install openssh-server openssh-client

#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

#sudo chmod go-w $HOME $HOME/.ssh

#sudo chmod 600 $HOME/.ssh/authorized_keys

#sudo chown `whoami` $HOME/.ssh/authorized_keys

Testing your SSH

#ssh localhost

Answer yes when asked to confirm the connection

It should open an SSH connection

#exit

This will close the SSH session

Java 1.6 is mandatory for running Hadoop

Let's download and install the JDK

#sudo mkdir /usr/java

#cd /usr/java

#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin

Wait till the jdk download completes

Install java

#sudo chmod o+w jdk-6u31-linux-i586.bin

#sudo chmod +x jdk-6u31-linux-i586.bin

#sudo ./jdk-6u31-linux-i586.bin

Now comes the Hadoop :)

Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.

Download the latest hadoop version from its website

http://hadoop.apache.org/common/releases.html

Download the hadoop 1.0.x tar.gz file from the Hadoop website

Extract it into some folder (say /home/hadoop/software/20/)

In this tutorial, all the software was downloaded to that location

Go to the conf directory in the hadoop folder, open core-site.xml and add the following property inside the empty configuration tags:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>
</configuration>

Do the same for

conf/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>

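All three files share the same layout: a configuration element holding property entries, each with a name and a value. If you want to double-check what you typed, a short Python sketch like the one below can read the properties back. It is not a Hadoop tool; the function name `read_properties` is invented for this example, and the XML is embedded as text here so the snippet runs on its own (in practice you would read the files from your conf directory).

```python
import xml.etree.ElementTree as ET

CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>"""

MAPRED_SITE = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dictionary for every property in a config file."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


print(read_properties(CORE_SITE))
print(read_properties(MAPRED_SITE))
```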
Environment variables

In the hadoop-env.sh file (in the conf directory), set JAVA_HOME to the location where you installed Java

e.g.

export JAVA_HOME=/usr/java/jdk1.6.0_31

Configure the environment variables for the JDK and Hadoop as follows

Open the ~/.profile file in the current user's home directory

Add the following

You can change the variable paths if you have installed hadoop and java at some other locations

export JAVA_HOME="/usr/java/jdk1.6.0_31"

export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"

export PATH=$PATH:$HADOOP_INSTALL/bin

Testing your installation

Format the HDFS

# hadoop namenode -format

hadoop@jj-VirtualBox:~$ start-dfs.sh

starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out

localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out

localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out

hadoop@jj-VirtualBox:~$ start-mapred.sh

starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out

localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out

Open a browser and visit the following pages

localhost:50030 (JobTracker)

localhost:50070 (NameNode)

Each one should open a Hadoop status page

That's it. This completes the installation of Hadoop, and you are now ready to play with it.

Hadoop course

1. An Introduction To Hadoop And HDFS

Why Hadoop?
HDFS
MapReduce
Hive, Pig, HBase and other ecosystem projects
Hands-On Exercise: Installing a pseudo-distributed cluster

2. Planning Your Hadoop Cluster

General Planning Considerations
Choosing The Right Hardware
Node Topologies
Choosing The Right Software

3. Deploying Your Cluster

Installing Hadoop
Using SCM Express for easy installation
Typical Configuration Parameters
Configuring Rack Awareness
Using Configuration Management Tools
Hands-On Exercise: Installing a Hadoop Cluster

4. Cluster Maintenance

Checking HDFS with fsck
Hands-On Exercise: Breaking the Cluster
Copying data with distcp
Rebalancing cluster nodes
Adding and removing cluster nodes
Hands-On Exercise: Verifying the Cluster's Self-Healing Features
Backup And Restore
Upgrading and Migrating
Hands-On Exercise: Backing Up and Restoring the NameNode Metadata

5. Cloudera Certified Administrator Exam

Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Administrator exam

6. Managing and Scheduling Jobs

Starting and stopping MapReduce jobs
Hands-On Exercise: Managing jobs
The FIFO Scheduler
The Fair Scheduler
Hands-On Exercise: Using the FairScheduler

7. Installing And Managing Other Hadoop Projects

Hive
Pig
HBase
Hands-On Exercise: Configuring the Hive Shared Metastore

8. Populating HDFS From External Sources

Using Sqoop
Using Flume
Best Practices for Data Ingestion

9. Cluster Monitoring, Troubleshooting and Optimizing

Hadoop Log Files
Using the NameNode and JobTracker Web UIs
Interpreting Job Logs
Monitoring with Ganglia
Other monitoring tools
General Optimization Tips
Benchmarking Your Cluster

As developers, we should know the following

1. The Motivation For Hadoop

    Problems with traditional large-scale systems
    Requirements for a new approach

2. Hadoop: Basic Concepts

    An Overview of Hadoop
    The Hadoop Distributed File System
    Hands-On Exercise
    How MapReduce Works
    Hands-On Exercise
    Anatomy of a Hadoop Cluster
    Other Hadoop Ecosystem Components

3. Writing a MapReduce Program

    The MapReduce Flow
    Examining a Sample MapReduce Program
    Basic MapReduce API Concepts
    The Driver Code
    The Mapper
    The Reducer
    Hadoop’s Streaming API
    Using Eclipse for Rapid Development
    Hands-on exercise

4. Integrating Hadoop Into The Workflow

    Relational Database Management Systems
    Storage Systems
    Importing Data from RDBMSs With Sqoop
    Hands-On Exercise
    Importing Real-Time Data with Flume
    Accessing HDFS Using FuseDFS and Hoop

5. More Advanced MapReduce Programming

    Custom Writables and WritableComparables
    Saving Binary Data using SequenceFiles and Avro Files
    Creating InputFormats and OutputFormats
    Hands-on exercise

6. Graph Manipulation in Hadoop

    Introduction to graph techniques
    Representing graphs in Hadoop
    Implementing a sample algorithm: Single Source Shortest Path

7. Cloudera Certified Hadoop Developer Exam

    Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Developer exam

8. Using Hive and Pig

    Hive Basics
    Pig Basics
    Hands-on exercise

9. Delving Deeper Into The Hadoop API

    Using LocalJobRunner Mode for Faster Development
    Reducing Intermediate Data With Combiners
    The configure and close methods for Map/Reduce Setup and Teardown
    Writing Partitioners for Better Load Balancing
    Directly Accessing HDFS
    Using the Distributed Cache
    Hands-On Exercise

10. Practical Development Tips and Techniques

    Testing with MRUnit
    Debugging MapReduce Code
    Using LocalJobRunner Mode For Easier Debugging
    Retrieving Job Information with Counters
    Logging
    Splittable File Formats
    Determining the Optimal Number of Reducers
    Map-Only MapReduce Jobs
    Implementing Multiple Mappers using ChainMapper
    Hands-On Exercise

11. Common MapReduce Algorithms

    Sorting and Searching
    Indexing
    Machine Learning With Mahout
    Term Frequency – Inverse Document Frequency
    Word Co-Occurrence
    Hands-On Exercise

12. Joining Data Sets in MapReduce Jobs

    Map-Side Joins
    The Secondary Sort
    Reduce-Side Joins
    Hands-On Exercise

13. Creating Workflows with Oozie

    The Motivation for Oozie
    Oozie's Workflow Definition Format
    Hands-On Exercise

Syllabus guidelines for Developer exam

Core Hadoop Concepts

Storing Files in Hadoop

Job Configuration and Submission

Job Execution Environment

Input and Output

Job Lifecycle

Data processing

Key and Value Types

Common Algorithms and Design Patterns

The Hadoop Ecosystem

Syllabus guidelines for Admin exam

Apache Hadoop Cluster Core Technologies

Apache Hadoop Cluster Planning

Apache Hadoop Cluster Management

Job Scheduling

Monitoring and Logging

refer

http://www.philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/
http://www.cs.bgu.ac.il/~dsp112/The_Map-Reduce_Pattern
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/