Monday, 4 June 2012

What’s the Difference Between a SuperColumn and a SubColumn in Cassandra?

First, remember that in Cassandra terminology, “subcolumn” = “supercolumn” = “sub column” = “super column”.
With that in mind, a “super column family” is really just a “column family…that contains super columns under its rows”.  (As opposed to a regular “column family” that merely contains rows without supercolumns.)


The confusion comes about because “super column family” entries look like this:

<ColumnFamily Name="Super1"
              ColumnType="Super"
              CompareWith="BytesType"
              CompareSubcolumnsWith="BytesType" />
…and plain old “column family” entries look like this:

<ColumnFamily Name="Regular1"
              CompareWith="BytesType" />

…both use a tag named “ColumnFamily” in Cassandra’s “storage-conf.xml” definition file.
Personally, I prefer using the term “Column Family” to cover both column families with rows that contain supercolumns as well as column families with rows that don’t contain supercolumns.  But if someone uses the term “super column family” they always mean “a column family that contains rows that contain supercolumns.”

This article covers the difference between a supercolumn and a subcolumn in Cassandra.
Let me cut to the chase: there is no difference.  They are two terms for exactly the same thing.
If you are familiar with a typical keyspace->column family->row->super column->column structure, such as the one pictured below, then you could safely replace all instances of the phrase “super column” with “subcolumn” without changing the meaning.

The confusion around “super column” vs. “sub column” is fueled largely by the Cassandra configuration file.  In your “storage-conf.xml” file you will see XML “ColumnFamily” configuration elements like this:

<ColumnFamily Name="Super1"
              ColumnType="Super"
              CompareWith="BytesType"
              CompareSubcolumnsWith="BytesType" />
If this were a plain old “ColumnFamily” entry, you would only see this:

<ColumnFamily Name="Regular1"
              CompareWith="BytesType" />
…but this is a “Super Column Family”, so there are two extra attributes:
  • ColumnType="Super" to tell Cassandra that this column family will contain super columns.
  • CompareSubcolumnsWith="BytesType" to tell Cassandra that our sub columns will be sorted through byte-by-byte comparison.
Confused?  If so, go back and read the last two bullets again while telling yourself:
“super column = sub column = supercolumn = subcolumn…”
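As a rough illustration of the two configuration entries above, here is a minimal Python sketch that reads both “ColumnFamily” elements and reports which one is a super column family. The wrapping “Snippet” root element and the function name are assumptions made only so the example parses on its own; they are not part of Cassandra’s configuration format.

```python
import xml.etree.ElementTree as ET

# The two ColumnFamily entries quoted above. The <Snippet> root element is
# hypothetical; it only exists so both entries parse as one XML document.
SNIPPET = """
<Snippet>
  <ColumnFamily Name="Super1"
                ColumnType="Super"
                CompareWith="BytesType"
                CompareSubcolumnsWith="BytesType" />
  <ColumnFamily Name="Regular1"
                CompareWith="BytesType" />
</Snippet>
"""


def describe_column_families(xml_text):
    """Map each column family name to 'super' or 'standard'."""
    root = ET.fromstring(xml_text)
    kinds = {}
    for cf in root.iter("ColumnFamily"):
        # Only super column families carry ColumnType="Super".
        is_super = cf.get("ColumnType") == "Super"
        kinds[cf.get("Name")] = "super" if is_super else "standard"
    return kinds


print(describe_column_families(SNIPPET))
```

The only thing the sketch relies on is the presence of the ColumnType attribute; the comparator attributes are ignored.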

Introduction to Cassandra Columns, Super Columns and Rows

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.
Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra.
Remember that a Cassandra column is basically a “name=value” pair (e.g., “color=red”).  You can use multiple columns to represent data such as

"Price" : "29.99",
"Section" : "Action Figures"

Grouped into rows, the JSON representation is:
{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}
The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.
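To make the row/column grouping concrete, the same data can be held in a Python dictionary of dictionaries. This is only a sketch of the data model, not of how Cassandra stores anything; the helper names are made up for the illustration.

```python
import json

# Row key -> {column name -> column value}, copied from the JSON above.
toys = {
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "GumDrop": {"Price": "0.25", "Section": "Candy"},
    "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
}


def row_keys(column_family):
    """Return the keys that group related columns into rows."""
    return list(column_family)


def column_value(column_family, row_key, column_name):
    """Look up one name/value pair inside one row."""
    return column_family[row_key][column_name]


print(row_keys(toys))
print(column_value(toys, "GumDrop", "Section"))
print(json.dumps(toys, indent=2))
```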


In JSON, this keyspace->column family->row->column data structure would be represented like this:
{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null
}
If you wanted to add other types of unrelated collections of information (e.g., “BugCollection” or “PaintColors”), you’d simply keep adding new keyspaces for each new collection.  However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store) you’d need to turn to a different kind of Cassandra element: the “super column”.
To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:
{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}
This data could also be visualized like this:

Given its late appearance, you might expect that “Ohio Store” and “New York Store” would represent super columns that span multiple rows.  However, the opposite is true: “Ohio Store” and “New York Store” are now the row keys, and entries like “Transformer”, “GumDrop” and “MatchboxCar” have become super column keys.
Like column keys, super column keys are indexed and sorted by a specific type (e.g., “UTF8Type”, “AsciiType”, “LongType”, “BytesType”, etc.).  However, like row keys, super column entries have no values of their own; they are simply used to collect other columns.
Notice that the keys of the two groups of super columns do not match.  ({“Transformer”, “GumDrop”, “MatchboxCar”} does not match {“JawBreaker”, “MatchboxCar”}. )  This is not an error: super column keys in different rows do not have to match and often will not.
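The point about non-matching super column keys can be checked with a small Python sketch. It models the “ToyStores” column family as nested dictionaries (row key, then super column key, then column); the helper names are assumptions made for the illustration.

```python
# Row key -> super column key -> {column name -> column value}.
toy_stores = {
    "Ohio Store": {
        "Transformer": {"Price": "29.99", "Section": "Action Figures"},
        "GumDrop": {"Price": "0.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
    },
    "New York Store": {
        "JawBreaker": {"Price": "4.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
    },
}


def super_column_keys(column_family):
    """List the super column keys found under each row key."""
    return {row: sorted(supers) for row, supers in column_family.items()}


def shared_super_columns(column_family):
    """Return the super column keys that appear in every row."""
    key_sets = [set(supers) for supers in column_family.values()]
    return set.intersection(*key_sets) if key_sets else set()


print(super_column_keys(toy_stores))
print(shared_super_columns(toy_stores))
```

Only “MatchboxCar” appears in both rows, and even there the two stores hold different prices, because each row carries its own copy of the super column.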

Migrate a Relational Database Structure into a JSON

JSON stands for “JavaScript Object Notation” and is an efficient way to transfer complex information about specific entities between two separate programs.
As the “JavaScript” name implies, JSON is often used to transfer information between JavaScript-interpreting web browsers and JSON-aware web applications.  In fact, native understanding of JSON is now built into most web browsers’ JavaScript interpreters.

The Original Relational Database Structure
We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”.  Here is our data in tabular format:

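forests:

forestID   name            trees          bushes
forest003  Black Forest    two million    three million
forest045  100 Acre Woods  four thousand  five thousand
forest127  Lonely Grove    none           one hundred

famoustrees:

treeID     forestID   name              species
tree12345  forest003  Der Tree          Red Oak
tree12399  forest045  Happy Hunny Tree  Willow
tree32345  forest003  Das Ubertree      Blue Spruce
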
“famoustrees” is linked to “forests” using the “forestID” foreign key.  Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods” and two famous trees in the “Black Forest”.
If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:
{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
        },
      "forest045" :
        {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
        },
      "forest127" :
        {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
        }
      },
    "famoustrees" :
      {
      "tree12345" :
        {
          "forestID" : "forest003",
          "name" : "Der Tree",
          "species" : "Red Oak"
        },
      "tree12399" :
        {
          "forestID" : "forest045",
          "name" : "Happy Hunny Tree",
          "species" : "Willow"
        },
      "tree32345" :
        {
          "forestID" : "forest003",
          "name" : "Das Ubertree",
          "species" : "Blue Spruce"
        }
      }
    }
}
Denormalizing the Tables
To collapse the famoustrees table into our forests table, we need to move each famoustree entry underneath its forest entry.  We can also remove the foreign “forestID” key from each famoustree entry – we don’t need that anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry.  We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too.  We’ll tag each of these sections with a new ID of “generalinfo”.  (This is a Cassandra-friendly convention – we’ll get into this more below.)

Represented in JSON, our data now looks like this:
{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
        "generalinfo" :
          {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
          },
        "tree12345" :
          {
            "type" : "famoustree",
            "name" : "Der Tree",
            "species" : "Red Oak"
          },
        "tree32345" :
          {
            "type" : "famoustree",
            "name" : "Das Ubertree",
            "species" : "Blue Spruce"
          }
        },
      "forest045" :
        {
        "generalinfo" :
          {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
          },
        "tree12399" :
          {
            "type" : "famoustree",
            "name" : "Happy Hunny Tree",
            "species" : "Willow"
          }
        },
      "forest127" :
        {
        "generalinfo" :
          {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
          }
        }
      }
    }
}
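The denormalization steps described above can be sketched in a few lines of Python. This is one possible way to do it, assuming the two tables are already loaded as dictionaries keyed by ID; the function name “denormalize” is made up for the example.

```python
import json

# The two original tables, keyed by ID, copied from the normalized JSON.
forests = {
    "forest003": {"name": "Black Forest", "trees": "two million", "bushes": "three million"},
    "forest045": {"name": "100 Acre Woods", "trees": "four thousand", "bushes": "five thousand"},
    "forest127": {"name": "Lonely Grove", "trees": "none", "bushes": "one hundred"},
}
famoustrees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree", "species": "Red Oak"},
    "tree12399": {"forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow"},
    "tree32345": {"forestID": "forest003", "name": "Das Ubertree", "species": "Blue Spruce"},
}


def denormalize(forests, famoustrees):
    """Fold each famous tree into its forest and drop the foreign key."""
    # Step 1: move the non-ID forest information under "generalinfo".
    result = {fid: {"generalinfo": dict(info)} for fid, info in forests.items()}
    for tree_id, tree in famoustrees.items():
        # Step 2: tag the entry with its type, keep everything but forestID.
        entry = {"type": "famoustree"}
        entry.update((k, v) for k, v in tree.items() if k != "forestID")
        # Step 3: file the entry underneath its forest.
        result[tree["forestID"]][tree_id] = entry
    return result


denormalized = {"biologicalfeatures": {"forests": denormalize(forests, famoustrees)}}
print(json.dumps(denormalized, indent=2))
```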
Ready for Cassandra?
There are really only two types of JSON data structures that can be imported directly into Cassandra.  One is the
keyspace->columnfamily->rowkey->column
data structure shown below:
{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
          "column name" : "column value"
        }
      }
    }
}
Add another layer and you get the other supported data structure
keyspace->columnfamily (a.k.a. “supercolumnfamily”)->rowkey->supercolumn (a.k.a. “subcolumn”)->column
shown below:
{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
        "supercolumn" :
          {
          "column name" : "column value"
          }
        }
      }
    }
}
That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be input into Cassandra.
You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.
Let’s break it down:
biologicalfeatures -> forests
…matches the keyspace->columnfamily structure used by both supported JSON structures.
As for the rest:
forest003 -> generalinfo -> (name=”Black Forest”)
…matches the rowkey->supercolumn->column structure used by the “supercolumn” supported JSON structure.
So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
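The “which of the two structures is this?” check can also be written down as a small Python sketch. It looks only at the contents of one column family (row key downward) and is an illustration of the reasoning above, not a Cassandra import tool; the function name and the sample data are assumptions.

```python
def classify(column_family):
    """Return 'standard' or 'super' for a row-key -> ... nested dict."""
    kinds = set()
    for row in column_family.values():
        for value in row.values():
            if isinstance(value, dict):
                # One extra layer is a super column; anything deeper fits
                # neither of the two supported structures.
                if any(isinstance(v, dict) for v in value.values()):
                    raise ValueError("nested deeper than supercolumn->column")
                kinds.add("super")
            else:
                kinds.add("standard")
    if len(kinds) > 1:
        raise ValueError("rows mix plain columns and super columns")
    return kinds.pop() if kinds else "empty"


toys = {"GumDrop": {"Price": "0.25", "Section": "Candy"}}
forests = {
    "forest127": {
        "generalinfo": {"name": "Lonely Grove", "trees": "none", "bushes": "one hundred"}
    }
}

print(classify(toys))
print(classify(forests))
```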

How to Add and Retrieve Data from a Cassandra Database

This article describes how to create a new keyspace on a Cassandra database server, how to add data to that keyspace and how to run some simple queries against that data.


Create a New Keyspace on Cassandra
First, sign on to your Cassandra server using the “cassandra-cli” client.  Use the “show keyspaces” command to ensure you have a live connection to the server and to make sure the keyspace you are about to add doesn’t already exist.
cassandra> show keyspaces
Keyspace1
system
These two keyspaces were created automatically when you installed Cassandra and are completely independent of one another – like separate databases on a relational database system would be.  A diagram of two separate keyspaces in our Cassandra database would look like this:

We want to add a new keyspace called “ToyStore”.  Once we’re done, we’d expect our diagram to look like this:

To create a new, empty keyspace called “ToyStore”, type:
 CREATE KEYSPACE ToyStore with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}];

use ToyStore;

The next command creates a “column family” (if you’re a relational database user think “table” for now) called “Toys”.
The “comparator” attribute in the command controls how column names are indexed and sorted.  The “UTF8Type” value indicates that you’re sorting by UTF8 characters.  Other possible values include “AsciiType”, “LongType” (64-bit long integers) and “BytesType” (straight byte-by-byte comparison – the default value).

CREATE COLUMN FAMILY Toys
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: name, validation_class: UTF8Type},
{column_name: price, validation_class: LongType}
];


Add Data To An Existing Keyspace on Cassandra
Now that we have a new “ToyStore” keyspace it’s time to add some data.  If you were watching closely you’ll have noticed that we did more than add a keyspace in the previous step: we added our first “column family” too.  (Think “table” if you’re coming from a relational database background.)

To get started adding data, restart your Cassandra client and use the following syntax to add six name/value pairs to the “Toys” column family of your new “ToyStore” keyspace.
cassandra> set ToyStore.Toys['Transformer']['Price'] = '29.99'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Price'] = '0.25'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Price'] = '1.49'
Value inserted.
cassandra> set ToyStore.Toys['Transformer']['Section'] = 'Action Figures'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Section'] = 'Candy'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Section'] = 'Vehicles'
Value inserted.
If you run a “help” command from the Cassandra client you will see the following syntax for the kind of “set” command we just used:
set <ksp>.<cf>['<key>']['<col>'] = '<value>'
Let’s break this command syntax down using one of the commands we just typed.
set ToyStore.Toys['Transformer']['Price'] = '29.99'
According to our command syntax, the command we typed meant this:
  • ksp = KeySpace = “ToyStore”
  • cf = Column Family = “Toys”
  • key = Row Key (an indexed key which links multiple columns) = “Transformer”
  • col = Single Column Name (the name in a single name/value pair) = “Price”
  • value = Single Column Value (the value in a single name/value pair) = “29.99”
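The breakdown above can be reproduced mechanically. The following Python sketch pulls the five parts out of a “set” command with a regular expression. It only understands the exact shape shown in the help text (straight single quotes, no escaping) and is an illustration, not a reimplementation of the client’s parser.

```python
import re

# Mirrors: set <ksp>.<cf>['<key>']['<col>'] = '<value>'
SET_PATTERN = re.compile(
    r"^set\s+(?P<ksp>\w+)\.(?P<cf>\w+)"
    r"\['(?P<key>[^']*)'\]\['(?P<col>[^']*)'\]"
    r"\s*=\s*'(?P<value>[^']*)'$"
)


def parse_set(command):
    """Split a 'set' command into keyspace, column family, key, column, value."""
    match = SET_PATTERN.match(command.strip())
    if match is None:
        raise ValueError("not a recognised set command: " + command)
    return match.groupdict()


parts = parse_set("set ToyStore.Toys['Transformer']['Price'] = '29.99'")
for name, text in parts.items():
    print(name, "=", text)
```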

These six commands created a total of three rows in the “Toys” column family: “Transformer”, “GumDrop” and “MatchboxCar”.  Within each row you created two columns: “Section” and “Price”.   Sketched out in a diagram the data you inserted would look something like this:

Within the “Toys” column family, you could also represent this data in JSON like this:
{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}
Starting to make sense? Now, let’s try to pull this data back.

Retrieve Data From An Existing Keyspace on Cassandra
Let’s start by counting the number of name/value pairs (i.e., “columns”) stored under one of the row keys we just inserted.
cassandra> count ToyStore.Toys['GumDrop']
2 columns

If you followed directions, the answer will be “2 columns”, whether you use “GumDrop”, “Transformer” or “MatchboxCar” as your row key.
Now try spelling out the row key in all lowercase.

cassandra> count ToyStore.Toys['gumdrop']
0 columns
Yes, Cassandra row keys are case-sensitive. Consider yourself warned, especially if you’re coming from a database environment where keys are case-insensitive.

Now try a row key that doesn’t exist.
cassandra> count ToyStore.Toys['RedMatterBall']
0 columns
Notice that you didn’t get a “no column exists” error on your count statement; instead you were simply told that zero name/value pairs exist for your non-existent row key.
Now that you know to be careful with the exact name and case of your row keys, let’s pull back the data in a particular row instead of just counting how many columns it contains. To do this, use the “get” command as shown below.
cassandra> get ToyStore.Toys['GumDrop']
=> (column=Section, value=Candy, timestamp=1278132493790000)
=> (column=Price, value=0.25, timestamp=1278132306875000)
Returned 2 results.
The two “column” and “value” entries look familiar but there’s a third item in each of our columns: “timestamp”. That value represents the time when you made each column entry. Timestamp may not mean much to us yet (we will safely ignore it for another article or two), but timestamp will mean a great deal to us when we start merging column inserts/updates from two or more Cassandra database nodes.
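For a feel of what those numbers mean, here is a small Python sketch. It assumes the timestamps shown above are microseconds since the Unix epoch, and it models merging as “the write with the highest timestamp wins”. The competing value “0.30” is invented for the example, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(timestamp_micros):
    """Convert a microsecond timestamp (assumed unit) to a UTC datetime."""
    seconds, micros = divmod(timestamp_micros, 1_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=micros)


def newest(*versions):
    """Pick the (value, timestamp) pair with the highest timestamp."""
    return max(versions, key=lambda version: version[1])


print(to_utc(1278132493790000))

# Two versions of the same column, e.g. written through different nodes.
# The second value is made up purely for illustration.
version_a = ("0.25", 1278132306875000)
version_b = ("0.30", 1278132493790000)
print(newest(version_a, version_b))
```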
By the way, here’s how you could represent the timestamp on each column in your diagram:

But back to our data retrieval task. Before we move on, try at least one row key that doesn’t exist.
cassandra> get ToyStore.Toys['RedMatterBall']
Returned 0 results.
Again, note that Cassandra reports that there are “0 results” for this row key, not that this row key doesn’t exist.
The last thing we’re going to do in this article is drill down into an existing row and only pick out one column (i.e., one name/value pair).
cassandra> get ToyStore.Toys['GumDrop']['Price']
=> (column=Price, value=0.25, timestamp=1278132306875000)
Now try this with a valid row key and an invalid column.
cassandra> get ToyStore.Toys['GumDrop']['Taste']
Exception null
This time we got an error rather than a “count of zero” message!
Relational database folks, are you starting to see the pattern? (Hint: Using non-existent row keys is like executing a “SELECT COUNT(*) FROM DB” with a WHERE clause that matches nothing, but using non-existent column names is like executing a query with invalid fields.)
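The pattern can be mimicked with plain Python dictionaries. The sketch below is a toy model of the behaviour observed in this article (missing row keys give empty results, a missing column in an existing row is an error); it does not talk to Cassandra, and the function names are made up.

```python
toys = {
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "GumDrop": {"Price": "0.25", "Section": "Candy"},
    "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
}


def count(column_family, row_key):
    """Number of columns under a row key; 0 if the row key is unknown."""
    return len(column_family.get(row_key, {}))


def get_row(column_family, row_key):
    """All columns under a row key; empty if the row key is unknown."""
    return dict(column_family.get(row_key, {}))


def get_column(column_family, row_key, column_name):
    """One column value; raises KeyError if the column does not exist."""
    return column_family.get(row_key, {})[column_name]


print(count(toys, "GumDrop"))        # exact row key
print(count(toys, "gumdrop"))        # row keys are case-sensitive
print(get_row(toys, "RedMatterBall"))
print(get_column(toys, "GumDrop", "Price"))
try:
    get_column(toys, "GumDrop", "Taste")
except KeyError as error:
    print("no such column:", error)
```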

Hadoop installation in Pseudo distributed mode tutorial



This document covers the steps to:
1) Configure SSH
2) Install JDK
3) Install Hadoop

Update your repository
#sudo apt-get update

Hadoop uses SSH to verify identity when it connects to a node.
Let's download and configure SSH:
#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown `whoami` $HOME/.ssh/authorized_keys

Testing your SSH
#ssh localhost
Answer “yes” when asked whether to continue connecting.
It should open an SSH connection.
#exit
This will close the SSH connection.

Java 1.6 is required for running Hadoop.
Let's download and install the JDK:
#sudo mkdir /usr/java
#cd /usr/java
#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin

Wait until the JDK download completes, then install Java:
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin

Now comes Hadoop :)
Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.
Download the latest Hadoop version from its website:
http://hadoop.apache.org/common/releases.html

Download the Hadoop 1.0.x tar.gz from the Hadoop website.
Extract it into some folder (say /home/hadoop/software/20/).

All the software has now been downloaded to that location.


Go to the conf directory in the Hadoop folder, open core-site.xml and add the following property inside the empty configuration tags:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>
</configuration>

Do the same for:

conf/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>


conf/mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
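If you would rather generate these three files than edit them by hand, a short Python sketch such as the following could produce the same XML. The helper names and the idea of generating the files from a dictionary are assumptions for illustration; the property names and values are the ones shown above.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def read_configuration(xml_text):
    """Read the name/value pairs back out of a <configuration> document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


# One entry per file described above.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost"},
    "hdfs-site.xml": {"dfs.replication": 1},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:8021"},
}

for filename, properties in SITE_FILES.items():
    print(filename)
    print(build_configuration(properties))
```

The sketch only prints the XML; writing each string into the matching file under conf/ is left to the reader.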

Environment variables
In the hadoop-env.sh file, change JAVA_HOME to the location where you installed Java,
e.g.

export JAVA_HOME=/usr/java/jdk1.6.0_31

Configure the environment variables for the JDK and Hadoop as follows.
Open the ~/.profile file in the current user's home directory
and add the following.
You can change the variable paths if you have installed Hadoop and Java at some other locations.

export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
export PATH=$PATH:$HADOOP_INSTALL/bin

Testing your installation
Format the HDFS
# hadoop namenode -format

hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out

localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out

localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out

hadoop@jj-VirtualBox:~$ start-mapred.sh

starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out

localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out

Open the browser and point it to these pages:
localhost:50030
localhost:50070
They should open the status pages for Hadoop.
That's it, this completes the installation of Hadoop; now you are ready to play with it.
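If no browser is handy, a quick way to see whether the two status pages are being served is to check that something is listening on those ports. The Python sketch below does only that; it assumes the daemons run on localhost with the ports mentioned above, and the function name is made up.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 50030 and 50070 are the status-page ports used in this tutorial.
for port in (50030, 50070):
    state = "open" if is_port_open("localhost", port) else "closed"
    print("localhost:%d is %s" % (port, state))
```

A closed port usually means the corresponding daemon did not start; the log files named in the start-up output above are the first place to look.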

Hadoop course

1. An Introduction To Hadoop And HDFS
  • Why Hadoop?
  • HDFS
  • MapReduce
  • Hive, Pig, HBase and other ecosystem projects
  • Hands-On Exercise: Installing a pseudo-distributed cluster
2. Planning Your Hadoop Cluster
  • General Planning Considerations
  • Choosing The Right Hardware
  • Node Topologies
  • Choosing The Right Software
3. Deploying Your Cluster
  • Installing Hadoop
  • Using SCM Express for easy installation
  • Typical Configuration Parameters
  • Configuring Rack Awareness
  • Using Configuration Management Tools
  • Hands-On Exercise: Installing a Hadoop Cluster
4. Cluster Maintenance
  • Checking HDFS with fsck
  • Hands-On Exercise: Breaking the Cluster
  • Copying data with distcp
  • Rebalancing cluster nodes
  • Adding and removing cluster nodes
  • Hands-On Exercise: Verifying the Cluster's Self-Healing Features
  • Backup And Restore
  • Upgrading and Migrating
  • Hands-On Exercise: Backing Up and Restoring the NameNode Metadata
5. Cloudera Certified Administrator Exam
  • Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Administrator exam
6. Managing and Scheduling Jobs
  • Starting and stopping MapReduce jobs
  • Hands-On Exercise: Managing jobs
  • The FIFO Scheduler
  • The Fair Scheduler
  • Hands-On Exercise: Using the FairScheduler
7. Installing And Managing Other Hadoop Projects
  • Hive
  • Pig
  • HBase
  • Hands-On Exercise: Configuring the Hive Shared Metastore
8. Populating HDFS From External Sources
  • Using Sqoop
  • Using Flume
  • Best Practices for Data Ingestion
9. Cluster Monitoring, Troubleshooting and Optimizing
  • Hadoop Log Files
  • Using the NameNode and JobTracker Web UIs
  • Interpreting Job Logs
  • Monitoring with Ganglia
  • Other monitoring tools
  • General Optimization Tips
  • Benchmarking Your Cluster

As a developer we should know the following


1. The Motivation For Hadoop

    Problems with traditional large-scale systems
    Requirements for a new approach

2.  Hadoop: Basic Concepts

    An Overview of Hadoop
    The Hadoop Distributed File System
    Hands-On Exercise
    How MapReduce Works
    Hands-On Exercise
    Anatomy of a Hadoop Cluster
    Other Hadoop Ecosystem Components

3. Writing a MapReduce Program

    The MapReduce Flow
    Examining a Sample MapReduce Program
    Basic MapReduce API Concepts
    The Driver Code
    The Mapper
    The Reducer
    Hadoop’s Streaming API
    Using Eclipse for Rapid Development
    Hands-on exercise

4. Integrating Hadoop Into The Workflow

    Relational Database Management Systems
    Storage Systems
    Importing Data from RDBMSs With Sqoop
    Hands-On Exercise
    Importing Real-Time Data with Flume
    Accessing HDFS Using FuseDFS and Hoop

5. More Advanced MapReduce Programming

    Custom Writables and WritableComparables
    Saving Binary Data using SequenceFiles and Avro Files
    Creating InputFormats and OutputFormats
    Hands-on exercise

6. Graph Manipulation in Hadoop

    Introduction to graph techniques
    Representing graphs in Hadoop
    Implementing a sample algorithm: Single Source Shortest Path

7. Cloudera Certified Hadoop Developer Exam

    Following the training, attendees will have an opportunity to take the Cloudera Certified Hadoop Developer exam

8. Using Hive and Pig

    Hive Basics
    Pig Basics
    Hands-on exercise

9. Delving Deeper Into The Hadoop API

    Using LocalJobRunner Mode for Faster Development
    Reducing Intermediate Data With Combiners
    The configure and close methods for Map/Reduce Setup and Teardown
    Writing Partitioners for Better Load Balancing
    Directly Accessing HDFS
    Using the Distributed Cache
    Hands-On Exercise

10. Practical Development Tips and Techniques

    Testing with MRUnit
    Debugging MapReduce Code
    Using LocalJobRunner Mode For Easier Debugging
    Retrieving Job Information with Counters
    Logging
    Splittable File Formats
    Determining the Optimal Number of Reducers
    Map-Only MapReduce Jobs
    Implementing Multiple Mappers using ChainMapper
    Hands-On Exercise

11. Common MapReduce Algorithms

    Sorting and Searching
    Indexing
    Machine Learning With Mahout
    Term Frequency – Inverse Document Frequency
    Word Co-Occurrence
    Hands-On Exercise

12. Joining Data Sets in MapReduce Jobs

    Map-Side Joins
    The Secondary Sort
    Reduce-Side Joins
    Hands-On Exercise

13. Creating Workflows with Oozie

    The Motivation for Oozie
    Oozie's Workflow Definition Format
    Hands-On Exercise


Syllabus guidelines for Developer exam
    Core Hadoop Concepts
    Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing. Understand how Apache Hadoop exploits data locality. Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.
    Storing Files in Hadoop
    Analyze the benefits and challenges of the HDFS architecture, including how HDFS implements file sizes, block sizes, and block abstraction. Understand default replication values and storage requirements for replication. Determine how HDFS stores, reads, and writes files. Given a sample architecture, determine how HDFS handles hardware failure.
    Job Configuration and Submission
    Construct proper job configuration parameters, including using JobConf and appropriate properties. Identify the correct procedures for MapReduce job submission. How to use various commands in job submission (“hadoop jar” etc.)
    Job Execution Environment
    Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer. Understand the key fault tolerance principles at work in a MapReduce job. Identify the role of Apache Hadoop Classes, Interfaces, and Methods. Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.
    Input and Output
    Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements. Understand the role of the RecordReader, and of sequence files and compression.
    Job Lifecycle
    Analyze the order of operations in a MapReduce job, how data moves from place to place, how partitioners and combiners function, and the sort and shuffle process.
    Data processing
    Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values. Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).
    Key and Value Types
    Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job. Understand common key and value types in the MapReduce framework and the interfaces they implement.
    Common Algorithms and Design Patterns
    Evaluate whether an algorithm is well-suited for expression in MapReduce. Understand implementation and limitations and strategies for joining datasets in MapReduce. Analyze the role of DistributedCache and Counters.
    The Hadoop Ecosystem
    Analyze a workflow scenario and determine how and when to leverage ecosystems projects, including Apache Hive, Apache Pig, Sqoop and Oozie. Understand how Hadoop Streaming might apply to a job workflow.

Syllabus guidelines for Admin exam
    Apache Hadoop Cluster Core Technologies
    Daemons and normal operation of an Apache Hadoop cluster, both in data storage and in data processing. The current features of computing systems that motivate a system like Apache Hadoop.
    Apache Hadoop Cluster Planning
    Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
    Apache Hadoop Cluster Management
    Cluster handling of disk and machine failures. Regular tools for monitoring and managing the Apache Hadoop file system
    Job Scheduling
    How the default FIFO scheduler and the FairScheduler handle the tasks in a mix of jobs running on a cluster.
    Monitoring and Logging
    Functions and features of Apache Hadoop’s logging and monitoring systems.

References

http://www.philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/
http://www.cs.bgu.ac.il/~dsp112/The_Map-Reduce_Pattern
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/