Techie Talks

Friday, 22 June 2012

Unable to execute Protect Excel task in SQL Server 2008

Configure DCOM Config by following this process:
Control Panel > Administrative Tools > Component Services > Computers > My Computer > DCOM Config > Microsoft Excel Application > Properties > Identity tab > select The interactive user > OK.

If you are unable to find Microsoft Excel Application in DCOM Config, then follow this process.

On a 64-bit system with 32-bit Office, try this:

Start
Run
mmc -32
File
Add Remove Snap-in
Component Services
Add
OK
Console Root
Component Services
Computers
My Computer
DCOM Config
Microsoft Excel Application
Properties
Identity
Set to Interactive user

Friday, 8 June 2012

Interview questions links

http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html
http://hadoop-interview-questions.blogspot.in/

Thursday, 7 June 2012

Install hive

Step 1: Enable multiverse repo and get packages
The first thing we need to do is make sure we've got the multiverse repos enabled. Using your favorite editor (vi), add these lines to your /etc/apt/sources.list:

deb http://us.archive.ubuntu.com/ubuntu/ lucid multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ lucid multiverse
deb http://us.archive.ubuntu.com/ubuntu/ lucid-updates multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ lucid-updates multiverse

With that done, go ahead and update your copy and install the subversion, java, and ant packages you'll need to do the install.

sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install openjdk-6-jdk ant subversion

Step 2: Get Hadoop
The next thing we'll do is grab Hadoop. Be sure to get the latest version. For this tutorial we're using 0.20.2.

wget http://mirror.its.uidaho.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

We'll move this to /usr/local, untar it, and then rename it. Use any alternate technique you like here (e.g. symlinks, different directories, etc.); there's no magic in this step.

sudo mv hadoop-0.20.2.tar.gz /usr/local/
cd /usr/local
sudo tar xvzf hadoop-0.20.2.tar.gz
sudo mv hadoop-0.20.2 hadoop
cd hadoop

Once you've extracted it and moved into the directory, find the JAVA_HOME line in the environment script and uncomment it, like so:

sudo vi conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/

Then type

sudo ant

Finally, when ant is done doing its thing, remove the build directory:

sudo rm -rf /usr/local/hadoop/build

Step 3. Get Hive

From /usr/local let's go ahead and check out Hive using subversion and then build it:

sudo svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
cd hive
sudo ant package

By default Hive uses a directory called /user/hive/warehouse. You can change that if you like, but for simplicity, we'll just go ahead and create it instead.

sudo mkdir -p /user/hive/warehouse

Step 4: Add the ingredients to your PATH
I'm running hive as root in development but you can add this PATH statement to whatever user has permissions.

export PATH=$PATH:/usr/local/hive/build/dist/bin/
export PATH=$PATH:/usr/local/hive/build/dist/lib/
export PATH=$PATH:/usr/local/hadoop/bin

Once done, log out and log back in (so your path takes hold) and then as root you can launch hive using this command:

hive --service hiveserver

If you get an error about hadoop not being found, make sure you've renamed your hadoop-0.20.2 folder to just hadoop (or used symlinks or whatever).

Wednesday, 6 June 2012

SQOOP

Sqoop Import Examples:

Sqoop Import: imports data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) and related projects (Hive, HBase).

Import the data (MySQL table) to HBase:

Case 1: The table has a primary key, and all columns of the MySQL table are imported into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --column-family hbase_table_col1 --hbase-create-table

Case 2: The table has a primary key, and only a few columns of the MySQL table are imported into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --columns column1,column2 --column-family hbase_table_col1 --hbase-create-table

Note: The column names given in --columns must include the primary key column.

Case 3: The table doesn't have a primary key, so choose one column as the hbase-row-key. All columns of the MySQL table are imported into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --column-family hbase_table_col1 --hbase-row-key column1 --hbase-create-table

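Case 4: The table doesn't have a primary key, so choose one column as the hbase-row-key. Only a few columns of the MySQL table are imported into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --columns column1,column2 --column-family hbase_table_col1 --hbase-row-key column1 --hbase-create-table

Note: The column named in --hbase-row-key must be in the --columns list. Otherwise the command will execute successfully but no records are inserted into HBase.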
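The two notes above (row key must be in the column list) are easy to forget when the command is assembled by a script. Here is a minimal Python sketch of a guard for that. The function name and its parameters are my own invention for illustration, it only uses flags already shown in this post, and it leaves out the --connect/--username/--password part on purpose.

```python
def build_hbase_import_args(table, hbase_table, column_family,
                            columns=None, row_key=None):
    """Return the argument list for a `sqoop import` into HBase.

    Hypothetical helper: it only assembles flags that appear in this post
    and refuses the combination that silently imports nothing.
    """
    if columns and row_key and row_key not in columns:
        raise ValueError(
            "--hbase-row-key column %r must be listed in --columns" % row_key)
    args = ["import", "--table", table,
            "--hbase-table", hbase_table,
            "--column-family", column_family]
    if columns:
        args += ["--columns", ",".join(columns)]
    if row_key:
        args += ["--hbase-row-key", row_key]
    args.append("--hbase-create-table")
    return args


if __name__ == "__main__":
    print(" ".join(build_hbase_import_args(
        "tableName", "hbase_tableName", "hbase_table_col1",
        columns=["column1", "column2"], row_key="column1")))
```

The check runs before anything is sent to Sqoop, so the mistake shows up as an error instead of an empty HBase table.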

Note: The value of the primary key column, or of the column specified in --hbase-row-key, becomes the HBase row key. If the MySQL table doesn't have a primary key, or the column specified in --hbase-row-key doesn't hold unique values, some records are lost.

Example: Consider a MySQL table test_table which has two columns, name and address. The table test_table doesn't have a primary key or unique key column.

Records of test_table:
________________
name    address
----------------
abc    123
sqw    345
abc    125
sdf    1234
aql    23dw

Run the following command to import test_table data into HBase:

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table test_table --hbase-table hbase_test_table --column-family test_table_col1 --hbase-row-key name --hbase-create-table

Only 4 records are visible in the HBase table instead of 5. In the example above, two rows have the same value 'abc' in the name column, and the value of this column is used as the HBase row key. When the first record with name 'abc' arrives, it is inserted into the HBase table. When the next record with the same name 'abc' arrives, it overwrites the values of the previous record.

The same problem also occurs if the table has a composite primary key, because only one column of the composite key is used as the HBase row key.
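To see why one record goes missing, here is a small Python sketch that treats the HBase table as a plain dictionary keyed by row key. This is only a model of the visible result, not how HBase stores data internally; the function name is made up for illustration.

```python
# The five test_table records from the example above: (name, address).
ROWS = [
    ("abc", "123"),
    ("sqw", "345"),
    ("abc", "125"),
    ("sdf", "1234"),
    ("aql", "23dw"),
]


def load_with_row_key(rows):
    """Model an import where `name` is used as the row key.

    A later record with the same key replaces the visible value of the
    earlier one, just like a second put to the same row.
    """
    table = {}
    for name, address in rows:
        table[name] = {"address": address}
    return table


if __name__ == "__main__":
    table = load_with_row_key(ROWS)
    print(len(table))      # 4 rows visible, not 5
    print(table["abc"])    # {'address': '125'}
```

The second 'abc' record wins, so the address 123 is no longer visible after the import.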

Import the data (MySQL table) to Hive

Case 1: Import a MySQL table into Hive when the table has a primary key.

$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home

Case 2: Import a MySQL table into Hive when the table doesn't have a primary key.

$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home --split-by column_name

or

$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home -m 1
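Why does a table without a primary key need --split-by or -m 1? Sqoop has to cut the table into ranges so that each mapper imports its own slice, and without a key it has no column to cut on. The Python sketch below shows the idea for an integer column. It is a simplification under my own assumptions (closed integer ranges, hi >= lo), not Sqoop's actual splitter, and the function name is hypothetical.

```python
def integer_splits(lo, hi, num_mappers):
    """Divide the closed range [lo, hi] into contiguous closed sub-ranges.

    lo and hi stand for the minimum and maximum of the split column.
    At most `num_mappers` ranges are returned; with num_mappers == 1 the
    whole table is a single slice, which is what -m 1 amounts to.
    """
    total = hi - lo + 1
    count = min(num_mappers, total)
    size, extra = divmod(total, count)
    splits = []
    start = lo
    for i in range(count):
        # The first `extra` slices take one additional value each.
        end = start + size - 1 + (1 if i < extra else 0)
        splits.append((start, end))
        start = end + 1
    return splits


if __name__ == "__main__":
    print(integer_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
    print(integer_splits(1, 100, 1))  # [(1, 100)]
```

Each mapper would then read only the rows whose split column falls inside its range.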

Import the data (MySQL table) to HDFS

Case 1: Import a MySQL table into HDFS when the table has a primary key.

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/me/tableName

Case 2: Import a MySQL table into HDFS when the table doesn't have a primary key.

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/me/tableName -m 1

Sqoop Export Examples:

Sqoop Export: exports data from HDFS and related projects (Hive, HBase) back into an RDBMS.

Export Hive table back to an RDBMS:

By default, Hive stores data using ^A as the field delimiter and \n as the row delimiter.

$ bin/sqoop export --connect jdbc:mysql://localhost/test_db --table tableName --export-dir /user/hive/warehouse/tableName --username root --password password -m 1 --input-fields-terminated-by '\001'

where '\001' is the octal representation of ^A.
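If you want to convince yourself what that delimiter is, here is a tiny Python sketch that builds and parses Hive-style text using ^A between fields and a newline between rows. The sample values are borrowed from the test_table example earlier in this post; the function name is made up.

```python
# Octal 001, hex 01 and character code 1 are all the same ^A character.
CTRL_A = "\001"


def parse_hive_text(data):
    """Split Hive's default text format into rows of fields."""
    return [line.split(CTRL_A) for line in data.split("\n") if line]


if __name__ == "__main__":
    sample = CTRL_A.join(["abc", "123"]) + "\n" + CTRL_A.join(["sqw", "345"]) + "\n"
    print(CTRL_A == "\x01" == chr(1))  # True
    print(parse_hive_text(sample))     # [['abc', '123'], ['sqw', '345']]
```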

Tuesday, 5 June 2012

Apache Hive Installation

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, along with a simple query language called HiveQL, which is based on SQL and lets users familiar with SQL query the data. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support.
Installation of Hive is pretty straightforward and easy. With the least chit-chat, I will get down to business for ya!

Prerequisites

Sun Java 6

Hadoop requires Sun Java 5.0.x. However, the Hive wiki lists Sun Java 6.0 as a prerequisite, so we will stick to Sun Java 6.0.

Hadoop (0.17.x – 0.19.x)

We must have Hadoop already up and running (support for 0.20.x is still in progress, so 0.17.x to 0.19.x is preferable)!
Note:
a) For this tutorial purpose, we will be referring to a Single Node Hadoop installation

SVN

SVN, aka Subversion, is an open source version control system. Most Apache projects are hosted on SVN, so it's a good idea to have it on your system if you don't already.
For this tutorial, you will need it to grab the code from the Hive SVN repository.
Download it from: http://subversion.tigris.org/

Ant

Ant or Apache Ant is a Java-based build tool. In present context, you will need it to build the ‘checked out’ Hive code.
Download it from: http://ant.apache.org/

Downloading and Building Hive

Hive is available via SVN at: http://svn.apache.org/repos/asf/hadoop/hive/trunk

We will first checkout Hive’s code

svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive

This will put Hive trunk’s content (Hive’s development repository) in your local ‘hive’ directory

Now, we will build the downloaded code

cd hive
ant -Dhadoop.version="<your-hadoop-version>" package

For example

ant -Dhadoop.version="0.19.2" package

Your built code is now in build/dist directory

cd build/dist
ls

On ‘ls’ you will see the following content:

README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)

The “build/dist/” directory is your Hive Installation and moving further we are going to call it Hive Home.

Let us set an environment variable for our Hive Home too:

export HIVE_HOME=<some path>/build/dist

For example

export HIVE_HOME=/data/build/dist

Hadoop Side Changes

Hive uses Hadoop, which means:

1. you must have hadoop in your path, OR
2. export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w) before a table can be created in Hive.

Commands to perform these changes

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Running Hive

Now, you are all set to run Hive for yourself! Invoke the command line interface (cli) from the shell:

$HIVE_HOME/bin/hive

Monday, 4 June 2012

Cassandra and Hadoop: which one is best?

Nearly every blog is confused about which NoSQL store, HBase or Cassandra, is better. I think it depends on the use case.

Data Model

Cassandra

You have a token ring. Each node takes a section of the ring.
Cassandra lets you choose between a RandomPartitioner (hashing) and an OrderPreservingPartitioner (keys kept in key order).

If you use RandomPartitioner you cannot range scan on keys (only on the columns of a key), but you can use OPP and hash the keys yourself.
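Here is a rough Python sketch of the two ideas: a token ring where each node owns the tokens up to and including its own, and a range scan that only works when keys are kept in key order. It is a toy model under my own assumptions (MD5 as the hash, made-up node names and class name), not Cassandra's exact token computation.

```python
import bisect
import hashlib


def md5_token(key):
    """Hash a key to an integer token, as a hashing partitioner would."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring(object):
    """Toy token ring: a node owns every token up to and including its own."""

    def __init__(self, node_tokens):
        pairs = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in pairs]
        self.nodes = [node for _, node in pairs]

    def node_for(self, token):
        # First node whose token is >= the given token, wrapping around.
        index = bisect.bisect_left(self.tokens, token)
        return self.nodes[index % len(self.nodes)]


def range_scan(sorted_keys, start, end):
    """Keys in [start, end); only meaningful when keys are stored in order."""
    return sorted_keys[bisect.bisect_left(sorted_keys, start):
                       bisect.bisect_left(sorted_keys, end)]


if __name__ == "__main__":
    ring = Ring({"A": 2**128 // 3, "B": 2 * 2**128 // 3, "C": 2**128 - 1})
    keys = sorted(["damson", "apple", "cherry", "banana"])
    for key in keys:
        print(key, ring.node_for(md5_token(key)))
    print(range_scan(keys, "b", "d"))  # ['banana', 'cherry']
```

With hashing, neighbouring keys land on unrelated tokens, so there is no contiguous slice of the ring to scan. With keys kept in order, the range scan is just a slice.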

Hbase

You have regions. When regions grow large they split into sub-regions.
You only have the keys in byte order, so if you are inserting based on timestamp, one region gets overloaded. Randomize your inserts if they are timestamps.
Practical people know you can hash yourself
HBase has an advantage if you want to range scan on keys
People know you will most likely build your own secondary indexes, as a range scan on keys is not going to give you every SQL feature you could dream of.
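"Hash yourself" for timestamp keys usually means putting a short hash-derived prefix in front of the key so consecutive timestamps spread over several regions. A minimal Python sketch, assuming 16 buckets and an MD5-based prefix (both are my choices for illustration, as is the function name):

```python
import hashlib


def salted_row_key(timestamp_ms, buckets=16):
    """Prefix a timestamp row key with a bucket number derived from its hash.

    The prefix is deterministic, so the same timestamp always maps to the
    same key and can be looked up again.
    """
    raw = str(timestamp_ms).encode("ascii")
    bucket = int(hashlib.md5(raw).hexdigest(), 16) % buckets
    return ("%02d-" % bucket).encode("ascii") + raw


if __name__ == "__main__":
    for ts in (1338940800000, 1338940800001, 1338940800002):
        print(salted_row_key(ts))
```

The price is that a scan over a time range now needs one scan per bucket, which is the trade-off against the plain byte-ordered key.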

Setup

Cassandra

Cassandra is very easy to set up and get working out of the box
People know you are going to have to understand and tune it anyway, so being able to boot up in seconds does not mean much. Just bringing up a node and letting it auto-bootstrap will likely not be what you want

HBase

HBase lives on top of Hadoop, which is both a blessing and a curse
People know setting up Hadoop is not easy
People know HBase has many moving parts (NN, 2NN, JT (for counts), datanodes, task trackers, ZooKeepers, HBase masters)
People know it's a bigger stack

Cluster size

Cassandra

Cassandra is actually effective at small sizes 1, 3, 5 nodes
You do not get 'decent' scale out till ~7 nodes
People say "cool I can grow into this even at a small shop"

Hbase

Just too much setup and too many components to make sense at less than ~7 nodes
Many people are looking to solve big data problems, so a node count greater than 7 is not an issue for them. They have some iron to throw at the problem.

Scale out

Cassandra

Cassandra allows you to move nodes and have new nodes take sections of the ring
People have noticed that you have to keep the cluster balanced, which DOES take administrative work
People know moving from 11 to 15 nodes is intensive (compaction, streaming, cleanup)
People see joins sometimes fail and have to do them again (annoyed)

HBase

Regions split and move quite frequently and automatically
People see that they do not always move where you want
People see that automatic moving and splitting fails sometimes. It might happen 1 time in 1,000,000, and you may have to do surgery.

Release cycle

Cassandra

The Cassandra code base moves pretty quickly and is pretty agile.
People have been waiting a while for critical features such as 'efficient moves'
There have been some recent releases that were followed very quickly by a bug fix release

Hbase

The hbase code base moves pretty quickly and is pretty agile.
People get worried by a blurring number of JIRAs per release
People need a map to figure out the Hadoop/HBase version matrix and what works together. Hadoop 0.20 append?

Consistency

Cassandra

Cassandra allows the user to choose consistency models

Hbase

Offers one consistency model

API & RPC

Cassandra

Thrift and only thrift

Hbase

pure java client

Administration

Cassandra

Admins are happy with one log
Log is under control in terms of events
Administration is more than just a log file and the wiki is skimpy
Best practices are skimpy too.
Schema design info is skimpy too.

hbase & hadoop & zk & ...

People see A LOT of log files and a lot of cryptic hadoop messages
People see this actually takes a team of people!

Facebook messaging & U

Hbase

Makes sense. Many nodes!
Makes sense. Many hadoop/hbase committers!
Makes sense. Really really really big scale!
Haters point out 'still uses memcache db'

Cassandra

At the time did NOT have online schema updates (does now in 0.7)
see section on scale out!
Try to get them next go around!

Transactions

Cassandra

Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.

However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
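The write path described above (commit log first, memory table second, replay on restart) can be sketched in a few lines of Python. This is a toy model: the Disk class stands in for durable storage, a crash is modelled by throwing away the Replica object, and all class and method names are my own.

```python
class Disk(object):
    """Stands in for durable storage that survives a crash."""

    def __init__(self):
        self.commit_log = []   # appended to on every write
        self.sstable = {}      # data flushed from the memory table


class Replica(object):
    def __init__(self, disk):
        self.disk = disk
        self.memtable = {}
        # On (re)start, replay the commit log to rebuild unflushed writes.
        for key, value in disk.commit_log:
            self.memtable[key] = value

    def write(self, key, value):
        # Record in the commit log and in memory before acknowledging.
        self.disk.commit_log.append((key, value))
        self.memtable[key] = value
        return True

    def flush(self):
        # Persist the memory table; the logged writes are no longer needed.
        self.disk.sstable.update(self.memtable)
        self.memtable.clear()
        del self.disk.commit_log[:]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        return self.disk.sstable.get(key)


if __name__ == "__main__":
    disk = Disk()
    node = Replica(disk)
    node.write("k1", "v1")
    node.flush()
    node.write("k2", "v2")      # acknowledged but not yet flushed
    restarted = Replica(disk)   # crash: the old memtable is gone
    print(restarted.read("k1"), restarted.read("k2"))  # v1 v2
```

The unflushed write to k2 comes back after the restart because it was in the commit log.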

HBase

Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster.

In Cassandra, writing with a consistency level of ALL means that the data will be written to all N nodes responsible for the particular piece of data, where N is the replication factor, before the client gets a response. In a standard Cassandra configuration, the write goes into an in-memory table and an in-memory log for each node. The log is periodically batch flushed to disk; there is also an option to flush per commit, but this option severely impacts performance. Subsequent reads from any node are strongly consistent and get the most recent update.

In contrast, HBase has only one region server responsible for serving a given piece of data at any one time, and replication is handled on the HDFS layer. A client sends an update to the region server currently responsible for the update key, and the region server responds with an ack as soon as it updates its in-memory data structure and flushes the update to its write-ahead commit log. In older versions of HBase, the log was configured in a similar manner to Cassandra to flush periodically. As a few commenters have pointed out, the default configuration of more recent versions of HBase flushes the commit log before acknowledging writes to the client, using group commit to batch flushes across writes for performance. Replication to the N HDFS nodes responsible for the written data still happens asynchronously, however. HBase ensures strong consistency by routing subsequent reads through the same region server and, if a region server goes down, by using a system of locks based on ZooKeeper so that reads take into account the latest update.

Because Cassandra writes data synchronously to all N nodes in this scheme whereas HBase writes data synchronously to only one node, Cassandra is necessarily slower. In this scheme, write latency in Cassandra is essentially bottlenecked by the slowest machine and subject to variance in network speeds, IO speeds, and CPU loads across machines. HBase pays a disk cost for its forced log sync, but in high throughput environments, group commit amortizes the disk cost across concurrent requests.

The tradeoff comes in availability. Because only the write-ahead log has been replicated to the other HDFS nodes, if the region server that accepted the write fails, the ranges of data it was serving will be temporarily unavailable until a new server is assigned and the log is replayed. On the other hand, Cassandra will still have and serve the data (given the read level of ONE) even if N-1 nodes responsible for the data go down.
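The Cassandra side of this trade-off comes down to counting replicas. Below is a small Python sketch, assuming a replication factor N and the consistency levels ONE and ALL from the text, plus QUORUM, which the post does not mention but Cassandra also offers. The function names are mine.

```python
def required_replicas(level, n):
    """How many of the N replicas must respond for the given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1
    if level == "ALL":
        return n
    raise ValueError("unknown consistency level: %r" % level)


def can_serve(level, n, replicas_up):
    """True if enough replicas are alive to satisfy the level."""
    return replicas_up >= required_replicas(level, n)


def read_sees_latest_write(write_level, read_level, n):
    """Reads overlap the latest write when W + R > N."""
    return required_replicas(write_level, n) + required_replicas(read_level, n) > n


if __name__ == "__main__":
    n = 3
    print(read_sees_latest_write("ALL", "ONE", n))  # True
    print(can_serve("ONE", n, replicas_up=1))       # True: reads survive N-1 failures
    print(can_serve("ALL", n, replicas_up=2))       # False: one node down blocks writes
```

So write ALL plus read ONE keeps reads available through N-1 failures, at the cost of writes needing every replica up.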

Let me cap this by saying both products are awesome and have many awesome people heading them up. THERE ARE MANY CASES WHERE I WOULD CHOOSE ONE OR THE OTHER!

Why NOSQL for logging

There were three problems with the use of a relational database table for logging.

First, there was a blocking insert problem. Whenever something noteworthy happened on my systems (e.g., a sign on, a file upload, an administrative configuration change, etc.) I logged it. As long as I didn’t have a busy system things were generally fine, but if a couple of different people hit me with extended periods of rapid file uploads, sign-in/offs from unthrottled API clients then my software would shudder and sometimes thrash.

Second, there was an oversubscription problem, where I added even more load onto the log database by using it heavily for common, interactive queries, such as looking back across the log for recent sign-ons. While that sounds like a good idea because there would only be one authoritative set of records to check, it also magnified the effect of my blocking insert problem. (e.g., if I got hit with a lot of sign-ons, the act of recording the sign-on in the log would block and slow other sign-ons too.)

Finally, the most serious problem occurred when it was time to upgrade. My upgrades often involved a schema change in the log database, and that meant I needed to lock the database and update all the log records – often tens of millions of records. This was too frequently an operation that could take hours and often took 100x more time to complete than all other upgrade operations combined.

So…what should I have done? One answer would have been to look at non-relational NoSQL database technology (such as that available in Apache Cassandra) for my log tables instead. That would have addressed:

the blocking insert problem: NoSQL databases, especially distributed NoSQL databases like Cassandra, do not wait for inserts.
the oversubscription problem: without delays due to blocking inserts, the problem of lots of reads waiting on blocking inserts goes away.
schema changes: NoSQL datasets support data of various formats, allowing old and new schema data to live next to each other and preventing outages caused by touching all existing data. (The multiple schemas put a little more burden on the application to keep these straight, but it allows the application to handle multiple versions and/or upgrade old ones in the background without downtime.)
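The schema point is where the application picks up the extra burden, so here is a Python sketch of what "handling multiple versions" might look like. The field names, the schema_version marker, and the two versions are all hypothetical, invented only to show old and new log records living side by side.

```python
def normalize(record):
    """Return a log record in the newest shape, whatever version it was written in.

    Hypothetical layout: version 1 records have `user` and `event`;
    version 2 records add an `ip` field. Records without a version marker
    are treated as version 1.
    """
    version = record.get("schema_version", 1)
    if version == 1:
        return {"user": record["user"], "event": record["event"], "ip": None}
    if version == 2:
        return {"user": record["user"], "event": record["event"],
                "ip": record.get("ip")}
    raise ValueError("unknown schema version: %r" % version)


if __name__ == "__main__":
    log = [
        {"user": "alice", "event": "sign-on"},
        {"schema_version": 2, "user": "bob", "event": "upload", "ip": "10.0.0.5"},
    ]
    for entry in log:
        print(normalize(entry))
```

Old records are never rewritten; they are just read through the same function, which is what avoids the hours-long upgrade step.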


CREATE KEYSPACE Mathblaster with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}];

Use Mathblaster;

CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: full_name, validation_class: UTF8Type},
{column_name: email, validation_class: UTF8Type},
{column_name: state, validation_class: UTF8Type},
{column_name: gender, validation_class: UTF8Type},
{column_name: birth_year, validation_class: LongType}
];

CREATE COLUMN FAMILY blog_entry
WITH comparator = TimeUUIDType
AND key_validation_class=UTF8Type
AND default_validation_class = UTF8Type;