Thursday 26 July 2012

Hive tutorial

CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. Use the DELIMITED clause to read delimited files, and use the SERDE clause to create a table with a custom SerDe.
You must specify a list of columns for tables that use a native SerDe. A list of columns for tables that use a custom SerDe may be specified, but Hive will query the SerDe to determine the actual list of columns for the table.
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
Use STORED BY to create a non-native table, for example in HBase. See Hive/StorageHandlers for more information on this option.
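A minimal sketch of a non-native table backed by HBase via the Hive HBase storage handler (the table name, column names, and column family are hypothetical; the storage handler class and the hbase.columns.mapping property come from the Hive HBase integration):
CREATE TABLE hbase_page_views(key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:val')
TBLPROPERTIES ('hbase.table.name' = 'page_views');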
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can be used to improve performance on certain kinds of queries.
Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
There are two parts in CTAS: the SELECT part can be any SELECT statement supported by HiveQL, and the CREATE part takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restriction in CTAS is that the target table cannot be a partitioned table (nor can it be an external table).
Examples:
Here's an example statement to create a table:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 STORED AS SEQUENCEFILE;
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
The above statement lets you create the same table as the previous table.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;
In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
In all the examples so far, the data is stored under the Hive warehouse directory (hive.metastore.warehouse.dir) in the sub-directory page_view.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '';
You can use the above statement to create a page_view table which points to any hdfs location for its storage. But you still have to make sure that the data is delimited as specified in the query above.
CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile AS
SELECT (KEY % 1024) new_key, concat(KEY, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
The above CTAS statement creates the target table new_key_value_store with the schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the SELECT statement. If the SELECT statement does not specify column aliases, the column names will be automatically assigned to _col0, _col1, and _col2 etc. In addition, the new target table is created using a specific SerDe and a storage format independent of the source tables in the SELECT statement.
Before using this command, the table key_value_store can be created and loaded as follows:
CREATE TABLE key_value_store (
    KEY int,
    value string
)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;


LOAD DATA LOCAL INPATH 'file:///C:/Users/nkaushik/Desktop/key_value.txt'
INTO TABLE key_value_store;

The file key_value.txt is tab delimited and contains the following entries:

1    :value_1
2    :value_2
3    :value_3
4    :value_4
5    :value_5


CREATE TABLE empty_key_value_store
LIKE key_value_store;
In contrast, the statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.

Inserting Data Into Bucketed Tables

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and by using CLUSTER BY and SORT BY commands in their query. A sketch of populating the bucketed page_view table defined earlier is shown below.
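This is a minimal sketch only, assuming a hypothetical unbucketed staging table page_view_stg with the same five data columns as page_view; the partition values are illustrative:

set mapred.reduce.tasks=32;
-- 32 reducers, one per bucket, matching INTO 32 BUCKETS in the table definition

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-08-08', country='us')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, pvs.ip
DISTRIBUTE BY pvs.userid
SORT BY pvs.viewTime;

Here DISTRIBUTE BY sends each userid to a single reducer (and hence a single bucket file), and SORT BY orders the rows within each bucket by viewTime, matching the CLUSTERED BY/SORTED BY declaration.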


Drop Table

DROP TABLE table_name
DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current directory if Trash is configured. The metadata is completely lost.
  • When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
  • When dropping a table referenced by views, no warning is given
    • the views are left dangling as invalid and must be dropped or recreated by the user
See the next section on ALTER TABLE for how to drop partitions.


Alter Table Statements

Alter table statements enable you to change the structure of an existing table. You can add columns/partitions, change serde, add table and SerDe properties, or rename the table itself.

Add Partitions

ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
 
partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
You can use ALTER TABLE ADD PARTITION to add partitions to a table. Partition values should be quoted only if they are strings.
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') 
              location '/path/to/us/part080808' 
              PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';



Drop Partitions

ALTER TABLE table_name DROP partition_spec, partition_spec,...
You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and metadata for this partition.
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');


Rename Table

ALTER TABLE table_name RENAME TO new_table_name
This statement lets you change the name of a table to a different name.
Note: a rename on a managed table moves its HDFS location as well as changing the name in the metadata store.
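For example (the table names are purely illustrative):
ALTER TABLE clicks RENAME TO clicks_raw;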


Change Column Name/Type/Position/Comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
 

This command will allow users to change a column's name, data type, comment, or position, or an arbitrary combination of them.
Examples:
CREATE TABLE test_change (a int, b int, c int);
 
ALTER TABLE test_change CHANGE a a1 INT;
    -- will change column a's name to a1
 
ALTER TABLE test_change CHANGE a a1 STRING AFTER b;
    -- will change column a's name to a1, a's data type to string, and put it after column b.
    -- The new table's structure is: b int, a1 string, c int
 
ALTER TABLE test_change CHANGE b b1 INT FIRST;
    -- will change column b's name to b1, and put it as the first column.
    -- The new table's structure is: b1 int, a string, c int

NOTE: The column change command will only modify Hive's metadata, and will NOT touch the actual data. Users should make sure the actual data layout conforms with the metadata definition.


Add/Replace Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe). REPLACE COLUMNS can also be used to drop columns. For example:
ALTER TABLE test_change REPLACE COLUMNS (a int, b int);
    -- will remove column 'c' from test_change's schema

Note that this does not delete underlying data, it just changes the meta data's schema.
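For completeness, a minimal ADD COLUMNS sketch on the same table (the added column is hypothetical):
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'a hypothetical new column');
    -- d is appended after the existing columns but before any partition columns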


Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties
 
table_properties:
  : (property_name = property_value, property_name = property_value, ... )
You can use this statement to add your own metadata to the tables. Currently the last_modified_user and last_modified_time properties are automatically added and managed by Hive. Users can add their own properties to this list. You can run DESCRIBE EXTENDED on the table to see this information.
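A minimal sketch (the property key and value are hypothetical):
ALTER TABLE page_view SET TBLPROPERTIES ('data_owner' = 'analytics_team');
DESCRIBE EXTENDED page_view;   -- the new property appears in the table's parameters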


Add Serde Properties

ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties
 
serde_properties:
  : (property_name = property_value, property_name = property_value, ... )
This statement enables you to add user-defined metadata to the table's SerDe object. The SerDe properties are passed to the table's SerDe when it is initialized by Hive to serialize and deserialize data, so users can store any information required for their custom SerDe here. Refer to the SerDe section of the Users Guide for more information.
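A minimal sketch, assuming the table uses the default delimited SerDe (field.delim is that SerDe's field delimiter property; the value shown is illustrative):
ALTER TABLE page_view SET SERDEPROPERTIES ('field.delim' = '\t');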


Alter Table File Format and Organization

ALTER TABLE table_name [partitionSpec] SET FILEFORMAT file_format;
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS
These statements change the table's physical storage properties. For available file_format options, see the section above on CREATE TABLE.
The file format change can also be applied to a single partition.
NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.
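For example, on the page_view table created earlier (the partition values and the new bucket count are illustrative; as the note above says, only the metadata changes):
ALTER TABLE page_view PARTITION (dt='2008-08-08', country='us') SET FILEFORMAT RCFILE;
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 64 BUCKETS;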


Alter Table/Partition Location

ALTER TABLE table_name [partitionSpec] SET LOCATION "new location"
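For example (the new path is hypothetical):
ALTER TABLE page_view PARTITION (dt='2008-08-08', country='us') SET LOCATION 'hdfs://master:54310/new/path/to/us/part080808';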


Alter Table Touch

ALTER TABLE table_name TOUCH;
ALTER TABLE table_name TOUCH PARTITION partition_spec;
TOUCH reads the metadata, and writes it back. This has the effect of causing the pre/post execute hooks to fire. An example use case is if you have a hook that logs all the tables/partitions that were modified, along with an external script that alters the files on HDFS directly. Since the script modifies files outside of Hive, the modification wouldn't be logged by the hook. The external script could call TOUCH to fire the hook and mark the said table or partition as modified.
Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as well.
Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table)
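For example, to fire the hooks for one partition of the page_view table:
ALTER TABLE page_view TOUCH PARTITION (dt='2008-08-08', country='us');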


Create/Drop View

Create View

CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
If no column names are supplied, the names of the view's columns will be derived automatically from the defining SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.) When renaming columns, column comments can also optionally be supplied. (Comments are not automatically inherited from underlying columns.)
A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.
A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-level clauses are evaluated after the view clauses (and after any other operations in the query). For example, if a view specifies LIMIT 5, and a referencing query is executed as (SELECT * FROM v LIMIT 10), then at most 5 rows will be returned.
Example of view creation:
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';


Drop View

DROP VIEW view_name
DROP VIEW removes metadata for the specified view. (It is illegal to use DROP TABLE on a view.)
When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).
Example:
DROP VIEW onion_referrers;


Show Tables

SHOW TABLES identifier_with_wildcards
SHOW TABLES lists all the base tables and views with names matching the given regular expression. The regular expression can contain only '*' for any character(s) or '|' for a choice. Examples are 'page_view', 'page_v*', and '*view|page*', all of which will match the 'page_view' table. Matching tables are listed in alphabetical order. It is not an error if no matching tables are found in the metastore.
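For example:
SHOW TABLES 'page_v*';
This lists page_view along with any other table or view whose name starts with page_v.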


Show Partitions

SHOW PARTITIONS table_name
SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order.
It is also possible to specify parts of a partition specification to filter the resulting list.
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');
SHOW PARTITIONS table_name PARTITION(hr='12');
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');




Show Table/Partitions Extended

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE identifier_with_wildcards [PARTITION(partition_desc)]
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use a regular expression for the table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If a partition is specified, it will output the given partition's file system information instead of the table's file system information.
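For example, reusing the partition values from the earlier page_view examples:
SHOW TABLE EXTENDED LIKE 'page_view' PARTITION (dt='2008-08-08', country='us');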


Show Functions

SHOW FUNCTIONS "a.*"
SHOW FUNCTIONS lists all the user defined and builtin functions matching the regular expression. To get all functions use ".*"


Describe Table/Column

DESCRIBE [EXTENDED] table_name[DOT col_name]
DESCRIBE [EXTENDED] table_name[DOT col_name ( [DOT field_name] | [DOT '$elem$'] | [DOT '$key$'] | [DOT '$value$'] )* ]

DESCRIBE TABLE shows the list of columns including partition columns for the given table. If the EXTENDED keyword is specified then it will show all the metadata for the table in Thrift serialized form. This is generally only useful for debugging and not for general use.
If a table has a complex column, you can examine the attributes of this column by specifying table_name.complex_col_name (and '$elem$' for an array element, '$key$' for a map key, and '$value$' for a map value). You can apply this recursively to explore the complex column type.
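For example, to look at a single column of page_view, and (hypothetically) at the value type of a map column named properties:
DESCRIBE page_view.ip;
DESCRIBE page_view.properties.'$value$';   -- only meaningful if such a map column existed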
For a view, DESCRIBE TABLE EXTENDED can be used to retrieve the view's definition. Two relevant attributes are provided: both the original view definition as specified by the user, and an expanded definition used internally by Hive.


Describe Partition

DESCRIBE [EXTENDED] table_name partition_spec


This statement lists metadata for a given partition. The output is similar to that of DESCRIBE TABLE. Presently, the column information associated with a particular partition is not used while preparing plans.
Example:
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

Wednesday 25 July 2012

Sqoop export and import commands


Sqoop Import Examples:
Sqoop import: imports data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) and its subprojects (Hive, HBase).


Import the data (MySQL table) to HBase:

Case 1: The table has a primary key, and you want to import all the columns of the MySQL table into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName  --column-family hbase_table_col1 --hbase-create-table

Case 2: The table has a primary key, and you want to import only a few columns of the MySQL table into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --columns column1,column2 --column-family hbase_table_col1 --hbase-create-table

Note: the column list specified in the --columns attribute must contain the primary key column.

Case 3: The table doesn't have a primary key, so choose one column as the hbase-row-key and import all the columns of the MySQL table into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --column-family hbase_table_col1 --hbase-row-key column1 --hbase-create-table

Case 4: The table doesn't have a primary key, so choose one column as the hbase-row-key and import only a few columns of the MySQL table into the HBase table.

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table tableName --hbase-table hbase_tableName --columns column1,column2 --column-family hbase_table_col --hbase-row-key column1 --hbase-create-table 

Note: the column name specified in the --hbase-row-key attribute must be in the --columns list; otherwise the command will execute successfully but no records will be inserted into HBase.
Note: the value of the primary key column, or of the column specified in the --hbase-row-key attribute, becomes the HBase row key. If the MySQL table doesn't have a primary key, or the column specified in --hbase-row-key doesn't have unique values, then some records will be lost.
Example: consider a MySQL table test_table which has two columns, name and address. The table test_table doesn't have a primary key or unique key column.

Records of test_table:
________________
name    address
----------------
sa    123
sb    234
sa    345
sc    456
sd    567


Run the following command to import test_table data into HBase:

$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root --table test_table --hbase-table hbase_test_table --column-family test_table_col1 --hbase-row-key name --hbase-create-table

Only 4 records are visible in the HBase table instead of 5. In the example above, two rows have the same value 'sa' in the name column, and the value of this column is used as the HBase row key. When a second record with name 'sa' arrives, it is inserted with the same row key and overwrites the columns of the previous record.
The same problem also occurs if the table has a composite primary key, because only one column from the composite key is used as the HBase row key.

Import the data (MySQL table) to Hive

Case 1: Import a MySQL table into Hive when the table has a primary key.

$ bin/sqoop-import  --connect jdbc:mysql://localhost:3306/db1 -username root -password password --table tableName  --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home

Case 2: Import a MySQL table into Hive when the table doesn't have a primary key.

$ bin/sqoop-import  --connect jdbc:mysql://localhost:3306/db1 -username root -password password --table tableName  --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home --split-by column_name

or

$ bin/sqoop-import  --connect jdbc:mysql://localhost:3306/db1 -username root -password password --table tableName  --hive-table tableName --create-hive-table --hive-import --hive-home path/to/hive_home -m 1



Import the data (MySQL table) to HDFS


Case 1: Import a MySQL table into HDFS when the table has a primary key.

$ bin/sqoop import -connect jdbc:mysql://localhost:3306/db1 -username root -password password --table tableName --target-dir /user/ankit/tableName

Case 2: Import a MySQL table into HDFS when the table doesn't have a primary key.

$ bin/sqoop import -connect jdbc:mysql://localhost:3306/db1 -username root -password password --table tableName --target-dir /user/ankit/tableName  -m 1



Sqoop Export Examples:

Sqoop export: exports data from HDFS and its subprojects (Hive, HBase) back into an RDBMS.
Export Hive table back to an RDBMS:

By default, Hive stores data using ^A as the field delimiter and \n as the row delimiter.

$ bin/sqoop export --connect jdbc:mysql://localhost/test_db --table tableName  --export-dir /user/hive/warehouse/tableName --username root --password password -m 1 --input-fields-terminated-by '\001'

where '\001' is the octal representation of ^A.

Tuesday 24 July 2012

Hadoop file system

fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode. The DataNode instances will register with this NameNode, and make their data available through it. Individual client programs will connect to this address to retrieve the locations of actual file blocks.
dfs.data.dir - This is the path on the local file system in which the DataNode instance should store its data. It is not necessary that all DataNode instances store their data under the same local path prefix, as they will all be on separate machines; it is acceptable that these machines are heterogeneous. However, it will simplify configuration if this directory is standardized throughout the system. By default, Hadoop will place this under /tmp. This is fine for testing purposes, but is an easy way to lose actual data in a production system, and thus must be overridden.
dfs.name.dir - This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. It is only used by the NameNode instance to find its information, and does not exist on the DataNodes. The caveat above about /tmp applies to this as well; this setting must be overridden in a production system.
Another configuration parameter, not listed above, is dfs.replication. This is the default replication factor for each block of data in the file system. For a production cluster, this should usually be left at its default value of 3. (You are free to increase your replication factor, though this may be unnecessary and use more space than is required. Fewer than three replicas impact the high availability of information, and possibly the reliability of its storage.)

To move a file from the local file system to HDFS

go to the hadoop bin directory and execute
>hadoop dfs -put /usr/inputfile /user/outputdestination

:)

By default, HDFS does not preserve any free space on the DataNodes; the DataNode service will continue to accept blocks until all free space on the disk is exhausted, which may cause problems. The following setting requires each DataNode to reserve at least 1 GB of free space on the drive before it writes more blocks, which helps preserve system stability:
  
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
  <final>true</final>
</property>
 
DataNodes can be configured to write blocks out to multiple disks via the dfs.data.dir property. It can take on a comma-separated list of directories. Each block is written to one of these directories. E.g., assuming that there are four disks, mounted on /d1, /d2, /d3, and /d4, the following (or something like it) should be in the configuration for each DataNode:
  
<property>
  <name>dfs.data.dir</name>
  <value>/d1/dfs/data,/d2/dfs/data,/d3/dfs/data,/d4/dfs/data</value>
  <final>true</final>
</property>
MapReduce performance can also be improved by distributing the temporary data generated by MapReduce tasks across multiple disks on each machine:
  
<property>
  <name>mapred.local.dir</name>
  <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
  <final>true</final>
</property>
The NameNode is responsible for managing metadata associated with each block in the HDFS. As the amount of information in the rack scales into the 10's or 100's of TB, this can grow to be quite sizable. The NameNode machine needs to keep the blockmap in RAM to work efficiently. Therefore, at large scale, this machine will require more RAM than other machines in the cluster. The amount of metadata can also be dropped almost in half by doubling the block size:
  
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
This changes the block size from 64MB (the default) to 128MB, which decreases pressure on the NameNode's memory. On the other hand, this potentially decreases the amount of parallelism that can be achieved, as the number of blocks per file decreases. This means fewer hosts may have sections of a file to offer to MapReduce tasks without contending for disk access. The larger the individual files involved (or the more files involved in the average MapReduce job), the less of an issue this is.
 
 
HBase will lose data unless it is running on an HDFS that has a durable sync implementation. Hadoop 0.20.2, Hadoop 0.20.203.0, and Hadoop 0.20.204.0 DO NOT have this attribute. Currently only Hadoop 0.20.205.x or any later release (this includes Hadoop 1.0.0) has a working, durable sync [6]. Sync has to be explicitly enabled by setting dfs.support.append to true on both the client side (in hbase-site.xml) and on the server side (in hdfs-site.xml). The sync facility HBase needs is a subset of the append code path.
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

An Hadoop HDFS datanode has an upper bound on the number of files that it will serve at any one time. The upper bound parameter is called xcievers (yes, this is misspelled). Again, before doing any loading, make sure you have configured Hadoop's conf/hdfs-site.xml setting the xceivers value to at least the following:
      
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
You can run a ZooKeeper ensemble that comprises only one node, but in production it is recommended that you run a ZooKeeper ensemble of 3, 5 or 7 machines; the more members an ensemble has, the more tolerant the ensemble is of host failures. Also, run an odd number of machines. In ZooKeeper, an even number of peers is supported, but it is normally not used because an even-sized ensemble requires, proportionally, more peers to form a quorum than an odd-sized ensemble requires. For example, an ensemble with 4 peers requires 3 to form a quorum, while an ensemble with 5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to fail, and is therefore more fault tolerant than the ensemble of 4, which allows only 1 down peer.
 

Sunday 22 July 2012

SQOOP Installation

After finishing the Hadoop and HBase installation, go to the Apache downloads and remember to download a Sqoop version from this list:
 sqoop-1.4.1-incubating__hadoop-0.20.tar.gz

If you instead download the source tarball, running Sqoop fails with an error like this:
Exception in thread "main" java.lang.NoClassDefFoundError: com/cloudera/sqoop/Sqoop
Caused by: java.lang.ClassNotFoundException: com.cloudera.sqoop.Sqoop
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: com.cloudera.sqoop.Sqoop.  Program will exit.
The source tarball (sqoop-1.4.1-incubating-src.tar.gz, from http://apache.tradebit.com/pub/sqoop/1.4.1-incubating/sqoop-1.4.1-incubating-src.tar.gz) doesn't include the Sqoop jar file, so you get that error.
Untar the sqoop to /usr/local/sqoop
[root@master ~]# chown hduser /usr/local/sqoop
[root@master ~]# chown hduser /usr/local/sqoop/conf
[root@master ~]# chown hduser /usr/local/sqoop/connector/mssql
[root@master ~]# chmod 755 /usr/local/sqoop/connector/mssql
[root@master ~]# chmod 755 /usr/local/sqoop/conf
[root@master ~]# chmod 755 /usr/local/sqoop/

Download the sqoop-sqlserver connector, copy it to the connectors folder, and run install.sh. Before that you should assign permissions to the folder so that the connector can be installed.
Copy hadoop-core-1.0.3.jar into the Sqoop lib folder.
Copy sqoop-sqlserver-1.0.jar and mysql-connector-java-5.1.21-bin.jar into the Sqoop lib folder.
Set the environment variables

MSSQL_CONNECTOR_HOME=/usr/local/sqoop/sqoop-sqlserver-1.0/
HADOOP_HOME=/usr/local/hadoop
SQOOP_CONF_DIR=/usr/local/sqoop/conf
SQOOP_HOME=/usr/local/sqoop
HBASE_HOME=/usr/local/hbase-0.92.1/
HADOOP_CLASSPATH=:/usr/local/sqoop/sqoop-1.4.1-incubating.jar

Copy the SQL Server JDBC connector (sqljdbc_4.0.2206.100_enu.tar.gz) and extract its two JDBC jars into
$SQOOP_HOME/lib

Now you are ready to execute import and export commands :)
Go to the Downloads directory, unzip the tar and copy the jar file to /usr/lib/sqoop/lib
  • $ cd Downloads                                                      (assumes starting from the home directory)
  • $ tar -zxvf sqljdbc_4.0.2206.100_enu.tar.gz   (Check the version number you downloaded)
  • $ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
Now that we have the JDBC driver in place, we're ready to install the SQL Server connector. Assuming you downloaded it in the step above, unzip the tar, move the directory to /usr/local, set the environment variable and permissions, then install it…
  • $ tar -zxvf sqoop-sqlserver-1.0.tar.gz
  • $ sudo mv sqoop-sqlserver-1.0 /usr/local
  • $ export MSSQL_CONNECTOR_HOME=/usr/local/sqoop-sqlserver-1.0
In the Install Hadoop example, we set up a Hadoop group; change the ownership of the sqoop directories now and you'll save time fiddling around with permissions later.
  • $ sudo chown -R hduser:hadoop /usr/lib/sqoop
  • $ sudo chown -R hduser:hadoop /etc/sqoop
Now install the connector
  • $ cd /usr/local/sqoop-sqlserver-1.0
  • $ ./install.sh

Copy the jar files into the Hadoop lib folder as well; otherwise you may get connectivity errors.
Enjoy sqooping :)

Tuesday 17 July 2012

Step By Step installation of HBase

For hadoop installation check this out
http://biforbeginners.blogspot.in/2012/07/step-by-step-installation-of-hadoop-on.html


1) Extract HBase in /usr/local/hbase-0.92.1
$ cd /usr/local
$ sudo tar xzf hbase-0.92.1.tar.gz
$ sudo chown -R hduser hbase-0.92.1
 
2) In hbase-env.sh
For master and slaves, add or uncomment these
 
 export JAVA_HOME=/usr/java/jdk1.6.0_33
 export HBASE_CLASSPATH=/usr/local/hadoop/conf
 export HBASE_HEAPSIZE=1000
 export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
 export HBASE_MANAGES_ZK=true
 
 



3) In hbase-site.xml
For master and slaves, add

 

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2222</value>
</property>
<property>
  <name>hbase.master</name>
  <value>master:60000</value>
</property>
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>300</value>
</property>
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/hadoop/zookeeper/data</value>
</property>


4) In /usr/local/hbase-0.92.1/conf/regionservers, write
master
slave
 
 

5) Copy the hadoop-core-0.20.203.0.jar from the hadoop folder to the hbase/lib folder.
 
6) Copy commons-configuration-1.6.jar from the hadoop lib folder to the hbase lib folder.


7) Create a directory /hadoop/zookeeper/data and put a myid file in it.
For master
mkdir -p /hadoop/zookeeper/data && echo '0' > /hadoop/zookeeper/data/myid 

For Slaves
mkdir -p /hadoop/zookeeper/data && echo '10' > /hadoop/zookeeper/data/myid
 
Caution: make sure you assign permissions so that the hadoop group and hduser can access this folder, else it will show a permission denied error.