Thursday, 26 July 2012

Hive tutorial

CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
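As a minimal sketch of the IF NOT EXISTS form (the table name page_view_demo and its columns are hypothetical), running the same statement twice should succeed both times, with the second run leaving the existing table untouched:

```sql
-- page_view_demo is a hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS page_view_demo (viewTime INT, userid BIGINT);

-- Running it again is skipped rather than raising an error, because the table already exists.
CREATE TABLE IF NOT EXISTS page_view_demo (viewTime INT, userid BIGINT);
```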
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).
You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files. Use the SERDE clause to create a table with custom SerDe. 
You must specify a list of columns for tables that use a native SerDe. A list of columns for tables that use a custom SerDe may be specified, but Hive will query the SerDe to determine the actual list of columns for this table.
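For illustration, a table with a custom SerDe might look like the sketch below. It assumes the RegexSerDe class from the Hive contrib jar is available on the classpath; the table name, columns, and regular expression are made up for this example.

```sql
-- Assumes the hive-contrib jar has been made available (for example with ADD JAR).
-- raw_requests and its columns are hypothetical.
-- The first capture group maps to host, the second to request.
CREATE TABLE raw_requests (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;
```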
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
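A sketch of the INPUTFORMAT/OUTPUTFORMAT form, using the Base64 classes named above. The table name base64_lines is hypothetical, and the example assumes the Hive contrib jar that ships these classes is on the classpath:

```sql
-- base64_lines is a hypothetical table; the class names are passed as string literals.
CREATE TABLE base64_lines (line STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';
```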
Use STORED BY to create a non-native table, for example in HBase. See Hive/StorageHandlers for more information on this option.
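As a rough sketch of STORED BY with the HBase storage handler: the Hive table name, the HBase table name, and the column mapping below are illustrative assumptions, and the statement presumes the Hive HBase handler jars and a reachable HBase cluster.

```sql
-- hbase_page_counts and "page_counts" are hypothetical names.
-- The mapping stores the Hive key column as the HBase row key
-- and the value column in column family cf1, qualifier val.
CREATE TABLE hbase_page_counts (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "page_counts");
```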
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns, and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within each bucket via SORTED BY columns. This can improve performance on certain kinds of queries.
Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
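A small sketch of comments and TBLPROPERTIES together; the table name and the property keys and values are invented for illustration:

```sql
-- page_view_tagged, owner_team and source_system are hypothetical names.
CREATE TABLE page_view_tagged (
  viewTime INT COMMENT 'Time of the page view',
  userid BIGINT
)
COMMENT 'Page views tagged with ownership metadata'
TBLPROPERTIES ('owner_team' = 'analytics', 'source_system' = 'web_logs');
```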
Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
A CTAS statement has two parts. The SELECT part can be any SELECT statement supported by HiveQL. The CREATE part takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restrictions in CTAS are that the target table cannot be a partitioned table, nor can it be an external table.
Examples:
Here's an example statement to create a table:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 STORED AS SEQUENCEFILE;
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
The statement above creates the same table as the previous one, but spells out the default field delimiter (ctrl-A, '\001') explicitly.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;
In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
In all the examples so far, the data is stored under the Hive warehouse directory (set by hive.metastore.warehouse.dir), in the sub-directory page_view.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';
You can use the above statement to create a page_view table which points to any hdfs location for its storage. But you still have to make sure that the data is delimited as specified in the query above.
CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile AS
SELECT (KEY % 1024) new_key, concat(KEY, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
The above CTAS statement creates the target table new_key_value_store with the schema (new_key INT, key_value_pair STRING) derived from the results of the SELECT statement, given that KEY is declared as INT in key_value_store (the % operator returns the common type of its operands). If the SELECT statement does not specify column aliases, the column names are automatically assigned as _col0, _col1, _col2, etc. In addition, the new target table is created using a specific SerDe and a storage format independent of the source tables in the SELECT statement.
Before using this command, the table key_value_store can be created and loaded as follows:
CREATE TABLE key_value_store (
    KEY int,
    value string
)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;


LOAD DATA LOCAL INPATH 'file:///C:/Users/nkaushik/Desktop/key_value.txt'
INTO TABLE key_value_store;

The file key_value.txt is tab delimited and contains the following entries:

1    :value_1
2    :value_2
3    :value_3
4    :value_4
5    :value_5


CREATE TABLE empty_key_value_store
LIKE key_value_store;
In contrast, the statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.

Inserting Data Into Bucketed Tables

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
There is also an example of creating and populating bucketed tables.
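A minimal sketch of such an insert into the bucketed page_view table defined earlier. The staging table page_view_stg and the partition values are assumptions made for this example. It uses DISTRIBUTE BY together with SORT BY because the bucketing column (userid) and the sort column (viewTime) differ; CLUSTER BY userid would be the shorthand if both were the same column.

```sql
-- Match the number of reducers to the 32 buckets declared in CREATE TABLE.
SET mapred.reduce.tasks = 32;

-- page_view_stg is a hypothetical staging table holding the raw rows,
-- with the same columns as page_view plus a country column.
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='us')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_stg
WHERE country = 'us'
DISTRIBUTE BY userid
SORT BY viewTime;
```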


Drop Table

DROP TABLE table_name
DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current directory if Trash is configured. The metadata is completely lost.
  • When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
  • When dropping a table referenced by views, no warning is given.
    • The views are left dangling as invalid and must be dropped or recreated by the user.
See the next section on ALTER TABLE for how to drop partitions.


Alter Table Statements

ALTER TABLE statements enable you to change the structure of an existing table. You can add columns or partitions, change the SerDe, add table and SerDe properties, or rename the table itself.

Add Partitions

ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
 
partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
You can use ALTER TABLE ADD PARTITION to add partitions to a table. Partition values should be quoted only if they are strings.
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') 
              location '/path/to/us/part080808' 
              PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';



Drop Partitions

ALTER TABLE table_name DROP partition_spec, partition_spec,...
You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and metadata for this partition.
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');


Rename Table

ALTER TABLE table_name RENAME TO new_table_name
This statement lets you change the name of a table to a different name.
Note: a rename on a managed table moves its HDFS location as well as changing the name in the metadata store.
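For instance, a rename of the page_view table might look like this (the new name page_view_archive is hypothetical):

```sql
-- For a managed table this also moves the table's directory on HDFS.
ALTER TABLE page_view RENAME TO page_view_archive;
```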


Change Column Name/Type/Position/Comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
 

This command will allow users to change a column's name, data type, comment, or position, or an arbitrary combination of them.
Examples (each ALTER statement below is shown as applied to the table as originally created):
CREATE TABLE test_change (a int, b int, c int);
 
ALTER TABLE test_change CHANGE a a1 INT;
    -- changes column a's name to a1
 
ALTER TABLE test_change CHANGE a a1 STRING AFTER b;
    -- changes column a's name to a1 and its data type to string, and puts it after column b.
    -- The new table's structure is: b int, a1 string, c int
 
ALTER TABLE test_change CHANGE b b1 INT FIRST;
    -- changes column b's name to b1 and puts it as the first column.
    -- The new table's structure is: b1 int, a int, c int

NOTE: The column change command will only modify Hive's metadata, and will NOT touch the actual data. Users should make sure the actual data layout conforms with the metadata definition.


Add/Replace Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe). REPLACE COLUMNS can also be used to drop columns. For example:
ALTER TABLE test_change REPLACE COLUMNS (a int, b int);
      -- removes column 'c' from test_change's schema

Note that this does not delete the underlying data; it only changes the schema stored in the metadata.
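For completeness, a sketch of the ADD COLUMNS form on the same test_change table; the column name d and its comment are made up for this example:

```sql
-- Appends d after the existing columns (and before any partition columns).
-- Only the metadata changes; existing data files are not rewritten.
ALTER TABLE test_change ADD COLUMNS (d INT COMMENT 'hypothetical new column');
```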


Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties
 
table_properties:
  : (property_name = property_value, property_name = property_value, ... )
You can use this statement to add your own metadata to the tables. Currently the last_modified_user and last_modified_time properties are automatically added and managed by Hive. Users can add their own properties to this list. You can use DESCRIBE EXTENDED table_name to get this information.
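A short sketch of setting and then inspecting a table property; the property name and value are hypothetical:

```sql
-- 'notes' is a made-up property key used only for illustration.
ALTER TABLE page_view SET TBLPROPERTIES ('notes' = 'reviewed by data team');

-- The property should appear in the extended table description,
-- alongside last_modified_user and last_modified_time.
DESCRIBE EXTENDED page_view;
```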


Add Serde Properties

ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties
 
serde_properties:
  : (property_name = property_value, property_name = property_value, ... )
These statements let you add user-defined metadata to the table's SerDe object. The SerDe properties are passed to the table's SerDe when Hive initializes it to serialize and deserialize data, so users can store any information required by their custom SerDe here. Refer to the SerDe section of the Users Guide for more information.
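As an illustration, the sketch below changes the field delimiter property read by the native delimited-text SerDe on a hypothetical table. Like the other ALTER statements, it is assumed to change only metadata, so the existing files must already be comma-delimited.

```sql
-- staging_page_view is a hypothetical text table; 'field.delim' is the
-- delimiter property used by the native delimited SerDe.
ALTER TABLE staging_page_view SET SERDEPROPERTIES ('field.delim' = ',');
```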


Alter Table File Format and Organization

ALTER TABLE table_name [partitionSpec] SET FILEFORMAT file_format;
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS
These statements change the table's physical storage properties. For available file_format options, see the section above on CREATE TABLE.
The file format can also be changed for a single partition.
NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.
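A sketch of both statements against the page_view table from the earlier examples; the partition values and the bucket count of 64 are arbitrary choices for illustration:

```sql
-- Change the file format recorded for the whole table.
ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;

-- Change the file format recorded for one partition only.
ALTER TABLE page_view PARTITION (dt='2008-08-08', country='us') SET FILEFORMAT TEXTFILE;

-- Change the declared bucketing; existing files are not re-bucketed.
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 64 BUCKETS;
```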


Alter Table/Partition Location

ALTER TABLE table_name [partitionSpec] SET LOCATION "new location"
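A hedged example of pointing one partition at a new directory. The namenode host, port, and path are placeholders invented for this sketch, and the statement is assumed to update only the location recorded in the metadata without moving any files:

```sql
-- The URI below is hypothetical; substitute a location that exists on your cluster.
ALTER TABLE page_view PARTITION (dt='2008-08-08', country='us')
SET LOCATION 'hdfs://namenode.example.com:8020/data/page_view/us/080808';
```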


Alter Table Touch

ALTER TABLE table_name TOUCH;
ALTER TABLE table_name TOUCH PARTITION partition_spec;
TOUCH reads the metadata, and writes it back. This has the effect of causing the pre/post execute hooks to fire. An example use case is if you have a hook that logs all the tables/partitions that were modified, along with an external script that alters the files on HDFS directly. Since the script modifies files outside of Hive, the modification wouldn't be logged by the hook. The external script could call TOUCH to fire the hook and mark the said table or partition as modified.
Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as well.
Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table)
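For example, an external script that rewrote files for one partition of page_view might finish by issuing statements like these (the partition values are illustrative):

```sql
-- Fire the pre/post execute hooks for the whole table.
ALTER TABLE page_view TOUCH;

-- Fire the hooks for a single, already existing partition.
ALTER TABLE page_view TOUCH PARTITION (dt='2008-08-08', country='us');
```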


Create/Drop View

Create View

CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
If no column names are supplied, the names of the view's columns will be derived automatically from the defining SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.) When renaming columns, column comments can also optionally be supplied. (Comments are not automatically inherited from underlying columns.)
A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.
A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-level clauses are evaluated after the view clauses (and after any other operations in the query). For example, if a view specifies LIMIT 5, and a referencing query is executed as (SELECT * FROM v LIMIT 10), then at most 5 rows will be returned.
Example of view creation:
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';
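The interaction between a view-level LIMIT and a query-level LIMIT described earlier can be sketched as follows; the view name first_five_views is hypothetical:

```sql
-- first_five_views is a hypothetical view over the page_view table.
CREATE VIEW first_five_views AS
SELECT * FROM page_view LIMIT 5;

-- The view's LIMIT 5 is applied first, so this returns at most 5 rows, not 10.
SELECT * FROM first_five_views LIMIT 10;
```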


Drop View

DROP VIEW view_name
DROP VIEW removes metadata for the specified view. (It is illegal to use DROP TABLE on a view.)
When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).
Example:
DROP VIEW onion_referrers;


Show Tables

SHOW TABLES identifier_with_wildcards
SHOW TABLES lists all the base tables and views with names matching the given regular expression. The regular expression can contain only '*' for any character(s) or '|' for a choice. Examples are 'page_view', 'page_v*', and '*view|page*', all of which match the 'page_view' table. Matching tables are listed in alphabetical order. It is not an error if no matching tables are found in the metastore.
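Written out as statements, the patterns above would look like this sketch, assuming the page_view table exists:

```sql
-- Exact name.
SHOW TABLES 'page_view';

-- '*' matches any run of characters.
SHOW TABLES 'page_v*';

-- '|' separates alternatives; either side may match.
SHOW TABLES '*view|page*';
```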


Show Partitions

SHOW PARTITIONS table_name
SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order.
It is also possible to specify parts of a partition specification to filter the resulting list.
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');
SHOW PARTITIONS table_name PARTITION(hr='12');
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');




Show Table/Partitions Extended

SHOW TABLE EXTENDED [IN|FROM database_name] LIKE identifier_with_wildcards [PARTITION(partition_desc)]
SHOW TABLE EXTENDED lists information for all tables matching the given regular expression. Users cannot use a regular expression for the table name if a partition specification is present. This command's output includes basic table information and file system information such as totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If a partition is specified, the command outputs that partition's file system information instead of the table's.
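A sketch of both forms against the page_view table from the earlier examples; the partition values are illustrative, and quoting of the pattern may vary between Hive versions:

```sql
-- Wildcard form: file system details for every table matching the pattern.
SHOW TABLE EXTENDED LIKE 'page_v*';

-- Partition form: the table name must be exact, with no wildcards.
SHOW TABLE EXTENDED LIKE page_view PARTITION (dt='2008-08-08', country='us');
```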


Show Functions

SHOW FUNCTIONS "a.*"
SHOW FUNCTIONS lists all the user-defined and built-in functions matching the regular expression. To get all functions, use ".*".


Describe Table/Column

DESCRIBE [EXTENDED] table_name[DOT col_name]
DESCRIBE [EXTENDED] table_name[DOT col_name ( [DOT field_name] | [DOT '$elem$'] | [DOT '$key$'] | [DOT '$value$'] )* ]

DESCRIBE TABLE shows the list of columns including partition columns for the given table. If the EXTENDED keyword is specified then it will show all the metadata for the table in Thrift serialized form. This is generally only useful for debugging and not for general use.
If a table has a complex column, you can examine the attributes of this column by specifying table_name.complex_col_name (and '$elem$' for an array element, '$key$' for a map key, and '$value$' for a map value). You can specify this recursively to explore the complex column type.
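As a sketch, assuming a hypothetical table user_profile with one array column and one map column, the complex types could be explored like this:

```sql
-- user_profile, tags and attrs are hypothetical names used for illustration.
CREATE TABLE user_profile (
  userid BIGINT,
  tags ARRAY<STRING>,
  attrs MAP<STRING, INT>
);

-- Describe the array column, then its element type.
DESCRIBE user_profile.tags;
DESCRIBE user_profile.tags.$elem$;

-- Describe the key and value types of the map column.
DESCRIBE user_profile.attrs.$key$;
DESCRIBE user_profile.attrs.$value$;
```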
For a view, DESCRIBE TABLE EXTENDED can be used to retrieve the view's definition. Two relevant attributes are provided: both the original view definition as specified by the user, and an expanded definition used internally by Hive.


Describe Partition

DESCRIBE [EXTENDED] table_name partition_spec


This statement lists metadata for a given partition. The output is similar to that of DESCRIBE TABLE. Presently, the column information associated with a particular partition is not used while preparing plans.
Example:
DESCRIBE EXTENDED page_view PARTITION (dt='2008-08-08', country='us');
