CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
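As a minimal sketch of the IF NOT EXISTS behavior described above, running the same statement twice should not raise an error on the second run. The table name demo_events and its columns are hypothetical, chosen only for illustration.

```sql
-- demo_events is a hypothetical table name used only for illustration.
CREATE TABLE IF NOT EXISTS demo_events (event_id BIGINT, event_name STRING);

-- Re-running the statement is skipped rather than failing,
-- because a table named demo_events already exists.
CREATE TABLE IF NOT EXISTS demo_events (event_id BIGINT, event_name STRING);
```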
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data).
You can create tables with a custom SerDe or with a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files. Use the SERDE clause to create a table with a custom SerDe.
You must specify a list of columns for tables that use a native SerDe. A list of columns for tables that use a custom SerDe may be specified, but Hive will query the SerDe to determine the actual list of columns for this table.
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
Use STORED BY to create a non-native table, for example in HBase. See Hive/StorageHandlers for more information on this option.
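For illustration, a non-native table backed by HBase might be declared roughly as follows. This is a sketch rather than a tested configuration: it assumes the Hive HBase storage handler is available on the classpath, and the table name hbase_table_1, the column family cf1, and the HBase table name xyz are placeholder values.

```sql
-- Assumption: the Hive HBase storage handler jar is on the classpath.
-- hbase_table_1, cf1, and xyz are placeholder names for illustration.
CREATE TABLE hbase_table_1 (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
```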
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within each bucket via SORTED BY columns. This can be used to improve performance on certain kinds of queries.
Table names and column names are case insensitive but SerDe and property names are case sensitive. Table and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs.
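A short sketch of single-quoted comments and user-defined TBLPROPERTIES on a new table. The table name, column names, and property keys below are all hypothetical; Hive does not interpret these particular keys.

```sql
-- All names and property keys here are hypothetical examples.
CREATE TABLE IF NOT EXISTS page_view_notes (
  note_id BIGINT COMMENT 'Hypothetical identifier column',
  note    STRING COMMENT 'Hypothetical free-text column'
)
COMMENT 'Hypothetical table used only for illustration'
TBLPROPERTIES ('owner_team' = 'analytics', 'reviewed' = 'false');
```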
Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
CTAS has two parts. The SELECT part can be any SELECT statement supported by HiveQL. The CREATE part takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. The only restrictions in CTAS are that the target table cannot be a partitioned table and cannot be an external table.
The first example statement below creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned, and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
The second example, which spells out ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001', creates the same table as the first.
In the third example, the page_view table is bucketed (clustered by) userid, and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column, in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. The MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
In all the examples so far, the data is stored under the Hive warehouse directory (set by hive.metastore.warehouse.dir) in the sub-directory page_view.
You can use the CREATE EXTERNAL TABLE example to create a page_view table that points to any HDFS location for its storage. You still have to make sure that the data is delimited as specified in that statement.
The CTAS example creates the target table new_key_value_store with the schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the SELECT statement. If the SELECT statement does not specify column aliases, the column names are automatically assigned as _col0, _col1, _col2, and so on. In addition, the new target table is created using a specific SerDe and a storage format independent of the source tables in the SELECT statement.
In contrast, the CREATE TABLE ... LIKE example creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.
There is also an example of creating and populating bucketed tables.
DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current directory if Trash is configured. The metadata is completely lost.
You can use ALTER TABLE ADD PARTITION to add partitions to a table. Partition values should be quoted only if they are strings.
You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and metadata for this partition.
ALTER TABLE ... RENAME TO lets you change the name of a table to a different name.
Note: a rename on a managed table moves its HDFS location as well as changing the name in the metadata store.
ALTER TABLE ... CHANGE COLUMN lets users change a column's name, data type, comment, or position, or an arbitrary combination of them.
NOTE: The column change command will only modify Hive's metadata, and will NOT touch the actual data. Users should make sure the actual data layout conforms with the metadata definition.
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe). REPLACE COLUMNS can also be used to drop columns by omitting them from the new column list.
Note that this does not delete the underlying data; it only changes the schema in the metadata.
You can use ALTER TABLE ... SET TBLPROPERTIES to add your own metadata to tables. Currently the last_modified_user and last_modified_time properties are automatically added and managed by Hive. Users can add their own properties to this list. You can run DESCRIBE EXTENDED on the table to get this information.
ALTER TABLE ... SET SERDEPROPERTIES lets you add user-defined metadata to the table's SerDe object. The SerDe properties are passed to the table's SerDe when it is initialized by Hive to serialize and deserialize data, so users can store any information required for their custom SerDe here. Refer to the SerDe section of the Users Guide for more information.
These statements change the table's physical storage properties. For available file_format options, see the section above on CREATE TABLE.
The file format can also be altered for a single partition.
NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.
TOUCH reads the metadata and writes it back. This has the effect of causing the pre/post execute hooks to fire. An example use case is if you have a hook that logs all the tables/partitions that were modified, along with an external script that alters the files on HDFS directly. Since the script modifies files outside of Hive, the modification wouldn't be logged by the hook. The external script could call TOUCH to fire the hook and mark the said table or partition as modified.
Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as well.
Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table)
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists. Use IF NOT EXISTS to skip the error.
If no column names are supplied, the names of the view's columns will be derived automatically from the defining SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.) When renaming columns, column comments can also optionally be supplied. (Comments are not automatically inherited from underlying columns.)
A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.
A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-level clauses are evaluated after the view clauses (and after any other operations in the query). For example, if a view specifies LIMIT 5, and a referencing query is executed as (SELECT * FROM v LIMIT 10), then at most 5 rows will be returned.
DROP VIEW removes metadata for the specified view. (It is illegal to use DROP TABLE on a view.)
When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).
SHOW TABLES lists all the base tables and views with names matching the given regular expression. The regular expression can contain only '*' for any character[s] or '|' for a choice. Examples are 'page_view', 'page_v*', and '*view|page*', all of which match the 'page_view' table. Matching tables are listed in alphabetical order. It is not an error if no matching tables are found in the metastore.
SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order.
It is also possible to specify parts of a partition specification to filter the resulting list.
SHOW TABLE EXTENDED lists information for all tables matching the given regular expression. Users cannot use a regular expression for the table name if a partition specification is present. This command's output includes basic table information and file system information such as totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If a partition is present, it outputs the given partition's file system information instead of the table's file system information.
SHOW FUNCTIONS lists all the user-defined and built-in functions matching the regular expression. To get all functions, use ".*".
DESCRIBE TABLE shows the list of columns, including partition columns, for the given table. If the EXTENDED keyword is specified, it shows all the metadata for the table in Thrift serialized form. This is generally only useful for debugging and not for general use.
If a table has a complex column, you can examine the attributes of this column by specifying table_name.complex_col_name (and '$elem$' for array element, '$key$' for map key, and '$value$' for map value). You can specify this recursively to explore the complex column type.
For a view, DESCRIBE TABLE EXTENDED can be used to retrieve the view's definition. Two relevant attributes are provided: both the original view definition as specified by the user, and an expanded definition used internally by Hive.
This statement lists metadata for a given partition. The output is similar to that of DESCRIBE TABLE. Presently, the column information associated with a particular partition is not used while preparing plans.
Here's an example statement to create a table:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
In all the examples until now the data is stored under the Hive warehouse directory (set by hive.metastore.warehouse.dir) in the sub-directory page_view.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '' ;
CREATE TABLE new_key_value_store
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile AS
SELECT (KEY % 1024) new_key, concat(KEY, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
Before using this command, the table key_value_store can be created and loaded as follows:
CREATE TABLE key_value_store (
KEY int,
value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'file:///C:/Users/nkaushik/Desktop/key_value.txt'
INTO TABLE key_value_store;
The file key_value.txt is tab delimited and contains the following entries:
1 :value_1
2 :value_2
3 :value_3
4 :value_4
5 :value_5
CREATE TABLE empty_key_value_store
LIKE key_value_store;
Inserting Data Into Bucketed Tables
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table -- only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query. There is also an example of creating and populating bucketed tables.
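A rough sketch of such an insert for the 32-bucket page_view table defined earlier. It assumes a hypothetical staging table page_view_stg holding the same columns plus a country column. Because the buckets are clustered on userid but sorted on viewTime, the sketch uses DISTRIBUTE BY with SORT BY; CLUSTER BY userid would be shorthand for distributing and sorting on userid alone.

```sql
-- Assumption: page_view_stg is a hypothetical staging table with the
-- page_view columns plus a country column.
-- Match the number of reducers to the 32 buckets declared on page_view.
SET mapred.reduce.tasks = 32;

-- DISTRIBUTE BY sends all rows for one userid to the same reducer;
-- SORT BY orders the rows within each reducer by viewTime.
INSERT OVERWRITE TABLE page_view PARTITION (dt = '2008-08-08', country = 'us')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_stg
WHERE country = 'us'
DISTRIBUTE BY userid
SORT BY viewTime;
```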
Drop Table
DROP TABLE table_name
- When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
- When dropping a table referenced by views, no warning is given (the views are left dangling as invalid and must be dropped or recreated by the user).
Alter Table Statements
Alter table statements enable you to change the structure of an existing table. You can add columns/partitions, change serde, add table and SerDe properties, or rename the table itself.
Add Partitions
ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us')
location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
Drop Partitions
ALTER TABLE table_name DROP partition_spec, partition_spec,...
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
Rename Table
ALTER TABLE table_name RENAME TO new_table_name
Note: a rename on a managed table moves its HDFS location as well as changing the name in the metadata store.
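For instance, renaming the page_view table used in the earlier examples might look like this (page_view_archive is a hypothetical new name):

```sql
-- page_view_archive is a hypothetical target name.
-- For a managed table this also moves the table's HDFS directory.
ALTER TABLE page_view RENAME TO page_view_archive;
```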
Change Column Name/Type/Position/Comment
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command will allow users to change a column's name, data type, comment, or position, or an arbitrary combination of them. Examples:
CREATE TABLE test_change (a int, b int, c int);
-- First, change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;
-- Next, change column a1's name to a2 and its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
-- The new table's structure is: b int, a2 string, c int.
-- Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
-- The new table's structure is: c1 int, b int, a2 string.
NOTE: The column change command will only modify Hive's metadata, and will NOT touch the actual data. Users should make sure the actual data layout conforms with the metadata definition.
Add/Replace Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe). REPLACE COLUMNS can also be used to drop columns. For example:
ALTER TABLE test_change REPLACE COLUMNS (a int, b int);
-- Starting from the original schema (a int, b int, c int), this removes column 'c' from test_change's schema.
Note that this does not delete the underlying data; it only changes the schema in the metadata.
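ADD COLUMNS follows the same syntax and appends the new columns after the existing ones (but before any partition columns). A minimal sketch, where the column d is hypothetical:

```sql
-- d is a hypothetical new column, appended after the existing columns.
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'Hypothetical new column');
```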
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties
table_properties:
: (property_name = property_value, property_name = property_value, ... )
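As an illustration, a user-defined property could be attached to the page_view table and then inspected. The property name data_owner and its value are hypothetical.

```sql
-- data_owner is a hypothetical, user-defined property name.
ALTER TABLE page_view SET TBLPROPERTIES ('data_owner' = 'analytics-team');

-- The property should then appear in the extended table metadata.
DESCRIBE EXTENDED page_view;
```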
Add Serde Properties
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties
serde_properties:
: (property_name = property_value, property_name = property_value, ... )
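A small sketch of setting a SerDe property. It assumes page_view uses the default delimited-text SerDe, which reads the field.delim property; a custom SerDe would define its own property names.

```sql
-- Assumption: page_view uses the default delimited-text SerDe,
-- which reads 'field.delim' as the field separator.
-- This changes metadata only; existing files are not rewritten.
ALTER TABLE page_view SET SERDEPROPERTIES ('field.delim' = ',');
```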
Alter Table File Format and Organization
ALTER TABLE table_name [partitionSpec] SET FILEFORMAT file_format;
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS
The file format can also be altered for a single partition.
NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.
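The statements below sketch each form against the page_view table from the earlier examples. The partition values and the bucket count of 64 are illustrative only, and none of these statements rewrites existing data files.

```sql
-- Change the file format recorded for the table.
ALTER TABLE page_view SET FILEFORMAT TEXTFILE;

-- Change the file format recorded for one partition only.
ALTER TABLE page_view PARTITION (dt = '2008-08-08', country = 'us')
SET FILEFORMAT SEQUENCEFILE;

-- Change the bucketing metadata (64 is an arbitrary illustrative value).
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 64 BUCKETS;
```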
Alter Table/Partition Location
ALTER TABLE table_name [partitionSpec] SET LOCATION "new location"
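For example, pointing one partition of page_view at a different directory might look like the following. The host name, port, and path are hypothetical placeholders, not values from this document.

```sql
-- The URI below is a hypothetical placeholder.
ALTER TABLE page_view PARTITION (dt = '2008-08-08', country = 'us')
SET LOCATION 'hdfs://namenode.example.com:8020/data/page_view/us/part080808';
```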
Alter Table Touch
ALTER TABLE table_name TOUCH;
ALTER TABLE table_name TOUCH PARTITION partition_spec;
Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as well.
Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table)
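An external script that has just modified files on HDFS could issue statements like these to fire the hooks; the partition values are illustrative.

```sql
-- Touch the whole table.
ALTER TABLE page_view TOUCH;

-- Touch a single, already existing partition.
ALTER TABLE page_view TOUCH PARTITION (dt = '2008-08-08', country = 'us');
```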
Create/Drop View
Create View
CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
If no column names are supplied, the names of the view's columns will be derived automatically from the defining SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.) When renaming columns, column comments can also optionally be supplied. (Comments are not automatically inherited from underlying columns.)
A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.
A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible fashion, subsequent attempts to query the invalid view will fail.
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-level clauses are evaluated after the view clauses (and after any other operations in the query). For example, if a view specifies LIMIT 5, and a referencing query is executed as (SELECT * FROM v LIMIT 10), then at most 5 rows will be returned.
Example of view creation:
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';
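The interaction between a view-level LIMIT and a query-level LIMIT described earlier can be sketched as follows. The view name recent_views is hypothetical.

```sql
-- recent_views is a hypothetical view name.
CREATE VIEW recent_views AS
SELECT userid, page_url, viewTime
FROM page_view
ORDER BY viewTime DESC
LIMIT 5;

-- The view's LIMIT 5 is evaluated first, so the outer LIMIT 10
-- can return at most 5 rows.
SELECT * FROM recent_views LIMIT 10;
```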
Drop View
DROP VIEW view_name
When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).
Example:
DROP VIEW onion_referrers;
Show Tables
SHOW TABLES identifier_with_wildcards
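For instance, assuming a page_view table exists, either of these patterns should list it:

```sql
-- '*' matches any characters.
SHOW TABLES 'page_v*';

-- '|' separates alternative patterns.
SHOW TABLES '*view|page*';
```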
Show Partitions
SHOW PARTITIONS table_name
It is also possible to specify parts of a partition specification to filter the resulting list.
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');
SHOW PARTITIONS table_name PARTITION(hr='12');
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');
Show Table/Partitions Extended
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE identifier_with_wildcards [PARTITION(partition_desc)]
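A brief sketch of both forms, using the page_view table and illustrative partition values. A wildcard is only allowed in the first form, since no partition specification is given there.

```sql
-- File system information for every table matching the pattern.
SHOW TABLE EXTENDED LIKE 'page_v*';

-- File system information for one partition; the table name must be exact.
SHOW TABLE EXTENDED LIKE 'page_view' PARTITION (dt = '2008-08-08', country = 'us');
```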
Show Functions
SHOW FUNCTIONS "a.*"
Describe Table/Column
DESCRIBE [EXTENDED] table_name[DOT col_name]
DESCRIBE [EXTENDED] table_name[DOT col_name ( [DOT field_name] | [DOT '$elem$'] | [DOT '$key$'] | [DOT '$value$'] )* ]
If a table has a complex column, you can examine the attributes of this column by specifying table_name.complex_col_name (and '$elem$' for array element, '$key$' for map key, and '$value$' for map value). You can specify this recursively to explore the complex column type.
For a view, DESCRIBE TABLE EXTENDED can be used to retrieve the view's definition. Two relevant attributes are provided: both the original view definition as specified by the user, and an expanded definition used internally by Hive.
Describe Partition
DESCRIBE [EXTENDED] table_name partition_spec
This statement lists metadata for a given partition. The output is similar to that of DESCRIBE TABLE. Presently, the column information associated with a particular partition is not used while preparing plans.
Example:
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');