Tuesday, 27 November 2012

Pig Basics

Pig raises the level of abstraction for processing large datasets. MapReduce allows you, as the programmer, to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge. With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data is much more powerful.


Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.


A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs.

Installing and Running Pig

Download the latest version of Pig from the following link (Pig Installation) and unpack it:
$ tar xzf pig-0.7.0.tar.gz
Set the Pig environment variables (PIG_INSTALL should point to the unpacked directory, not the tarball):
$ export PIG_INSTALL=/home/user1/pig-0.7.0
$ export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.

Pig has two execution types or modes: 

1) Local mode: Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets.

$ pig -x local

grunt>
This starts Grunt, the Pig interactive shell.

2) MapReduce mode: Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster.


Set the HADOOP_HOME environment variable so that Pig can find which Hadoop client to run.

Either of the following commands runs Pig in MapReduce mode:
$ pig
$ pig -x mapreduce
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:


Script: Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig.
$ pig script.pig

Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.

Embedded: You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.

PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.

PigTools and EditorPlugins for pig can be downloaded from PigTools

Example of Pig in Interactive Mode (Grunt)

max_cgpa.pig


-- max_cgpa.pig: Finds the maximum cgpa for each specialization

records = LOAD 'pigsample.txt'
AS (name:chararray, spl:chararray, cgpa:float);
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;
grouped_records = GROUP filtered_records BY spl;
max_cgpa = FOREACH grouped_records GENERATE group, MAX(filtered_records.cgpa);
STORE max_cgpa INTO 'output/cgpa_out';
The above Pig script finds the maximum cgpa for each specialization.
pigsample.txt (input to the Pig script)

raghu     ece     9
kumar    cse      8.5
biju       ece      8
mukul    cse      8.6
ashish   ece      7.0
subha    cse      8.3
ramu     ece     -8.3
rahul     cse      11.4
budania ece      5.4
The first column is the name, the second column is the specialization, and the third column is the cgpa. By default, Pig expects the columns to be separated by a tab character.

$ pig max_cgpa.pig
Output : 
(cse,8.6F)
(ece,9.0F)
Analysis : 
Statement : 1
records = LOAD 'pigsample.txt' AS (name:chararray, spl:chararray, cgpa:float);

This loads the input file from the file system (HDFS, local, or Amazon S3) into the relation records. The name:chararray notation describes the field’s name and type; chararray is like a Java String, and float is like a Java float.
grunt> DUMP records;
(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(ramu,ece,-8.3F)
(rahul,cse,11.4F)
(budania,ece,5.4F)
Each input line is converted into a tuple, with the fields separated by commas.
grunt> DESCRIBE records;
records: {name: chararray,spl: chararray,cgpa: float}
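As a rough analogy in plain Python (not Pig code), the LOAD ... AS step can be thought of as splitting each line on tabs and casting each field to its declared type. The helper name load_records and the inline sample lines are illustrative assumptions; Pig itself evaluates the statement lazily and, in MapReduce mode, across the cluster.

```python
# Plain-Python analogy for:
#   records = LOAD 'pigsample.txt' AS (name:chararray, spl:chararray, cgpa:float);
# load_records is a hypothetical helper, not part of Pig.

def load_records(lines):
    """Split each tab-separated line and cast the fields per the declared schema."""
    records = []
    for line in lines:
        name, spl, cgpa = line.rstrip("\n").split("\t")
        records.append((name, spl, float(cgpa)))  # chararray -> str, float -> float
    return records

# A few lines in the same shape as pigsample.txt (tab-separated).
sample_lines = ["raghu\tece\t9\n", "kumar\tcse\t8.5\n", "ramu\tece\t-8.3\n"]
records = load_records(sample_lines)
print(records)
```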
Statement : 2
filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;

This keeps only the records whose cgpa is greater than 0 and less than 10, dropping the negative value and the value above 10.
grunt> DUMP filtered_records;
(raghu,ece,9.0F)
(kumar,cse,8.5F)
(biju,ece,8.0F)
(mukul,cse,8.6F)
(ashish,ece,7.0F)
(subha,cse,8.3F)
(budania,ece,5.4F)
grunt> DESCRIBE filtered_records;
filtered_records: {name: chararray,spl: chararray,cgpa: float}
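The effect of the FILTER statement can be sketched in plain Python as a list comprehension over the tuples. This is only an analogy to show which rows survive; the tuples below are copied from the DUMP records output above, and the variable names simply mirror the Pig aliases.

```python
# Plain-Python analogy for:
#   filtered_records = FILTER records BY cgpa > 0 AND cgpa < 10;
records = [
    ("raghu", "ece", 9.0), ("kumar", "cse", 8.5), ("biju", "ece", 8.0),
    ("mukul", "cse", 8.6), ("ashish", "ece", 7.0), ("subha", "cse", 8.3),
    ("ramu", "ece", -8.3), ("rahul", "cse", 11.4), ("budania", "ece", 5.4),
]

# Keep a tuple only when its cgpa (third field) is strictly between 0 and 10.
filtered_records = [r for r in records if 0 < r[2] < 10]

for r in filtered_records:
    print(r)
```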
Statement : 3

The third statement uses the GROUP operator to group the filtered_records relation by the specialization field (spl).

grouped_records = GROUP filtered_records BY spl;
grunt> DUMP grouped_records;
(cse,{(kumar,cse,8.5F),(mukul,cse,8.6F),(subha,cse,8.3F)})
(ece,{(raghu,ece,9.0F),(biju,ece,8.0F),(ashish,ece,7.0F),(budania,ece,5.4F)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {name: chararray,spl: chararray,cgpa: float}}
We now have two rows, or tuples, one for each specialization in the input data. The first field in each tuple is the field being grouped by (the specialization), and the second field is a bag of tuples for that specialization. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.
By grouping the data in this way, we have created a row per specialization, so now all that remains is to find the maximum cgpa for the tuples in each bag.
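The shape of the grouped relation can be pictured in plain Python as a mapping from each group key to a list of the original tuples. This is an analogy only: a Pig bag is unordered, while the Python lists here happen to preserve input order. The variable names mirror the Pig aliases.

```python
# Plain-Python analogy for:
#   grouped_records = GROUP filtered_records BY spl;
from collections import defaultdict

filtered_records = [
    ("raghu", "ece", 9.0), ("kumar", "cse", 8.5), ("biju", "ece", 8.0),
    ("mukul", "cse", 8.6), ("ashish", "ece", 7.0), ("subha", "cse", 8.3),
    ("budania", "ece", 5.4),
]

# One entry per specialization; each value plays the role of the bag of tuples.
grouped_records = defaultdict(list)
for record in filtered_records:
    grouped_records[record[1]].append(record)

for group, bag in grouped_records.items():
    print(group, bag)
```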

Statement : 4


max_cgpa = FOREACH grouped_records GENERATE group,
MAX(filtered_records.cgpa);
FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the specialization. The second field is a little more complex.

The filtered_records.cgpa reference is to the cgpa field of the filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum cgpa for the fields in each filtered_records bag.

grunt> DUMP max_cgpa;
(cse,8.6F)
(ece,9.0F)
grunt> DESCRIBE max_cgpa;
max_cgpa : {group: chararray,float}
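In plain Python, the FOREACH ... GENERATE step with MAX corresponds roughly to producing one output row per group, where the second field is the maximum of the cgpa values projected out of that group's bag. This is a sketch of the semantics only, not of how Pig executes it; the dictionary below stands in for the grouped_records relation.

```python
# Plain-Python analogy for:
#   max_cgpa = FOREACH grouped_records GENERATE group, MAX(filtered_records.cgpa);
grouped_records = {
    "cse": [("kumar", "cse", 8.5), ("mukul", "cse", 8.6), ("subha", "cse", 8.3)],
    "ece": [("raghu", "ece", 9.0), ("biju", "ece", 8.0),
            ("ashish", "ece", 7.0), ("budania", "ece", 5.4)],
}

# For each group, project the cgpa field out of the bag and take its maximum.
max_cgpa = [
    (group, max(record[2] for record in bag))
    for group, bag in grouped_records.items()
]

for row in max_cgpa:
    print(row)
```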
Statement : 5

STORE max_cgpa INTO 'output/cgpa_out';

This command writes the output of the script to a file (local or HDFS) instead of printing it on the console.
We’ve successfully calculated the maximum cgpa for each specialization.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise sample dataset. For example, ILLUSTRATE max_cgpa produces:


--------------------------------------------------------------------
| records     | name: bytearray | spl: bytearray | cgpa: bytearray | 
--------------------------------------------------------------------
|             | kumar           | cse            | 8.5             | 
|             | mukul           | cse            | 8.6             | 
|             | ramu            | ece            | -8.3            | 
--------------------------------------------------------------------
----------------------------------------------------------------
| records     | name: chararray | spl: chararray | cgpa: float | 
----------------------------------------------------------------
|             | kumar           | cse            | 8.5         | 
|             | mukul           | cse            | 8.6         | 
|             | ramu            | ece            | -8.3        | 
----------------------------------------------------------------
-------------------------------------------------------------------------
| filtered_records     | name: chararray | spl: chararray | cgpa: float | 
-------------------------------------------------------------------------
|                      | kumar           | cse            | 8.5         | 
|                      | mukul           | cse            | 8.6         | 
-------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| grouped_records     | group: chararray | filtered_records: bag({name: chararray,spl: chararray,cgpa: float}) | 
----------------------------------------------------------------------------------------------------------------
|                     | cse              | {(kumar, cse, 8.5), (mukul, cse, 8.6)}                              | 
----------------------------------------------------------------------------------------------------------------
-------------------------------------------
| max_cgpa     | group: chararray | float | 
-------------------------------------------
|              | cse              | 8.6   | 
-------------------------------------------
EXPLAIN max_cgpa;
Use the above command to see the logical and physical plans created by Pig.
