Saturday 5 December 2015

Getting started with Apache Pig - Tutorial 1

What is Pig? You might think I am crazy for asking this question and that it is kindergarten stuff. But believe me, it is not.

Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (extract, transform and load), ad hoc data analysis and iterative processing can be achieved easily.
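To give a feel for the language before we dive in, here is a minimal, hypothetical Pig Latin flow; the path, schema and field names below are made up purely for illustration:

    -- load tab-separated log records and name the two fields we care about
    logs = LOAD '/data/logs' USING PigStorage('\t') AS (level:chararray, msg:chararray);
    -- keep only the error records
    errors = FILTER logs BY level == 'ERROR';
    -- write the filtered records back out
    STORE errors INTO '/data/errors' USING PigStorage('\t');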

Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, writing Java MapReduce code was practically the only way to process the data stored on HDFS.

Pig was first built at Yahoo! and later became a top-level Apache project. In this series of posts we will walk through the different features of Pig using a sample dataset.


Problem statement: compute the highest runs by a player for each year.

Explanation of the dataset: the file has all the batting statistics from 1871 to 2011 and contains over 90,000 rows. After we get the highest runs, we will extend the script to translate the player ID field into the first and last name of the player.

Batting.csv - gives the player with the highest runs
Master.csv - gives his personal information (common element: player ID)
Steps :

1. Download the zip file and extract it to get the Master.csv and Batting.csv files. Get the zip file here.

2. Upload these two files into the File Browser in Hue.

3. Go to the Pig script console and give a title to your script.


4. Steps to write the Pig script (a sketch of the full script is given below):

i.   Load the data.
ii.  Filter out the first (header) row of the data.
iii. Use a FOREACH statement to iterate through the batting data object, generate the selected fields and assign them names. The new data object we are creating is named runs.
iv.  Use a GROUP statement to group the elements in runs by the year field; this creates a new data object, grp_data.
v.   Use another FOREACH statement to find the maximum runs for each year.
vi.  Now that we have the maximum runs, join this with the runs data object so we can pick up the player ID.
vii.  The result will be a dataset with "year", "playerId" and "max runs".
viii. At the end, use a DUMP statement to generate the output.

Please refer to the screenshot below for the code and results.
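In text form, here is a sketch of the full script that the steps above describe. It assumes the files were uploaded to your Hue home directory and uses the usual column positions in Batting.csv ($0 = playerID, $1 = yearID, $8 = runs); verify these against your copy of the file:

    batting = LOAD 'Batting.csv' USING PigStorage(',');
    -- drop the header row: its yearID field is not numeric, so the cast fails and the filter discards it
    raw_runs = FILTER batting BY $1 > 0;
    -- pick out the fields we need and assign them names
    runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
    -- group by year
    grp_data = GROUP runs BY (year);
    -- maximum runs for each year
    max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
    -- join back to runs on (year, runs) to recover the player ID behind each maximum
    join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
    join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
    -- write the result to the console
    DUMP join_data;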

We can check the progress of the initiated job in the job browser.


Once the job has completed successfully, we can see the results.

We have got the intended results.
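As mentioned earlier, the script can be extended to translate the player ID into a name by joining with Master.csv on the common player ID field. Here is a sketch of that extension, assuming the usual Master.csv column positions ($0 = playerID, $13 = nameFirst, $14 = nameLast; again, verify against your copy):

    -- load the personal information file
    master = LOAD 'Master.csv' USING PigStorage(',');
    names = FOREACH master GENERATE $0 AS playerID, $13 AS nameFirst, $14 AS nameLast;
    -- join the yearly maxima with the names using the common playerID field
    named = JOIN join_data BY playerID, names BY playerID;
    result = FOREACH named GENERATE join_data::year, names::nameFirst, names::nameLast, join_data::runs;
    DUMP result;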


Check the query history for confirmation that the run succeeded.

I hope this blog was helpful and gave you a good feel for Apache Pig.
