Friday, 18 December 2015

Getting started with Hive - basic tutorial

Let us start this blog tutorial by understanding what Hive is.


Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect, called Hive Query Language (HQL), for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn't fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL's dialect, but with significant differences: Hive offers no support for row-level inserts, updates, and deletes, and it doesn't support transactions, so we can't compare it directly with an RDBMS.

Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing of data: log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, hypothesis testing and so on.
Hive is not designed for online transaction processing and does not offer real-time queries.

This tutorial is prepared for professionals who want to get started with Big Data Analytics using the Hadoop framework.

Getting Started
We will be using the same data that we used in our Pig tutorial, namely the files Batting.csv and Master.csv.
Data Source
Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login Id : Hue
Pwd : 1111
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload files into this directory are available in my previous blogs.
Once the files are uploaded, they should look like this:





Queries
The first step is to create a temp_batting table, with a single string column that will hold each raw line of the file, using the query below:
create table temp_batting (col_value STRING);



To see whether the query has executed correctly, click on Tables and you should see your temp_batting table. Tick the checkbox next to it and click on View, as per the screenshot below.
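If you prefer the query editor over the UI, the same check can be done with a couple of standard statements:

SHOW TABLES;
DESCRIBE temp_batting;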



Loading Batting.csv into temp_batting
Next we load the baseball statistics into our temp table with the LOAD DATA statement below. Once the data is loaded, we check that the rows have been recorded in temp_batting with the SELECT statement that follows it.

LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;

SELECT * FROM temp_batting LIMIT 100;
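If you want more than a visual spot check, a row count is another quick way to confirm the load; the file has over 90,000 rows (see the dataset description in the Pig tutorial below), so expect a figure in that region:

SELECT COUNT(*) FROM temp_batting;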




Now we create a new table called batting, which will contain three columns: player_id, year, and runs (the number of runs).
create table batting (player_id STRING, year INT, runs INT);



Table Batting columns


Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query. The first line overwrites whatever is already in the batting table; the next three lines extract the player_id, year and runs fields from temp_batting.
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
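If the pattern looks cryptic: '^(?:([^,]*)\,?){n}' matches n comma-separated fields from the start of the line, and the capturing group keeps the last of them, i.e. the n-th column. A quick throwaway query is enough to see it in action:

SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) AS second_field
FROM temp_batting
LIMIT 5;

This should print the second field of each line, which is the year.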
Once the query is executed, we can see the job status by entering the address 127.0.0.1:8088 in the web browser; you should get a status page like the one below.
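With the batting table populated, you can already start asking questions of it. For example, a query along these lines gives the highest run total for each year:

SELECT year, max(runs) FROM batting GROUP BY year;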

I hope this tutorial helps you run the HQL for calculating baseball scores.




Saturday, 5 December 2015

Getting started with Apache Pig - Tutorial 1

What is Pig? You might think that I am crazy for asking this question, that it is kindergarten stuff. But believe me, it is not.

Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (extract, transform and load), ad hoc data analysis and iterative processing can be achieved easily.

Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, Java was the only way to process the data stored on HDFS.

Pig was first built at Yahoo! and later became a top-level Apache project. In this series we will walk through the different features of Pig using a sample dataset.


Problem statement : Compute the highest runs by a player for each year.

Explanation of dataset : The file has all the batting statistics from 1871 to 2011 and contains over 90,000 rows. We will extend the script to translate a player id field into the first and last name of the player after we get the highest runs.

Batting.csv - will give us the player with the highest runs
Master.csv - will give his personal information (common element : player id)
Steps :

1. Download a zip file and extract it to get the Master.csv and Batting.csv files. Get the zip file here.





2. Upload these two files into the File Browser of Hue.





3. Go to the Pig script console and give a title to your script.


4. Steps to write the pig script :

i.   Load the data.
ii.  Filter out the first (header) row of the data.
iii. Use a FOREACH statement to iterate through the batting data object, generate the selected fields and assign them names. The new data object we are creating is named runs.
iv.  Use a GROUP statement to group the elements in runs by the year field. This creates a new data object, grp_data.
v.   Use another FOREACH statement to find the maximum runs for each year.
vi.  Now that we have the maximum runs, join this with the runs data object so we can pick up the player id.
vii. The result will be a dataset with "year", "playerId" and "max runs".
viii. At the end, use a DUMP statement to generate the output.

Please refer to the screenshot below for the code and results.
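If you would rather type the script than read it off the screenshot, a sketch along the lines of the steps above is shown below. The column positions ($0 for player id, $1 for year, $8 for runs) and the relative path to Batting.csv are assumptions; verify them against your own upload.

-- load the raw comma-separated file
batting = LOAD 'Batting.csv' USING PigStorage(',');
-- drop the header row: its year field is not numeric, so this comparison filters it out
raw_runs = FILTER batting BY $1 > 0;
-- pick out player id, year and runs, and give them names
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
-- group the runs relation by year
grp_data = GROUP runs BY (year);
-- maximum runs for each year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
-- join back with runs on (year, runs) to recover the player id
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
-- print the result
DUMP join_data;

The output should be the (year, playerId, max runs) triples described in step vii.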




We can check the progress of the job in the Job Browser.


The job has completed successfully and we can see the results.



We have got the intended results.


Check the query history for confirmation that the run was successful.




I hope this blog was helpful to you and that you got a good feel for Apache Pig.