Friday 18 December 2015

Getting started with Hive - basic tutorial

Let us start this blog tutorial with knowing what is Hive.


Hive is a data warehousing package/infrastructure built on top of Hadoop.It provides an SQL dialect, called Hive Query Language(HQL) for querying data stored in a Hadoop cluster.HQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s dialect, but with significant differences. Hive offers no support for rowlevel inserts, updates, and deletes. Hive doesn’t support transactions.So we can't compare it with RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing data like: Log processing, Text mining, Document indexing, Customer-facing business intelligence, Predictive modeling, hypothesis testing etc.
Hive is not designed for online transaction processing and does not offer real-time queries.

This tutorial is prepared for professionals who want to get started with Big Data Analytics using Hadoop Framework.

Getting Started
We will be using the same data that we used in our pig tutorial. Namely, files batting.csv and master.csv.
Data Source
Accessing Hue
You can access HUE from the entering the address 127.0.0.1:8000
Login Id : Hue
Pwd : 1111
Uploading Data
Data is uploaded into the directory user/hue from the HDFS file system. The steps to upload the files into this directory are available on my previous blogs.
Once the files are uploaded they should look like this





Queries
The 1st step we do is create a temp_batting table with the query
create table temp_batting (col_value STRING);



To see if the code has executed correctly you can click on Tables and then you can see you temp_batting table. Tick the checkbox next to it and click on View as per the screenshot below.



Loading Batting.csv into temp_batting
Next we load the baseball scores into our temp table with the below code. Once the data is loaded we check if the samples are recorded in temp_batting table with the below command.

LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;

SELECT * FROM temp_batting LIMIT 100;




Now we create a new table called BATTING  which will contain 3 columns namely (player_id, year and number of runs)
create table batting (player_id STRING, year INT, runs INT);



Table Batting columns


Now we extract the data we want from temp_batting and copy it into batting.  We do it with a regexp pattern and build a multi line query.
The 1st line will overwrite the blank data into the batting table. Next 3 lines will extract player_id, year and runs fields form the temp_batting table.
insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1)player_id, regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year, regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run from temp_batting;
Once the query is executed we can see the job status by entering the below address in your web browser 127.0.0.1:8088 and you can get a status as below.

Hope this tutorial helps you in running the HQL for calculating baseball scores.




1 comment: