Learn SAS by examples: Getting started with Hive

Let us start this blog tutorial with knowing what is Hive.

Hive is a data warehousing package/infrastructure built on top of Hadoop.It provides an SQL dialect, called Hive Query Language(HQL) for querying data stored in a Hadoop cluster.HQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s dialect, but with significant differences. Hive offers no support for rowlevel inserts, updates, and deletes. Hive doesn’t support transactions.So we can't compare it with RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing data like: Log processing, Text mining, Document indexing, Customer-facing business intelligence, Predictive modeling, hypothesis testing etc.

Hive is not designed for online transaction processing and does not offer real-time queries.

This tutorial is prepared for professionals who want to get started with Big Data Analytics using Hadoop Framework.

Getting Started

We will be using the same data that we used in our pig tutorial. Namely, files batting.csv and master.csv.

Data Source

http://seanlahman.com/files/database/lahman591-csv.zip

Accessing Hue

You can access HUE from the entering the address 127.0.0.1:8000

Pwd : 1111

Uploading Data

Data is uploaded into the directory user/hue from the HDFS file system. The steps to upload the files into this directory are available on my previous blogs.

Once the files are uploaded they should look like this

Queries

The 1st step we do is create a temp_batting table with the query

create table temp_batting (col_value STRING);

To see if the code has executed correctly you can click on Tables and then you can see you temp_batting table. Tick the checkbox next to it and click on View as per the screenshot below.

Loading Batting.csv into temp_batting

Next we load the baseball scores into our temp table with the below code. Once the data is loaded we check if the samples are recorded in temp_batting table with the below command.

LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;

SELECT * FROM temp_batting LIMIT 100;

Now we create a new table called BATTING which will contain 3 columns namely (player_id, year and number of runs)

create table batting (player_id STRING, year INT, runs INT);

Table Batting columns

Now we extract the data we want from temp_batting and copy it into batting. We do it with a regexp pattern and build a multi line query.

The 1st line will overwrite the blank data into the batting table. Next 3 lines will extract player_id, year and runs fields form the temp_batting table.

insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1)player_id, regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year, regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run from temp_batting;

Once the query is executed we can see the job status by entering the below address in your web browser 127.0.0.1:8088 and you can get a status as below.

Hope this tutorial helps you in running the HQL for calculating baseball scores.

Learn SAS by examples

Friday, 18 December 2015

Getting started with Hive - basic tutorial

1 comment: