Friday 18 December 2015

Getting started with Hive - basic tutorial

Let us start this tutorial by understanding what Hive is.


Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect, called Hive Query Language (HQL), for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn't fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL's dialect, but with significant differences. Hive offers no support for row-level inserts, updates, and deletes, and it doesn't support transactions, so we can't compare it directly with an RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing of data: log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, hypothesis testing and so on.
Hive is not designed for online transaction processing and does not offer real-time queries.

This tutorial is prepared for professionals who want to get started with Big Data Analytics using Hadoop Framework.

Getting Started
We will be using the same data that we used in our Pig tutorial, namely the files Batting.csv and Master.csv.
Data Source
Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login Id : Hue
Pwd : 1111
Uploading Data
Data is uploaded into the directory /user/hue in the HDFS file system. The steps to upload the files into this directory are available in my previous blogs.
Once the files are uploaded they should look like this





Queries
The first step is to create a temp_batting table with the query:
create table temp_batting (col_value STRING);



To see whether the code has executed correctly, click on Tables and you will see your temp_batting table. Tick the checkbox next to it and click on View, as per the screenshot below.



Loading Batting.csv into temp_batting
Next we load the baseball scores into our temp table with the code below. Once the data is loaded, we check that sample records are present in the temp_batting table with the SELECT command that follows.

LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;

SELECT * FROM temp_batting LIMIT 100;




Now we create a new table called batting, which will contain three columns: player_id, year, and runs (the number of runs).
create table batting (player_id STRING, year INT, runs INT);



Table Batting columns


Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query.
The first line overwrites the (currently empty) batting table; the next three lines extract the player_id, year and runs fields from the temp_batting table.
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
Once the query is executed, we can see the job status by entering the address 127.0.0.1:8088 in your web browser; you will see a status screen like the one below.
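As a quick check, a simple aggregate query along these lines can be run on the new batting table to find the highest run total recorded in each year; the table and column names follow the ones created above.

-- follow-up query on the batting table created above
SELECT year, max(runs) AS max_runs
FROM batting
GROUP BY year;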

Hope this tutorial helps you in running the HQL for calculating baseball scores.




Saturday 5 December 2015

Getting started with Apache Pig - Tutorial 1

What is Pig? You might think I am crazy for asking this question; it sounds like kindergarten stuff.
But believe me, it is not.

Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (extract, transform and load), ad-hoc data analysis and iterative processing can be easily achieved.

Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, Java was the only way to process the data stored on HDFS.

Pig was first built at Yahoo! and later became a top-level Apache project. In this series we will walk through the different features of Pig using a sample dataset.


Problem statement : Compute the highest runs by a player for each year.

Explanation of dataset : The file has all the batting statistics from 1871 to 2011 and contains over 90,000 rows.
We will extend the script to translate a player id field into the first and last name of the player after we get the highest runs.

Batting.csv - will give me the player with the highest runs
Master.csv - will give his personal information (common element : player id)
Steps :

1. Download the zip file and extract it to get the Master.csv and Batting.csv files. Get the zip file here.





2. Upload these two files into the file browser of Hue.





3. Go to the Pig script console and give a title to your script.


4. Steps to write the Pig script :

i.   Load the data.
ii.  Filter out the first (header) row of the data.
iii. Use a FOREACH statement to iterate through the batting data object, generate the selected fields and assign them names. The new data object we are creating is named runs.
iv.  Use a GROUP statement to group the elements in runs by the year field; this creates the new data object grp_data.
v.   Use another FOREACH statement to find the maximum runs for each year.
vi.  Now that we have the maximum runs, join this with the runs data object so we can pick up the player id.
vii. The result will be a dataset with "year", "playerId" and "max runs".
viii. Finally, use a DUMP statement to generate the output.

Please refer to the screenshot below for the code and results.
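In case the screenshot is hard to read, here is a minimal sketch of what such a script could look like; the file path, the field positions and the header filter ($1 > 0) are assumptions based on the Batting.csv layout used in this series.

-- minimal sketch; assumes Batting.csv sits in the current user's HDFS directory
batting  = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1 > 0;  -- drops the header row
runs     = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data    = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;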




We can check the progress of the job in the Job Browser.


The job completes successfully and we can see the results.



We have got the intended results.


Check the query history to get successful confirmation.




I hope this blog was helpful to you and that you got a good feel for Apache Pig.

Saturday 21 November 2015

Hadoop - Word counting using Map-Reduce technique


MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This blog explains the nature of this programming model and how it can be used to write programs and run them in the Hadoop environment.

This blog is intended to give budding MapReduce developers a start in developing and running a Hadoop-based application. It covers some tricks of Hadoop MapReduce programming. The code samples provided here have been tested in a Hadoop environment, but do let me know if you find anything that does not work.

Word count is the typical example with which Hadoop MapReduce developers get their hands dirty. This sample MapReduce job counts the number of occurrences of each word in the provided input files.

What are the minimum requirements?

1. Input text files – any text file
2. A VM
3. The mapper, reducer and driver classes to process the input files

 How it works

The word count operation takes place in two stages: a mapper phase and a reducer phase. In the mapper phase the text is first tokenized into words, then we form a key-value pair from these words, with the key being the word itself and the value '1'. For example, consider the sentence
“tring tring the phone rings”
In map phase the sentence would be split as words and form the initial key value pair as
<tring,1>
<tring,1>
<the,1>
<phone,1>
<rings,1>

In the reduce phase the keys are grouped together and the values for similar keys are added. Here there is only one pair of similar keys, 'tring'; the values for these keys are added, so the output key-value pairs would be
<tring,2>
<the,1>
<phone,1>
<rings,1>
This gives the number of occurrences of each word in the input. Thus reduce forms an aggregation phase for the keys.

The point to be noted here is that the mapper first executes completely on the entire data set, splitting the words and forming the initial key-value pairs; only after this entire process is completed does the reducer start. Say we have a total of 10 lines in our input files combined: first the 10 lines are tokenized and key-value pairs are formed in parallel, and only after this does the aggregation/reducer start its operation.

Let us define some goals for this blog.

Goals for this Module:

- Installing and accessing Hadoop
- Java programs for word count
- Compilation and execution of Java programs



Installing Hadoop
The Hadoop version used in this example is the HortonWorks Sandbox with HDP 2.3. It can be run using Oracle VM VirtualBox. Just to let you know, you need at least 6 GB of RAM for it to run decently. Once you start the virtual machine, it should take about 2-3 minutes for it to be ready to use.
Accessing Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. Once at this page, you need to access the Secure Shell client at 127.0.0.1:4200. The screenshot below will give you an idea of whether you are on the right path.

After logging into the SSH client you will need to give the login details.
You will need to make a directory WCclasses with the command
mkdir WCclasses
in the root directory. Then use the vi editor to create the 3 Java programs: WordMapper.java, SumReducer.java and WordCount.java. Below are the screenshots of the 3 programs to help you further.

Java Programs

WordMapper.java
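A minimal sketch of what WordMapper.java could look like, assuming the standard Hadoop 2.x MapReduce API; the actual program is shown in the screenshot.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every token in each input line
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}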



WordCount.java
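A minimal sketch of what the WordCount.java driver could look like; it simply wires WordMapper and SumReducer together and takes the input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: configures the word-count job and submits it to the cluster
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line,
        // e.g. /user/ru1/wc-inp and /user/ru1/wc-out41
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}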


SumReducer.java
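A minimal sketch of what SumReducer.java could look like; it adds up the 1s emitted by the mapper for each word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word key
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable total = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}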




After all 3 programs are in your root directory, you need to compile them using the javac command. The class files will go into the WCclasses directory that was made earlier.
Compilation and Execution

The commands to compile the Java files are:
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar: -d WCclasses WordCount.java
This will compile all 3 java files and keep the compiled class files in the directory WCclasses.
The command below packages the compiled class files from WCclasses into a JAR file, WordCount.jar.
Creation of JAR File

jar -cvf WordCount.jar -C WCclasses/ .
Once the jar file is created, you need to create the input directory in the HDFS file system using the commands below.

hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp


Loading files into HUE

Input of the txt files to be read into hue
Now you need to access Hue at 127.0.0.1:8000 to upload the txt files to be read. Thankfully, this is drag and drop and does not involve writing any commands.




Once the files are uploaded, you use the command below to run the final Hadoop job.
Final Execution

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out41
Notice that we have not created wc-out41 ourselves. Hadoop will create the output directory by itself once the command is run.

You can track your job at the address 127.0.0.1:8088, which lists the log of all jobs and their status.
Once the job shows as Finished & Succeeded, we are on our way.


Directory for Checking final Output

We go to the directory that was created during execution of the program, /user/ru1/wc-out41.
The file to look at is part-r-00000, which contains the output.

OUTPUT

And that’s it, you can have a look at the output below. The file can be accessed from the wc-out41 directory in HUE.







We have successfully counted the words in a text file. I hope this gives you a head start with Hadoop. I will come up with new posts in my upcoming blogs. All the best with your Hadoop journey.

Friday 2 October 2015

Learn SAS by example: Chapter 16 - 20 solutions

In the last blog we saw solutions to problems from chapters 7-15. This blog is a continuation of the last one and deals with some problems from chapters 16-20.

Let us start with a couple of problems from chapter 16.


*16-1

Problem :

Using the SAS data set College, compute the mean, median, minimum, and
maximum and the number of both missing and non-missing values for the variables
ClassRank and GPA. Report the statistics to two decimal places


Solution explanation : Used PROC MEANS to calculate the above-stated descriptive statistics.
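The code itself is in the screenshot; a minimal sketch of a possible solution is shown below, assuming the College data set lives in a libref called learn.

* minimal sketch; the libref learn is an assumption ;
proc means data=learn.college n nmiss mean median min max maxdec=2;
   var ClassRank GPA;
run;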


Output :




*16-5

Problem :

Using the SAS data set College, report the mean GPA for the following categories of
ClassRank: 0–50 = bottom half, 51–74 = 3rd quartile, and 75 to 100 = top
quarter. Do this by creating an appropriate format. Do not use a DATA step.


Solution explanation : Used PROC FORMAT to define the rank categories and then used a FORMAT statement in PROC MEANS to report the mean GPA by category.
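A minimal sketch of a possible solution follows; the libref learn and the format name rankfmt are assumptions.

* minimal sketch; libref and format name are assumptions ;
proc format;
   value rankfmt 0-50   = 'Bottom half'
                 51-74  = '3rd quartile'
                 75-100 = 'Top quarter';
run;

proc means data=learn.college mean maxdec=2;
   class ClassRank;
   var GPA;
   format ClassRank rankfmt.;
run;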









*17-7


Problem :

Using the SAS data set Blood, produce a table of frequencies for BloodType, in
frequency order.

Solution explanation : Used PROC FREQ with the ORDER=FREQ option to list the records in descending order of frequency.
The TABLES BloodType statement keeps the blood type attribute, with no cumulative frequency or percentage listed.
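A minimal sketch of a possible solution, assuming the Blood data set lives in the learn libref:

* minimal sketch; the libref learn is an assumption ;
proc freq data=learn.blood order=freq;
   tables BloodType / nocum nopercent;
run;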



Output:


*18-7

Problem :

Produce the following table. Note that the keywords MIN and MAX have been
renamed and the two variables ClassRank and SchoolSize now have labels. A
procedure option was used to remove the horizontal table lines.

Solution explanation : Used PROC TABULATE to perform the desired action. Options on the TABLE statement rename the MIN and MAX keywords, labels give the variables their new names, and a procedure option removes the horizontal table lines.
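The screenshot holds the actual code; a rough sketch of one possible approach is shown below. The libref, the exact labels and the use of the NOSEPS option to suppress the horizontal lines are assumptions.

* rough sketch; libref, labels and the NOSEPS option are assumptions ;
proc tabulate data=learn.college noseps;
   var ClassRank SchoolSize;
   table ClassRank='Class Rank' SchoolSize='School Size',
         min='Minimum' max='Maximum';
run;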









*18-9

Problem :

Produce the following table. Note that the ALL column has been renamed Total,
COLPCTN has been renamed Percent, Gender has been formatted, and the order of
the columns is Yes followed by No.

Solution explanation : First, used PROC FORMAT to assign a format to the Gender attribute.
Then, using PROC SORT, sorted the records by Scholarship in descending order.
Finally, used PROC TABULATE with the procedure option ORDER=DATA; this is how we place the Yes values before the No values.
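Again, only the screenshot held the code, so here is a rough sketch of the approach described above; the libref, the format values and the exact table layout are assumptions.

* rough sketch; libref, format values and table layout are assumptions ;
proc format;
   value $gendfmt 'M' = 'Male'
                  'F' = 'Female';
run;

proc sort data=learn.college out=college_srt;
   by descending Scholarship;
run;

proc tabulate data=college_srt order=data;
   class Gender Scholarship;
   table Gender*(n colpctn='Percent'), Scholarship all='Total';
   format Gender $gendfmt.;
run;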



Output :



*19-3

Problem :

Run the same two procedures shown in Problem 1, except create a contents file, a
body file, and a frame file.


Solution explanation : Used ODS HTML to send the output to a body file, a contents file and a frame file,
and then ran the procedures to list some descriptive statistics.
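A rough sketch of the ODS wrapper is shown below; the file names and the two procedures (Problem 1's exact procedures are not reproduced in this post) are placeholders.

* rough sketch; file names and the two procedures are placeholders ;
ods html body     = 'body.html'
         contents = 'contents.html'
         frame    = 'frame.html';

proc means data=learn.college n mean maxdec=2;
   var ClassRank GPA;
run;

proc print data=learn.college(obs=10);
run;

ods html close;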






*20-7

Problem :

Using the SAS data set College, produce a vertical bar chart where the mean GPA is
shown for each value of school size (SchoolSize). Remember to include a
FMTSEARCH option, use the system option NOFMTERR, or write a format of your
own.

Solution explanation : Used PROC SGPLOT to get a VBAR of the mean GPA for each school size.
The STAT=MEAN option gives the mean, and the FILLATTRS option selects the color.
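A rough sketch of such a call is shown below; the libref, the NOFMTERR handling and the fill color are assumptions.

* rough sketch; libref, NOFMTERR handling and color are assumptions ;
options nofmterr;

proc sgplot data=learn.college;
   vbar SchoolSize / response=GPA stat=mean fillattrs=(color=lightblue);
run;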





*20-9

Problem :

Using the SAS data set Stocks, produce a scatter plot of Price versus Date.

Solution explanation : Using PROC SGPLOT with a SCATTER statement, we achieved our objective.
The x and y variables are specified after the SCATTER keyword.
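A minimal sketch, assuming the Stocks data set is in the learn libref:

* minimal sketch; the libref learn is an assumption ;
proc sgplot data=learn.stocks;
   scatter x=Date y=Price;
run;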






I hope this blog has been a good learning experience. Thank you.